
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 10, OCTOBER 2004

Modulation-Scale Analysis for Content Identification

Somsak Sukittanon, Member, IEEE, Les E. Atlas, Fellow, IEEE, and James W. Pitton, Member, IEEE

Abstract—For nonstationary signal classification, e.g., speech or music, features are traditionally extracted from a time-shifted, yet short data window. For many applications, these short-term features do not efficiently capture or represent longer term signal variation. Partially motivated by human audition, we overcome the deficiencies of short-term features by employing modulation-scale analysis for long-term feature analysis. Our analysis, which uses time-frequency theory integrated with psychoacoustic results on modulation frequency perception, not only contains short-term information about the signals, but also provides long-term information representing patterns of time variation. This paper describes these features and their normalization. We demonstrate the effectiveness of our long-term features over conventional short-term features in content-based audio identification. A simulated study using a large data set, including nearly 10 000 songs and requiring over a billion audio pairwise comparisons, shows that modulation-scale features improve content identification accuracy substantially, especially when time and frequency distortions are imposed.

Index Terms—Audio fingerprinting, audio identification, audio retrieval, auditory classification, content identification, feature extraction, feature normalization, long-term features, modulation features, modulation scale, modulation spectrum, pattern recognition, short-term features, 2-D features.

I. INTRODUCTION

IN PATTERN recognition, the goal is to have a machine characterize events and make an appropriate decision about their pattern. Pattern recognition has been applied to many problems such as recognition, classification, or identification of a waveform or image. Short-term estimates, for example Fourier or autocorrelation coefficients, have been widely used for many signal types (e.g., [1]). As shown in the top panel of Fig. 1, the signal is basically blocked into successive frames using a small data analysis window. This blocking assumes stationarity of the signal within each frame. The windowed signal is then transformed into a new representation, e.g., the spectral domain, hopefully giving good discrimination and/or energy compaction. The length of a short-term analysis window can differ depending on the signal type. For example, in tool-wear classification [2], stationarity can be assumed within a quarter revolution of a tool. Since working tools have different sizes and rotation speeds, a quarter-revolution time window results in different time durations, 20–40 ms. In speech recognition, the typical short-term window used for speech analysis is about 20–30 ms.
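As a concrete illustration of the short-term scheme in the top panel of Fig. 1, the following minimal Python sketch blocks a signal into overlapping frames and maps each windowed frame to a power spectrum. The frame and hop lengths are illustrative assumptions for this sketch, not values prescribed by the paper.

```python
# Short-term feature extraction: frame blocking, windowing (stationarity is
# assumed within each frame), and a spectral transform of each frame.
import numpy as np

def short_term_features(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Return a (num_frames, num_bins) matrix of short-term power spectra."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        segment = x[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(segment)) ** 2)
    return np.array(frames)

fs = 16000
t = np.arange(fs) / fs                       # 1 s of audio
x = np.sin(2 * np.pi * 440 * t)              # stand-in for a real signal
print(short_term_features(x, fs).shape)      # -> (98, 201)
```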

Manuscript received February 17, 2004. This work was supported in part by the Washington Research Foundation. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ton A. C. M. Kalker.

S. Sukittanon and L. E. Atlas are with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195-2500 USA (e-mail: [email protected]; [email protected]).

J. W. Pitton is with the Applied Physics Laboratory, University of Washington, Seattle, WA 98105-6698 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSP.2004.833861

In music classification, the length of a short-term window is influenced by audio coding, where typically two processing window lengths are used [3]. For audio with a sampling rate of 48 kHz, windows of size 256 samples, or about 5 ms, and 2048 samples, or about 43 ms, are commonly applied. Unless specified otherwise, we will refer to a data window of length less than 50 ms as a “short-term window.”

One of the mathematical models for nonstationary signals (e.g., speech, music, or communication signals) used previously [4, p. 427] and considered in this paper is the product of a narrow-bandwidth, stationary lowpass modulating random process m(t) and a highpass carrier c(t), a deterministic function:

x(t) = m(t) c(t).    (1)

For this model to be accurate, m(t) is assumed to be nonnegative, and its bandwidth must not overlap with that of c(t). The above model has been applied to encode speech [5] and audio signals [6]. Motivated by this success in coding, this model can also be useful for pattern recognition. An important question is whether more traditional feature extraction, such as short-term spectral analysis, is adequate for extracting the pertinent information of this model. Since m(t) is a slowly varying signal, using too short an analysis window can be insufficient to model m(t).
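The product model (1) is easy to instantiate numerically. The sketch below builds a toy x(t) = m(t)c(t) with an assumed 4-Hz nonnegative modulator and a 1-kHz carrier (all values illustrative), and checks how little of one modulation cycle a 20-ms short-term window spans, which is why such a window cannot capture m(t).

```python
# Toy instance of the product model (1): x(t) = m(t) c(t).
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs                 # 2 s of signal
m = 1.0 + 0.8 * np.cos(2 * np.pi * 4 * t)  # nonnegative lowpass modulator, 4 Hz
c = np.cos(2 * np.pi * 1000 * t)           # deterministic highpass carrier, 1 kHz
x = m * c

# A 20-ms window covers only 8% of the 250-ms modulation period, so m(t)
# is nearly constant within any single short-term frame.
print(0.020 / (1 / 4.0))                   # -> 0.08
```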

Understanding the human perceptual system and integrating its relevant aspects contributes significantly to the understanding of this form of feature extraction. For example, in [7], to estimate energy at the most sensitive modulation frequency of human audition, about 4 Hz, an analysis window of at least 250 ms is needed. A concept related to the perceptual duration of the auditory system is that of a pre-perceptual auditory image [8]. This concept refers to the process where an auditory input produces an auditory image containing information about a stimulus. Because of the continuous change of the auditory input, a pre-perceptual auditory store is used to hold information about the stimulus so that it can be utilized later. Massaro [8] estimated the pre-perceptual auditory storage and processing time to be over 200 ms, which is again longer than a typical short-term analysis window. Finally, the sensitivity of short-term features to noise [9] and unseen testing data [2] is another deficiency.

To address the deficiencies of short-term feature analysis, we propose long-term feature analysis using modulation spectral analysis, as shown in the bottom panel of Fig. 1. The new long-term analysis, using an analysis window much longer than 50 ms, contains short-term information about the signal as well as long-term information representing patterns of time variation. To validate our approach, content-based audio identification is used for experimental verification. The simulated study shows that our approach improves recognition performance when distortions, in both time and frequency, occur.


Fig. 1. Block diagram contrasting our long-term feature analysis with conventional short-term feature analysis.

II. JOINT FREQUENCY ANALYSIS

A. Motivation

There is substantial evidence that important time-varying information is contained in the temporal modulation of signals. Early evidence came from the development of the channel vocoder [10]. The channel vocoder modeled speech production as a narrow-bandwidth lowpass process that modulated a higher bandwidth carrier. The results of the channel vocoder suggested that the important information for speech intelligibility was contained in low modulation frequencies, with bandwidth of about 25 Hz. The concept of predicting signal intelligibility, especially for speech signals, from modulation spectra also came from the concepts of the modulation transfer function and the speech transmission index [11]. These analyses can be used to determine the effect of multiple forms of acoustic distortion, e.g., noise or reverberation, and motivated research [12], [13] on modulation spectra for automatic speech recognition. It has been suggested that modulation frequencies in speech from 2 to 8 Hz reflect syllabic and phonetic temporal structure. In human perception, many studies, e.g., [14], [15], have demonstrated the ability of listeners to detect and discriminate multiple sources of sinusoidal amplitude modulation. Studies of auditory frequency selectivity for amplitude modulation showed that a log frequency scale, with resolution consistent with a constant Q over the entire range, best mimics human perception of modulation frequency [16]. Auditory modulation filters with frequencies up to 64 Hz are approximately constant-Q, with a Q value of about 1.

Taken together, this evidence suggests that modulation spectra with low bandwidth are important for signal intelligibility. How, then, can modulation spectra be estimated in a form useful for classification? We discuss two techniques related to the modulation spectrum in classification: 1) cyclostationary analysis [17] and 2) modulation spectrogram analysis [12]. These two analyses are based on a two-dimensional (2-D) representation of the acoustic and modulation frequencies, which we refer to as a joint frequency representation. Cyclostationary analysis provides a theoretical framework for estimating a precise joint frequency representation, but requires very long signal samples and a prior estimate of periodicity (the fundamental frequency). Unlike cyclostationary analysis, modulation spectrogram analysis does not require a prior estimate of periodicity. The modulation spectrum has been applied to the recognition of speech in reverberant environments [12]. Since modulation spectra are the result of several engineering approximations, a theoretical framework is lacking. Also, both cyclostationary and modulation spectra do not directly incorporate knowledge of human perception of modulation. To compensate for these disadvantages, there are several needs: First, the proposed analysis should be practical for unknown signals, that is, it should not require any prior information. Second, it should have a theoretical framework. One purpose of a theoretical framework is to understand how the representation changes when common signal modifications, such as channel distortions, occur. Third, the proposed analysis should include support from psychoacoustics research on modulation frequency perception. To satisfy these three desirable aspects, modulation-scale analysis is proposed in this work. In the next section, we discuss the theoretical framework that will be used for our method.

B. Background of Joint Frequency Analysis

One possible joint frequency representation, P(η, ω), is a transform in time of a demodulated short-time spectral estimate, where, for the purposes of this paper, ω and η are “acoustic frequency” and “modulation frequency,” respectively. A spectrogram, or another joint time-frequency representation [18], can be used as the starting point of this analysis. In this paper, we first use a spectrogram with an appropriately chosen window length to estimate a joint time-frequency representation of the signal, P(t, ω). Second, another transform (e.g., Fourier) is applied along the time dimension of the spectrogram to estimate P(η, ω).


Fig. 2. Joint frequency analysis for the speech utterance “the second”. (a) Spectrogram of the utterance shows high energy values in the 500- and 1500-Hz acoustic frequency subbands. (b) Joint frequency representation of this speech exhibits more compaction in the same acoustic frequency subbands.

Another way of viewing P(η, ω), as shown in (2), is as the convolution in ω and multiplication in η of the correlation function of the Fourier transform of the signal, X(ω)X*(ω − η), and that of the underlying data analysis window, H(ω)H*(ω − η):

P(η, ω) = (1/2π) [X(ω)X*(ω − η)] ∗_ω [H(ω)H*(ω − η)]    (2)

where ∗_ω denotes convolution in ω.

To illustrate the behavior of P(η, ω), an AM signal is used. Equation (3) describes a spectrogram of the AM signal x(t) = [1 + cos(η_m t)] cos(ω_c t), where η_m and ω_c are the modulation and carrier frequencies, respectively:

P(t, ω) ≈ (1/4) |H(ω − ω_c)|² [1 + cos(η_m t)]².    (3)

When a Fourier transform is applied along the time dimension of P(t, ω), it yields

P(η, ω) ≈ (π/4) |H(ω − ω_c)|² [3δ(η) + 2δ(η − η_m) + 2δ(η + η_m) + (1/2)δ(η − 2η_m) + (1/2)δ(η + 2η_m)].    (4)

Equation (4) results in the compaction of nonzero terms occurring at low η, η ∈ {0, ±η_m, ±2η_m}, which can be advantageous for coding and classification. When using a joint frequency representation as features in signal classification, only positive and low modulation frequencies are needed for estimating the low-dimensional features because of the symmetry property in modulation frequency and the small nonzero support region of P(η, ω). The modulation frequency range of interest in P(η, ω) can be determined by the assumed highest modulation frequency in the signal or the bandwidth of the chosen spectrogram window.
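One direct way to estimate P(η, ω) as described above is to compute a spectrogram and then apply a second Fourier transform along its time dimension. The sketch below does this for a synthetic AM tone; the window and hop sizes are assumptions made for illustration.

```python
# Joint frequency representation: spectrogram rows transformed over time.
import numpy as np
from scipy.signal import spectrogram

def joint_frequency(x, fs, win_ms=40.0, hop_ms=10.0):
    nperseg = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    f, t, S = spectrogram(x, fs, nperseg=nperseg, noverlap=nperseg - hop)
    P = np.abs(np.fft.rfft(S, axis=1))            # transform each row over time
    eta = np.fft.rfftfreq(S.shape[1], d=hop / fs) # modulation frequency axis
    return f, eta, P

# AM tone: 1-kHz carrier, 8-Hz modulation. Energy should concentrate in the
# 1-kHz row at modulation frequencies near 0, 8, and 16 Hz.
fs = 8000
t = np.arange(4 * fs) / fs
x = (1 + np.cos(2 * np.pi * 8 * t)) * np.cos(2 * np.pi * 1000 * t)
f, eta, P = joint_frequency(x, fs)
row = np.argmin(np.abs(f - 1000))
print(np.round(eta[np.argsort(P[row])[::-1][:4]], 1))  # dominant eta bins
```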

C. Interpretation for Speech and Music Signals

P(η, ω) values along η = 0 are the convolution of the spectra of the spectrogram window and the signal:

P(0, ω) = (1/2π) |X(ω)|² ∗ |H(ω)|².    (5)

The values of P(η, ω) lying along η = 0 are an averaged short-term spectral estimate of the signal. The length of the spectrogram window and the amount of overlap determine the tradeoffs between the bias and variance of the short-term spectral estimate. P(η, ω) at η = 0 is an estimate of stationary signal content, while P(η, ω) at η ≠ 0 is an estimate of the nonstationary information about the signal. This nonstationary information can represent various quantities (e.g., the symbol rate of a digital communications signal [19]). An example by a male speaker is illustrated in Fig. 2(a), which shows the spectrogram of the speaker saying “the second”. The high energy terms mostly occur in the 500- and 1500-Hz acoustic frequency subbands. The joint frequency representation of this signal, as illustrated in Fig. 2(b), shows the high-energy values concentrated at low η, 1–20 Hz, and high η, 120 Hz, in the same acoustic frequency subbands. Since we use an analysis window that is shorter than a normal pitch period, the nonzero terms at high η provide the pitch information of this speaker. If the analysis window were longer than a pitch period, the pitch information would be smoothed out. The nonzero terms occurring at low η reflect the syllabic and phonetic temporal structure of the speech [13].

For music signals, the periodicity in the signal causing nonzero terms in the joint frequency representation can result from the beat and rhythm. The beat is the approximately uniform spacing in time of the impulses driving the tempo in the music, while the rhythm can be a uniform pattern over a long period of time. The beat information is usually represented as the high-value terms in P(η, ω) at low η, ranging from 0.5 to 5 Hz, which corresponds to about 30 to 300 beats per minute (bpm). The fundamental frequency of an instrument such as a bass guitar instead contributes to the high-value terms occurring at high η.


Fig. 3. Joint frequency analysis for a music signal. (a) Spectrogram of a music signal containing both bass guitar and a drum shows uniformly spaced impulses. (b) Joint frequency representation of this signal exhibits high-value terms at very low, 2–3 Hz, and high, 45 Hz, modulation frequency.

Fig. 3(a) shows a spectrogram of a 2-s segment of pop music. We can see the high energy terms occurring at low acoustic frequency resulting from a drum beat. A repetitive pattern occurs about five times during 2 s, corresponding to a tempo of about 150 bpm. P(η, ω) in Fig. 3(b) reveals the tempo information at low η, about 2–3 Hz. P(η, ω) also shows the high energy values at 45 Hz in the η dimension, resulting from the pitch frequency of a bass guitar.

As demonstrated with these signals, joint frequency analysis has the potential to extract time-varying information via the nonzero terms in the representation. These nonzero terms are potentially useful for discriminating signal types; thus, they should be considered as useful features. However, using P(η, ω) directly for classification has an important disadvantage: P(η, ω) has an extremely large dimension compared to traditional short-term spectral estimates. Even though we can reduce the feature dimension due to the symmetry in frequency and the small nonzero support region of P(η, ω), the resulting dimensions are still too large for typical classifiers. Past research has addressed the problem of reducing the feature dimension of a 2-D representation in various ways. For example, many methods view a 2-D signal representation as an image. The nonzero terms lying in the representation can then be viewed as lines or objects, and a small set of descriptors invariant to translation, rotation, or scaling can be extracted. Since we are interested in tasks where human auditory signal classification is largely successful, integrating psychoacoustic results into the analysis can possibly provide added advantages in feature design and selection.

In this paper, we propose a new method for joint frequency analysis, called modulation-scale analysis. This proposed analysis uses the above time-frequency theory integrated with psychoacoustic results on modulation frequency perception to provide a foundation and subsequent joint frequency analysis for a classification system. In the next section, we discuss the theory of the proposed modulation-scale analysis and show how it is more appropriate for signal classification than the above P(η, ω).

III. MODULATION SCALE ANALYSIS

A. Theory

Using Fourier analysis for the modulation frequency transform in the above analysis results in a uniform frequency bandwidth in modulation frequency; however, this approach to modulation decomposition can be inefficient for auditory classification due to the resulting high dimensionality. Furthermore, the uniform bandwidth in modulation frequency is far from human auditory perception. Recent psychoacoustic results [16] suggest that a log frequency scale, with resolution consistent with a constant Q over the whole range, best mimics human perception of modulation frequency. Our approach uses a continuous wavelet transform (CWT) to efficiently approximate this constant-Q effect, though we can also achieve constant Q by directly smoothing the Fourier coefficients. The joint acoustic and modulation-scale representation, which we refer to as “modulation scale,” is the joint representation of acoustic frequency and modulation frequency with nonuniform bandwidth. We will show that our proposed analysis can provide much lower feature dimensionality, yet maintains good discrimination when applied to multicomponent signals. Modulation-scale analysis starts with a standard spectrogram

P(t, ω) = |∫ x(τ) h(τ − t) e^{−jωτ} dτ|².    (6)

For discrete scale a_i, the wavelet filter ψ is applied along each temporal row of the spectrogram output

P(t, ω; a_i) = (1/a_i) ∫ P(τ, ω) ψ((τ − t)/a_i) dτ.    (7)

The above equation can be viewed as filtering the temporal envelope in each acoustic frequency subband. The scaling term 1/a_i serves to normalize the passband magnitude of each filter, which is different from the standard CWT normalization, where 1/√a_i is used to normalize the energy of each filter. After computing P(t, ω; a_i), its energy is integrated across the wavelet translation axis to produce a joint frequency representation.


Fig. 4. Joint frequency analysis of an AM signal with two modulation frequency components. (a) Fourier modulation decomposition gives equal, uniform bandwidth in the modulation dimension. (b) The joint frequency representation of the multicomponent AM signal using Fourier modulation decomposition. (c) Wavelet modulation decomposition gives constant-Q bandwidth in the modulation dimension. (d) The joint frequency representation of the multicomponent AM signal using the CWT.

The result, P(a_i, ω), has nonuniform resolution in the modulation frequency dimension, as indexed by the discrete scale a_i:

P(a_i, ω) = ∫ |P(t, ω; a_i)|² dt.    (8)

To demonstrate the advantage of this constant-Q bandwidth, an AM signal with a fixed carrier frequency of 2756 Hz and two modulation rates, 3 and 24 Hz, was simulated. Fourier and constant-Q modulation frequency decompositions were used in the experiments. Ideally, the modulation frequency dimension should show separate regions representing the two distinct modulation frequencies. With the same total dimensionality, eight frequency subbands for a range of 256 Hz in modulation frequency, using the Fourier transform for modulation decomposition yields an equal 32-Hz bandwidth, as shown in Fig. 4(a), while the CWT modulation decomposition can yield a bandwidth of 16 Hz, as shown in Fig. 4(c). The Fourier modulation decomposition applied to the AM signal with two modulation rates cannot resolve these two modulation frequencies, as illustrated in Fig. 4(b). Instead, a constant-Q modulation decomposition, such as that achieved with the wavelet transform in Fig. 4(d), provides a better distribution of frequency resolution in modulation frequency by showing two distinct nonzero terms. Next, we discuss the method to reduce the sensitivity of modulation spectral features to distortions.
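The following sketch outlines the modulation-scale computation (6)–(8): spectrogram rows are filtered along time with dilated bandpass kernels at dyadic scales, each kernel normalized for approximately unit passband gain in the spirit of the 1/a_i term in (7), and the filtered energy is integrated as in (8). The Morlet-like kernel and all parameter values are assumptions made for illustration; the paper's wavelet filters may differ.

```python
# Modulation-scale analysis: dyadic bandpass filtering of spectrogram rows
# followed by energy integration over the translation axis.
import numpy as np
from scipy.signal import spectrogram

def morlet_like(a, frame_rate, base_freq=32.0, cycles=4.0):
    """Complex bandpass kernel centered at base_freq / a Hz (hypothetical)."""
    f0 = base_freq / a
    dur = cycles / f0
    t = np.arange(-dur, dur, 1.0 / frame_rate)
    w = np.exp(-0.5 * (t / (dur / 2)) ** 2) * np.exp(2j * np.pi * f0 * t)
    return w / np.sum(np.abs(w))          # passband, not energy, normalization

def modulation_scale(x, fs, num_scales=8, win_ms=40.0, hop_ms=10.0):
    nperseg = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    f, t, S = spectrogram(x, fs, nperseg=nperseg, noverlap=nperseg - hop)
    S = S - S.mean(axis=1, keepdims=True)     # remove dc of each envelope
    out = np.zeros((num_scales, len(f)))
    for i in range(num_scales):               # dyadic scales a_i = 2**i
        w = morlet_like(2.0 ** i, fs / hop)
        for j, row in enumerate(S):           # filter each temporal row, (7)
            out[i, j] = np.sum(np.abs(np.convolve(row, w, "same")) ** 2)  # (8)
    return f, out                             # (scale, acoustic frequency)
```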

B. Channel Compensation

Channel transmission effects that distort the signal, e.g., reverberation or linear filtering, and noise are common factors causing a change of signal characteristics between training and testing data. Accuracy usually degrades significantly if the environmental characteristics, e.g., transducers, noise, or acoustic environments, of the training and testing data are different. Past research has addressed this problem by applying feature normalization, e.g., [20], [21]. Feature normalization is applied after feature extraction and usually uses information from several feature samples to make both training and testing features insensitive to environmental characteristics. Cepstral mean subtraction (CMS) is an example of a method used for removing convolutional noise, e.g., linear time-invariant filtering, from the signal.

1) Cepstral Mean Subtraction: If the signal is distorted by convolutional noise, the distorted time signal y(t) can be modeled as the convolution between the clean signal x(t) and the channel h(t). Cepstral features are estimated from the logarithm of the estimated power spectrum of the signal; therefore, cepstra turn the multiplication of the frequency responses of the channel characteristic and the signal into a summation:

log |Y(ω)|² = log |X(ω)|² + log |H(ω)|².    (9)


For CMS, assuming the channel is linear and time invariant and the length of the impulse response is shorter than the data window, normalized cepstral features are obtained by modifying the cepstral feature in each feature dimension by subtracting its long-term average mean:

ĉ_t[n] = c_t[n] − (1/T) Σ_{t=1}^{T} c_t[n]    (10)

where c_t[n] is the nth cepstral coefficient at frame t and T is the number of frames used to estimate the mean.

It has been shown that CMS significantly improves accuracy when the channel distortions of the training and testing data differ by convolutional noise [21].
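A minimal sketch of CMS, combining (9) and (10): the logarithm turns a fixed channel into an additive constant in each cepstral dimension, and subtracting the long-term mean removes it. The spectra below are synthetic stand-ins.

```python
# Cepstral mean subtraction: a fixed convolutional channel cancels exactly.
import numpy as np

def cepstra(frames_power):
    """Real cepstra of a (num_frames, num_bins) power spectrogram."""
    return np.fft.irfft(np.log(frames_power + 1e-12), axis=1)

def cms(ceps):
    """Subtract the long-term average of each cepstral dimension, as in (10)."""
    return ceps - ceps.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random((100, 201)) + 0.1            # stand-in clean power spectra
h = np.linspace(0.5, 2.0, 201) ** 2         # fixed channel response |H(w)|^2
print(np.allclose(cms(cepstra(x * h)), cms(cepstra(x))))   # -> True
```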

2) Subband Normalized Modulation Spectrum: Previous work [22] using modulation spectral features in classification used mean and variance normalization for feature normalization. We will show that subband normalization is appropriate for channel compensation of modulation spectral features. The modulation-scale features along with this feature normalization are insensitive to convolutional noise, like CMS features, yet are also insensitive to other common distortions such as time distortion. First, we consider how channel distortion affects the joint frequency representation. The frequency response of the output of a linear time-invariant filter is modeled by the multiplication of the frequency responses of the input, X(ω), and channel, H(ω). If we assume, as with the CMS technique, that the channel is linear and time invariant over an analysis window, we can approximate the spectrogram of the output as the multiplication of the spectrogram of the input and the frequency response of the channel: P_y(t, ω) ≈ |H(ω)|² P_x(t, ω). Since the smoothed joint frequency representation is a transform in the time dimension, the smoothed joint frequency representation of the output can be written as

P_y(η, ω) ≈ |H(ω)|² P_x(η, ω).    (11)

The above equation can be viewed as the uniform amplification or attenuation of all modulation frequencies for the associated acoustic frequency. To remove this effect, subband normalization is proposed. By normalizing each acoustic frequency subband with the marginal of the modulation frequency representation, the normalized joint frequency representation P̄_y(η, ω) is independent of the channel:

P̄_y(η, ω) = P_y(η, ω) / ∫ P_y(η, ω) dη = P_x(η, ω) / ∫ P_x(η, ω) dη = P̄_x(η, ω).    (12)

For the modulation-scale representation, using (7) and (8), the modulation scale of the output, P_y(a_i, ω), can be approximated by

P_y(a_i, ω) ≈ |H(ω)|⁴ P_x(a_i, ω)    (13)

where the channel factor appears to the fourth power because the energy integration in (8) squares the |H(ω)|² scaling of each filtered row.

When the proposed channel compensation is applied to the modulation-scale representation, the result is approximately independent of the channel:

P̄_y(a_i, ω) = P_y(a_i, ω) / Σ_k P_y(a_k, ω) ≈ P̄_x(a_i, ω).    (14)
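Subband normalization itself is a one-line operation: each acoustic subband of the modulation-scale matrix is divided by its marginal over scales, so any fixed per-subband gain, such as the |H(ω)|⁴ factor in (13), cancels as in (14). A short sketch with stand-in data:

```python
# Subband normalization of a (num_scales, num_subbands) matrix, as in (14).
import numpy as np

def subband_normalize(P):
    return P / (P.sum(axis=0, keepdims=True) + 1e-12)

rng = np.random.default_rng(0)
P_clean = rng.random((8, 19))               # stand-in modulation-scale matrix
gain = rng.random(19) + 0.1                 # arbitrary per-subband channel gain
print(np.allclose(subband_normalize(P_clean * gain),
                  subband_normalize(P_clean)))              # -> True
```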

Besides linear time-invariant filtering, we also consider time distortions, specifically time scaling and shifting, in this paper. Time-scale modification can result from a change in the articulation rate of the speaker for speech signals, or from speeding up (or slowing down) the playback rate with pitch preservation for music signals. If the signal is distorted via a time-scale factor ρ, the result is a compression or expansion of the time signal without changing its acoustic frequency characteristics [20]. The spectrogram of a time-scale modified signal can be related to the spectrogram of the original signal by P_y(t, ω) = P_x(ρt, ω). Using (7), the effect of ρ after applying wavelet filters to the temporal rows of a spectrogram is a linear change of scale and translation: P_y(t, ω; a_i) = P_x(ρt, ω; ρa_i). A further effect of ρ, after integrating the energy across the wavelet translation dimension, is a linear change in scale and amplitude scaling by 1/ρ:

P_y(a_i, ω) = (1/ρ) P_x(ρa_i, ω).    (15)

However, we use dyadic scales, a_i = 2^i, in which an increase of the scale index by one corresponds to a doubling of the scale value from the adjacent lower scale. With the assumption of a small time-scale modification (ρ ≈ 1), the time-scaling effect can be approximated by

P_y(a_i, ω) ≈ (1/ρ) P_x(a_i, ω)    (16)

that is, a simple amplitude scaling of P_x(a_i, ω). Subsequent subband normalization can then compensate for the time-scaling effect:

P̄_y(a_i, ω) = P_y(a_i, ω) / Σ_k P_y(a_k, ω) ≈ [(1/ρ) P_x(a_i, ω)] / [(1/ρ) Σ_k P_x(a_k, ω)] = P̄_x(a_i, ω).    (17)

This result is also independent of ρ. Although a Fourier transform with large bandwidth is an alternative for time-scale modification compensation, the wavelet transform provides better management of the tradeoff in frequency resolution occurring with physical signals, as shown in Fig. 4.

Another time distortion is time misalignment. If the signal is shifted in time by t₀, the spectrogram of the time-shifted signal becomes P_y(t, ω) = P_x(t − t₀, ω). Using (7), the effect of t₀ is a shift in t after applying wavelet filters to the temporal rows of the spectrogram output: P_y(t, ω; a_i) = P_x(t − t₀, ω; a_i). If the time shift is relatively small compared to the integration time in (8), then the estimate is approximately invariant to the effect:

P_y(a_i, ω) ≈ P_x(a_i, ω).    (18)


Fig. 5. Comparison between clean (solid lines) and distorted (dotted lines) features with EQ applied. Cepstral energy features (a) without and (b) with channel compensation using mean subtraction. Modulation-scale features (c) without and (d) with channel compensation using subband normalization.

Compared to CMS: cepstral features are a representation of time and time-lag; therefore, CMS can only remove a slowly varying convolutional noise, not time distortions. Equation (19) illustrates the effect of time-scale modification and time shift on CMS features:

ĉ_y[n; t] = c_x[n; ρt − t₀] − c̄_x[n]    (19)

that is, mean subtraction removes the channel term, but the warp and shift of the time axis remain.

For the purpose of signal classification, we define a modulation-scale feature matrix A(i, j), where the modulation frequencies are estimated by wavelet filters, with i and j representing the dyadic scale and discrete acoustic frequency, respectively. To initially demonstrate the advantages and generalization of the proposed features and channel compensation, subband-normalized modulation-scale features and cepstral energy features with mean subtraction were compared using a 4-s clip of music. The distorted features were extracted from an audio passage modified with EQ, used as convolutional noise, and with time compression, used as time-scale modification noise. Cepstral energy features, the first cepstral feature dimension, c_t[0], were extracted using a 46-ms short-term window with a 10-ms frame rate. Modulation-scale features were extracted using eight dyadic wavelet filters and 19 bark filterbanks (more design details are addressed in the next section). Since A(i, j) is a 2-D feature, it was vectorized prior to comparison. For EQ, the results before feature normalization, both mean subtraction and subband normalization, as illustrated in Figs. 5(a) and 5(c), respectively, show a significant difference between clean (solid line) and frequency-distorted (dotted line) features. The goal after feature normalization is to obtain a similar representation of both clean and distorted features. As shown in Figs. 5(b) and 5(d), both CMS and subband normalization yield good results in normalizing convolutional noise, as was previously discussed. There was no significant difference between clean (solid line) and distorted (dotted line) features. The correlation between clean and distorted was 0.99 for both techniques. For time compression, as shown in Fig. 6, subband normalization still preserves the good results in Fig. 6(d); however, CMS shows a significant difference between clean and distorted features in Fig. 6(b) after channel compensation. For cepstral energy with mean subtraction, the correlation between the solid and dotted lines was reduced to 0.56, while normalized modulation-scale features still maintained a correlation of 0.99. From this example, we conclude that modulation-scale features with subband normalization show less sensitivity to channel distortions, both frequency EQ and time scaling, than CMS features.

IV. APPLICATION

A. Content-Based Audio Identification

In this section, we discuss the use of modulation-scale analysis in content-based audio retrieval, specifically in audio identification. In this application, the system is able to identify unknown music, which can be played back in different environments and with different alterations.


Fig. 6. Comparison between clean (solid lines) and distorted (dotted lines) features with time-scale modification (time compression) applied. Cepstral energy features (a) without and (b) with channel compensation using mean subtraction. Modulation-scale features (c) without and (d) with channel compensation using subband normalization.

Audio fingerprinting technology has been proposed for reporting property rights or monitoring radio broadcasts [23], [24]. In this approach, a computer or other electronic device that has access to a central database of statistics automatically labels unlabeled music. An accurate fingerprint can ideally identify audio content based solely on acoustic properties. Moreover, each song needs to be represented compactly. The audio data used in this application are commonly extracted from commercial compact discs (CDs) with a sampling rate of 44.1 kHz. The database contains a wide variety of music genres with a short duration for each passage. One of the requirements for an audio fingerprinting system, on which we focus in this work, is robustness to signal distortion. Transmission channels and users alter recordings in many ways that still leave the audio useful. Allamanche et al. [25] suggested a list of possible degradations that exist in practical applications. For example, suggested frequency distortions are low-bit-rate perceptual audio coding and EQ. Random time shifts, time-scale modification, and dynamic range compression are suggested as time distortions. A successful system must have features that do not change significantly under common distortions or modifications. Previous research into acoustic features for audio identification, which were adopted from audio retrieval, has focused on using a combination of perceptual features derived from short-term spectral features such as the cepstrum, the spectrum centroid, the spectrum bandwidth, or the energy of the signal [26]–[29]. While these features appear quite promising, they are not robust to certain distortions.

A review of audio fingerprinting systems can be found in [30]. There has also been previous research on feature extraction for audio identification. For example, Haitsma et al. [31] used a bit-string feature known as a hash table. The advantage of a hash table is small feature dimension and fast table lookup. Burges et al. [32] used a long-term analysis window, about 6.1 s, for extracting a noise-robust audio feature. Their proposed long-term features showed robustness to unseen distortions. Peeters et al. [33] improved music summarization and fingerprinting using dynamic features derived from long-term feature analysis. Mihcak and Venkatesan [34] used a new method exploiting long-term correlation in both the time and frequency dimensions for feature extraction. Their approach provided better performance than methods exploiting either time or frequency correlation only. Fragoulis et al. [35] used features based on Fourier spectral energy for identifying audio retrieved from several CDs and radio stations. Allamanche et al. [25] instead used several short-term features such as loudness, spectral flatness, and sharpness. While the accuracy of each feature was quite high when tested with separate distortions, this work also suggested that short-term features failed to provide high recognition rates under both time and frequency distortions.

We will show, via the experiments in the next section, that normalized modulation-scale features are more robust to distortions than short-term features with mean subtraction in content-based audio identification.


TABLE I
EXPERIMENT SETUP FOR AUDIO FINGERPRINTING

These new normalized features can maintain excellent labeling (classification) accuracy under common unintentional and intentional distortions.

B. Data Collection

Actual audio files were used in this work, taken from a large database with 9959 songs. The database contained a wide variety of music genres with 15-s passages, each starting approximately 30 s from the beginning of the song. From a total of 41.5 h of music, we divided the database into two sets of independent songs, drawn from independent CDs. The distorted passages created from the first set, consisting of 3393 distinct songs, formed a design set. This set was used to empirically optimize free design parameters such as a threshold. The distorted passages created from the second set, consisting of 6566 distinct songs, formed a test set used to measure the accuracy of the method.

C. Robustness to Distortions

During the design, we chose a subset of the degradations listed in Table I for a total of five types of distortion. Two different frequency equalizers, using the default presets (pop and rock) from Winamp, were used for frequency distortions. For time distortion, a 1-s time shift and two types of time-scale modification were considered. Time-scale modification was performed by scaling the audio signal with 5% deviation, without altering the pitch, using Sound Forge. During testing, we applied four more unseen distortions for a total of nine distortions, as listed in Table IV. To compare the performance of short-term features and our proposed long-term features, two types of features will be compared and reported: 1) mel-frequency cepstral coefficients (MFCC) with mean subtraction as feature normalization and 2) modulation-scale features with subband normalization as feature normalization. Note that both sets of features are designed to have the same dimensionality. A summary of the complete experiment for audio fingerprinting is shown in Table I. The details of short-term feature processing can be found in [26] and [28].

D. Feature Extraction and Classification

Each song was represented by a 15-s digital audio passage. This passage was resampled to 11 025 Hz and converted, via an average of the left and right channels, to mono. The resampled audio was windowed into multiple frames with a 4-s frame length and 1-s frame rate, corresponding to 12 frames for each passage. Subband filtering was performed to produce acoustic frequencies using a short-time Fourier transform. Since the bark frequency scale has been used in many audio applications to mimic the reduced number of channels of the human auditory system, combinations of linear acoustic frequency channels were used to reduce the feature size. With the 11 025-Hz sampling rate, there were 19 bark-spaced subband filters. The envelope in each subband was detected by a magnitude-square operator. A lowpass filter was then used to remove aliasing before the sampling rate was reduced to 512 Hz. To reduce the interference of large dc components of the subband envelope, the signal was demeaned before modulation frequency estimation. After demeaning, biorthogonal wavelet filters with eight different dyadic scales were used to produce one modulation frequency vector for each acoustic subband. After processing all frames, the 15-s passage was thus represented by 12 frames of a 19 × 8 matrix of features, our proposed audio fingerprint. The MFCC features used in the comparison were based on work in [28]. A short-term analysis window with a length of 46 ms was applied. For each 4-s window, after generating 19 MFCC dimensions and subtracting out the mean in each feature dimension independently, features in each dimension were assigned to 8 histogram bins which were precomputed from the first 1000 songs of the database. Since these histogram features remove the time order, they significantly alleviate the time distortion effects, time shift and time-scale modification. Finally, the 15-s passage was also represented by 12 frames of a 19 × 8 matrix of MFCC features.
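The sketch below strings this front end together under simplifying assumptions: mono input already at 11 025 Hz, bark grouping computed from a standard bark formula, the STFT frame rate standing in for the explicit lowpass-and-downsample-to-512-Hz stage, and the Morlet-like kernel from the Section III sketch standing in for the biorthogonal wavelets. It is an approximation for orientation, not the authors' implementation.

```python
# One 4-s frame -> a (19, 8) subband-normalized modulation-scale fingerprint.
import numpy as np
from scipy.signal import stft

def bark(f):
    """A standard bark-scale approximation (an assumed formula)."""
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def morlet_like(a, sr, base_freq=32.0, cycles=4.0):
    f0 = base_freq / a
    dur = cycles / f0
    t = np.arange(-dur, dur, 1.0 / sr)
    w = np.exp(-0.5 * (t / (dur / 2)) ** 2) * np.exp(2j * np.pi * f0 * t)
    return w / np.sum(np.abs(w))

def fingerprint_frame(x, fs=11025, num_bands=19, num_scales=8):
    hop = int(fs / 512)                        # ~512-Hz envelope sampling
    f, t, Z = stft(x, fs, nperseg=4 * hop, noverlap=3 * hop)
    env = np.abs(Z) ** 2                       # subband envelopes
    idx = np.minimum((bark(f) / bark(fs / 2.0) * num_bands).astype(int),
                     num_bands - 1)            # group FFT bins into bark bands
    sub = np.array([env[idx == b].sum(axis=0) for b in range(num_bands)])
    sub = sub - sub.mean(axis=1, keepdims=True)  # demean before modulation analysis
    feats = np.zeros((num_bands, num_scales))
    for i in range(num_scales):                # eight dyadic scales
        w = morlet_like(2.0 ** i, fs / hop)
        for b in range(num_bands):
            feats[b, i] = np.sum(np.abs(np.convolve(sub[b], w, "same")) ** 2)
    return feats / (feats.sum(axis=1, keepdims=True) + 1e-12)  # subband norm.

x = np.random.default_rng(0).standard_normal(4 * 11025)  # stand-in 4-s frame
print(fingerprint_frame(x).shape)                         # -> (19, 8)
```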

Classification of a test signal was performed after feature extraction. The testing pattern to be recognized was compared to the reference templates.


TABLE II
PERCENT DETECTION WITH A FIXED FALSE-ALARM RATE OF 0.0001% FOR MFCC AND MODULATION-SCALE FEATURES WITH PERFECT TIME ALIGNMENT

A distortion measurement, e.g., Euclidean distance, can be applied to measure the similarity between an unknown feature and the reference features. Using Euclidean distance implicitly assumes the features are uncorrelated Gaussian with identical variance in each feature dimension, which may not be a reasonable assumption. Due to the information-theoretic optimality [36] and the computational advantages, a cross-entropy approach was considered in this work. As derived in [9], we define a feature set representing a test song as A = {A_1, …, A_T}, where A_t is the modulation-scale feature matrix for the time frame at time t, and T is the total number of frames, or the integration time, in an audio passage used to represent a song. For passages in the database, we also define the feature set of the reference audio as R_k = {R_{k,1}, …, R_{k,T}}, where k is the ordered index in the database and R_{k,t} is the reference feature matrix at time t. By using a cross-entropy-based template match, the distortion between the test song and the kth database entry becomes

D_k = Σ_{t=1}^{T} Σ_{i,j} Ā_t(i, j) log [Ā_t(i, j) / R̄_{k,t}(i, j)]    (20)

where Ā_t(i, j) = A_t(i, j) / Σ_{i,j} A_t(i, j) and R̄_{k,t}(i, j) = R_{k,t}(i, j) / Σ_{i,j} R_{k,t}(i, j). Since the distortion value is nonnegative, a similarity value, bounded between 0 and 1, can be determined by exponentiation of the distortion score, s_k = exp(−D_k).
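A sketch of the cross-entropy template match, following the form of (20) reconstructed above: each frame's feature matrix is normalized to sum to one, the distortion is accumulated over frames, and exponentiation maps it to a similarity in (0, 1].

```python
# Cross-entropy-based template match between a test song and one reference.
import numpy as np

def normalize(M, eps=1e-12):
    return (M + eps) / (M + eps).sum()

def distortion(test_frames, ref_frames):
    """Both arguments: lists of (19, 8) nonnegative feature matrices."""
    d = 0.0
    for A, R in zip(test_frames, ref_frames):
        An, Rn = normalize(A), normalize(R)
        d += np.sum(An * np.log(An / Rn))       # per-frame cross-entropy term
    return d

def similarity(test_frames, ref_frames):
    return np.exp(-distortion(test_frames, ref_frames))    # in (0, 1]

frames = [np.random.default_rng(i).random((19, 8)) for i in range(12)]
print(similarity(frames, frames))               # identical song -> 1.0
```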

E. Experimental Results of the Design Set

First, we assume that there was perfect time alignment. A total of 16 965 songs were used to test the performance. Three different integration times, 5, 10, and 15 s, corresponding to 2, 7, and 12 frames, were evaluated. The similarity scores of intra-song pairs (the same song regardless of any distortion) and inter-song pairs (different songs across all distortions) were used to find a receiver operating characteristic (ROC). For a fixed false-positive rate of 0.0001%, the percent detection of each feature for each distortion and integration time is shown in Table II. Due to the histogram processing, the effect of time-scale modification was alleviated for the MFCC features, making the results of both features comparable in most cases. Next, we considered the more realistic situation in which there is time misalignment. The same data set with a 1-s time shift was added. For a fixed false-positive rate of 0.0001%, the percent detection of each feature is shown in Table III. When the integration time was small, or the ratio of time shift to integration time was high, time misalignment decreased the detection results significantly. As shown in Table III, the performance of the modulation-scale features was better than the MFCC features in all cases, especially at a small integration time. At a false-alarm rate of 0.0001%, the percent detection for different integration times, from 4 to 15 s, with all distortions is shown in Fig. 7. The proposed long-term features yielded higher detection results than conventional short-term features at all integration times. Although the time shift effect decreased with increasing integration time, modulation-scale features provided insignificant further improvement after 10 s. Fig. 8 shows the ROC curve at a 10-s integration time for both features with and without time shift. When time misalignment was included, the percent detection significantly decreased for MFCC features, from 97.5% to 90.0%, compared to modulation-scale features, from 97.5% to 96.7%. For subsequent testing, the rejection threshold at a false-alarm rate of 0.0001% and a 10-s integration time was chosen. The thresholds were 0.8994 and 0.5647 for MFCC and modulation-scale features, respectively.
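The threshold selection described above can be sketched as follows: given similarity scores for matching and nonmatching pairs, sweep candidate thresholds downward and keep the smallest one whose empirical false-alarm rate stays at or below the target. The scores below are synthetic stand-ins; note that empirically resolving a 0.0001% false-alarm rate requires on the order of the paper's billion pairwise comparisons.

```python
# Pick a rejection threshold at a fixed false-alarm rate, then report detection.
import numpy as np

def threshold_at_far(intra, inter, target_far):
    best = 1.0                                  # accept nothing by default
    for th in np.sort(inter)[::-1]:             # candidate cutoffs, descending
        if np.mean(inter >= th) > target_far:
            break
        best = th
    return best, np.mean(intra >= best)         # (threshold, detection rate)

rng = np.random.default_rng(0)
intra = np.clip(rng.normal(0.9, 0.05, 10000), 0, 1)   # same-song similarities
inter = np.clip(rng.normal(0.3, 0.10, 10000), 0, 1)   # different-song similarities
print(threshold_at_far(intra, inter, target_far=1e-3))
```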

F. Experimental Results of the Test Set

During the test, 6566 clean songs were used to generate the distorted songs. Additional distortions were included to test the robustness to unseen noise. Two low-bit-rate perceptual audio codecs, MP3 and Windows Media Audio, at 64 kbps were considered. Another frequency equalizer preset (classical) and dynamic range normalization were also added. After generating all the distortions using a 10-s integration time, there were in total 118 188 songs, or 328.3 h.


TABLE III
PERCENT DETECTION WITH A FIXED FALSE-ALARM RATE OF 0.0001% FOR MFCC AND MODULATION-SCALE FEATURES WITH A ONE-SECOND TIME SHIFT

Fig. 7. Comparison of percent detection between modulation-scale and MFCC features when time, frequency, and 1-s time misalignment distortions were applied.

To compare the performance of short-term and long-term features, the percent detection of MFCC and modulation-scale features was computed using the earlier rejection threshold. Table IV shows the comparison of both features for each distortion. Again, modulation-scale features yielded a higher detection rate than MFCC features for all distortions when time shift was included. Using all distortions, the percent detection of MFCC features was 88.2%, whereas that of modulation-scale features was 97.4%, or a 78.0% reduction in error rate.

One approach to reducing the computational search burden in nearest neighbor classification is the partial distance. By computing only a subset of the features, if the accumulated distortion is above the threshold, then the search stops. Applying this to modulation-scale features, we can estimate the similarity for each acoustic subband independently, unlike Euclidean distance, where each feature is independent. Given the similarity threshold θ of the modulation-scale features, 0.5647 in these experiments, if the distortion accumulated through the current acoustic (bark) frequency subband is above the updated threshold, then the comparison stops, as in (21) below.

Fig. 8. ROC curve of modulation-scale and MFCC features at 10-s integration time with all distortions.

Σ_{j=1}^{J} Σ_{t=1}^{T} Σ_{i} Ā_t(i, j) log [Ā_t(i, j) / R̄_{k,t}(i, j)] > −log θ.    (21)

Fig. 9 illustrates the distribution of the acoustic (bark) frequency subband numbers upon reaching the stopping condition, from over one billion audio pairwise comparisons. The results showed that 99.9% of rejections occurred in the acoustic frequency range below 1080 Hz, the ninth acoustic (bark) frequency subband. The above partial distance method reduced the computational complexity by a factor of eight over a full search. In this work, our focus was to evaluate objective performance; however, it is possible to implement this system in real time using fast algorithms, e.g., a real-time modulation spectrum [6] or bit-vector indexing [37]. Future work includes testing with different distortions, e.g., background noise, speech codecs, and pitch shifting.
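A sketch of that partial-distance search, using the threshold 0.5647 quoted above and the cross-entropy form reconstructed in (20): per-subband distortion contributions are accumulated in acoustic frequency order, and the comparison is abandoned once the running total exceeds −log θ, as in (21).

```python
# Partial-distance early termination over acoustic (bark) subbands.
import numpy as np

def partial_distance_reject(test_frames, ref_frames, theta=0.5647):
    """Return (accepted, subbands_examined) for one database comparison."""
    eps = 1e-12
    An = [(A + eps) / (A + eps).sum() for A in test_frames]
    Rn = [(R + eps) / (R + eps).sum() for R in ref_frames]
    limit = -np.log(theta)                      # distortion budget, -log(theta)
    total = 0.0
    num_bands = An[0].shape[0]
    for j in range(num_bands):                  # low acoustic subbands first
        total += sum(np.sum(a[j] * np.log(a[j] / r[j]))
                     for a, r in zip(An, Rn))
        if total > limit:
            return False, j + 1                 # early rejection, as in (21)
    return True, num_bands

rng = np.random.default_rng(0)
test = [rng.random((19, 8)) for _ in range(12)]
ref = [rng.random((19, 8)) for _ in range(12)]
print(partial_distance_reject(test, ref))       # a random non-match rejects early
```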


TABLE IV
PERCENT DETECTION OF MFCC AND MODULATION-SCALE FEATURES USING THE EARLIER REJECTION THRESHOLD

Fig. 9. Distribution of the acoustic (bark) frequency subband numbers upon reaching the stopping condition from all testing songs.

V. CONCLUSIONS

This paper has discussed a new approach to modulation spectra. The proposed long-term feature analysis, which is partially motivated by the human auditory system, can potentially be applied to nonstationary signals such as speech, music, or communication signals. There are two key practical contributions proposed in this paper: 1) nonuniform modulation frequency decomposition and 2) channel compensation of modulation spectral features. Compared to uniform modulation frequency decomposition, the new method has shown the advantages of low dimensionality and high discrimination of distinct modulation frequencies. The proposed channel compensation of modulation-scale features, subband normalization, has also demonstrated the advantage of insensitivity to frequency and time distortions over cepstral mean subtraction. The experiment in content identification, with over one billion audio pairwise comparisons, showed that conventional MFCC features yielded 88.2% correct detection whereas modulation-scale features yielded 97.4% correct detection. Our analysis shows that incorporating modulation-scale features into an automatic classification system substantially improves accuracy. The resulting approach has potential for a wide range of acoustic classification or detection applications such as speaker identification/verification, audio classification and segmentation, or digital communication signal recognition.

ACKNOWLEDGMENT

One of the authors, S. Sukittanon, would like to thank Dr. C. J. C. Burges for useful discussions.

REFERENCES

[1] D. H. Kil and F. B. Shin, Pattern Recognition and Prediction With Applications to Signal Characterization. Woodbury, NY: AIP, 1996.

[2] R. K. Fish, “Dynamic models of machining vibrations, designed for classification of tool wear,” Ph.D. dissertation, Univ. Washington, Seattle, WA, 2001.


[3] Information Technology—Generic Coding of Moving Pictures and Associated Audio, Part 7: Advanced Audio Coding, ISO/IEC JTC1/SC29/WG11 MPEG, IS 13818-7, 1997.

[4] J. S. Bendat and A. G. Piersol, Random Data: Analysis and Measurement Procedures, 2nd ed. New York: Wiley, 1986.

[5] R. Crochiere, S. Webber, and J. Flanagan, “Digital coding of speech in sub-bands,” in Proc. ICASSP, vol. 1, 1976, pp. 233–236.

[6] M. S. Vinton and L. E. Atlas, “Scalable and progressive audio codec,” in Proc. ICASSP, vol. 5, 2001, pp. 3277–3280.

[7] H. Hermansky, “Should recognizers have ears?,” Speech Commun., vol. 25, pp. 3–27, 1998.

[8] D. W. Massaro, “Preperceptual images, processing time, and perceptual units in auditory perception,” Psych. Rev., vol. 79, pp. 124–145, 1972.

[9] S. Sukittanon and L. E. Atlas, “Modulation frequency features for audio fingerprinting,” in Proc. ICASSP, vol. 2, 2002, pp. 1173–1176.

[10] H. Dudley, “Remaking speech,” J. Acoust. Soc. Amer., vol. 11, pp. 169–177, 1939.

[11] T. Houtgast, H. J. M. Steeneken, and R. Plomp, “Predicting speech intelligibility in rooms from the modulation transfer function. I. General room acoustics,” Acustica, vol. 46, pp. 60–72, 1980.

[12] S. Greenberg and B. E. D. Kingsbury, “The modulation spectrogram: In pursuit of an invariant representation of speech,” in Proc. ICASSP, vol. 3, 1997, pp. 1647–1650.

[13] H. Hermansky, S. Furui, B. H. Juang, and W. Chou, “The modulation spectrum in the automatic recognition of speech,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 1997, pp. 140–147.

[14] S. Sheft and W. A. Yost, “Temporal integration in amplitude modulation detection,” J. Acoust. Soc. Amer., vol. 88, pp. 796–805, 1990.

[15] T. Houtgast, “Frequency selectivity in amplitude-modulation detection,” J. Acoust. Soc. Amer., vol. 85, pp. 1676–1680, 1989.

[16] S. Ewert and T. Dau, “Characterizing frequency selectivity for envelope fluctuations,” J. Acoust. Soc. Amer., vol. 108, pp. 1181–1196, 2000.

[17] W. A. Gardner, Statistical Spectral Analysis: A Nonprobabilistic Theory. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[18] L. Cohen, Time-Frequency Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1995.

[19] S. Sukittanon, L. E. Atlas, J. W. Pitton, and J. McLaughlin, “Nonstationary signal classification using joint frequency analysis,” in Proc. ICASSP, vol. 6, 2003, pp. VI-453–VI-456.

[20] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Englewood Cliffs, NJ: Prentice-Hall, 2002.

[21] R. J. Mammone, Z. Xiaoyu, and R. P. Ramachandran, “Robust speaker recognition: A feature-based approach,” IEEE Signal Processing Mag., vol. 13, pp. 58–71, Jan. 1996.

[22] H. Hermansky and S. Sharma, “Temporal patterns (TRAPs) in ASR of noisy speech,” in Proc. ICASSP, vol. 1, 1999, pp. 289–292.

[23] RIAA and IFPI, Request for Information on Audio Fingerprinting Technologies, 2001.

[24] M. Wells, V. Venkatachalam, L. Cazzanti, K. F. Cheung, N. Dhillon, and S. Sukittanon, “Automatic identification of sound recordings,” U.S. Patent 20 030 086 341, May 2003.

[25] E. Allamanche, J. Herre, O. Hellmuth, B. Froba, and M. Cremer, “AudioID: Toward content-based identification of audio material,” in Proc. 110th AES Conv., Amsterdam, The Netherlands, 2001.

[26] S. Z. Li, “Content-based audio classification and retrieval using the nearest feature line method,” IEEE Trans. Speech Audio Processing, vol. 8, pp. 619–625, Sept. 2000.

[27] E. Wold, T. Blum, D. Keislar, and J. Wheaten, “Content-based classification, search, and retrieval of audio,” IEEE Multimedia, vol. 3, pp. 27–36, 1996.

[28] J. T. Foote, “Content-based retrieval of music and audio,” in Proc. SPIE, vol. 3229, 1997, pp. 138–147.

[29] P. Cano, E. Batlle, H. Mayer, and H. Neuschmied, “Robust sound modeling for song detection in broadcast radio,” in Proc. 112th AES Conv., Munich, Germany, 2002.

[30] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of algorithms for audio fingerprinting,” in Proc. IEEE Workshop on Multimedia Signal Processing, 2002, pp. 169–173.

[31] J. Haitsma, T. Kalker, and J. Oostveen, “Robust audio hashing for content identification,” in Proc. Workshop on Content-Based Multimedia Indexing, Brescia, Italy, 2001.

[32] C. J. C. Burges, J. C. Platt, and S. Jana, “Distortion discriminant analysis for audio fingerprinting,” IEEE Trans. Speech Audio Processing, vol. 11, pp. 165–174, Mar. 2003.

[33] G. Peeters, A. L. Burthe, and X. Rodet, “Toward automatic music audio summary generation from signal analysis,” in Proc. Int. Conf. Music Information Retrieval and Related Activities, Paris, France, 2002.

[34] M. K. Mihcak and R. Venkatesan, “A perceptual audio hashing algorithm: A tool for robust audio identification and information hiding,” in Proc. Int. Information Hiding Workshop, Pittsburgh, PA, 2001.

[35] D. Fragoulis, G. Rousopoulos, T. Panagopoulos, C. Alexiou, and C. Papaodysseus, “On the automated recognition of seriously distorted musical recordings,” IEEE Trans. Signal Processing, vol. 49, pp. 898–908, Apr. 2001.

[36] J. E. Shore and R. M. Gray, “Minimum cross-entropy pattern classification and cluster analysis,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, pp. 11–17, 1982.

[37] J. Goldstein, J. C. Platt, and C. J. C. Burges, “Indexing high dimensional rectangles for fast multimedia identification,” Microsoft Research Tech. Rep. MSR-TR-2003-38, 2003.

Somsak Sukittanon (S’02–M’04) received the B.E. degree (first-class honors) from the King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand, in 1995 and the M.S. and Ph.D. degrees in electrical engineering from the University of Washington, Seattle, in 1999 and 2004, respectively.

During 1999 and 2000, he was an intern at Cantametrix Inc., Bellevue, WA (since acquired by Gracenote, Inc., Emeryville, CA). He created several parametric music content extractors that contributed to a novel and powerful music content search capability. In the summer of 2003, he was an intern at Microsoft Research, Redmond, WA, in the Communication, Collaboration, and Signal Processing Group. His research interests include audio signal processing, audio classification and retrieval, embedded systems, speech enhancement, and machine learning. He also holds a patent on content-based audio identification.

Les E. Atlas (M’82–SM’03–F’04) received the B.S.E.E. degree from the University of Wisconsin, Madison, in 1978 and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1979 and 1984, respectively.

He joined the University of Washington, Seattle, in 1984, where he is a Professor of Electrical Engineering. His research is in digital signal processing, with specializations in acoustic analysis, time-frequency representations, and signal recognition and coding. His research is supported by DARPA, the Office of Naval Research, the Army Research Lab, and the Washington Research Foundation.

Dr. Atlas received a National Science Foundation Presidential Young Investigator Award and a 2004 Fulbright Senior Research Scholar Award. He was General Chair of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, Chair of the IEEE Signal Processing Society Technical Committee on Theory and Methods, and a member-at-large of the Signal Processing Society’s Board of Governors.

James W. Pitton (S’85–M’94) received the B.S.E. degree in 1985 and the M.S.E. degree in 1986 from the University of Michigan, Ann Arbor, and the Ph.D. degree in 1994 from the University of Washington, Seattle, all in electrical engineering.

From 1986 to 1989, he was a Member of Technical Staff with the Signal Processing Division, The Analytic Sciences Corporation, Reston, VA. From 1994 to 1995, he was with the Speech Research Department, AT&T Bell Laboratories, Murray Hill, NJ. From 1996 through 1999, he was a Research Scientist with MathSoft, Seattle, WA. He is currently a Principal Engineer and Head of the Environmental and Information Systems Department at the Applied Physics Laboratory, University of Washington. His research interests include statistical signal processing, automatic classification, speech processing, and auditory science.

Dr. Pitton was Registration Chair at the IEEE ICASSP’98 in Seattle, WA.