IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.10, October 2008

Audio Data Mining Using Multi-perceptron Artificial Neural Network

1 Surendra Shetty, 2 K. K. Achary
1 Dept. of Computer Applications, NMAM Institute of Technology, Nitte, Udupi, Karnataka, India, PIN-574110
2 Dept. of Statistics, Mangalore University, Mangalagangothri, India, PIN-574199

Abstract--Data mining is the activity of analyzing a given set of data: the process of finding patterns in large relational databases. Data mining includes extracting, transforming, and loading transaction data onto the data warehouse system; storing and managing the data in a multidimensional database system; providing data access; analyzing the data with application software; and presenting it visually. Audio data contains information about each audio file, such as signal processing components (the power spectrum and cepstral values) that are representative of a particular audio file. The relationships among patterns provide information, which can be converted into knowledge about historical patterns and future trends. This work implements an artificial neural network (ANN) approach to audio data mining. Acquired audio is preprocessed to remove noise, followed by feature extraction using the cepstral method. The ANN is trained with the cepstral values to produce a set of final weights. During the testing process (audio mining), these weights are used to mine the audio file. In this work, 50 audio files have been used as an initial attempt to train the ANN. The ANN is able to produce only about 90% mining accuracy, owing to the low correlation of the audio data.

Index Terms--ANN, Backpropagation Algorithm, Cepstrum, Feature Extraction, FFT, LPC, Perceptron, Testing, Training, Weights

I. INTRODUCTION

Data mining is concerned with discovering meaningful patterns from data. It has deep roots in the fields of statistics, artificial intelligence, and machine learning.
With the advent of inexpensive storage space and faster processing over the past decade, research has started to penetrate new grounds in speech and audio processing as well as spoken language dialog. Interest has grown because audio data are available in plenty. Algorithmic advances in automatic speech recognition have also been a major enabling technology behind the growth in data mining. Currently, large-vocabulary, continuous speech recognizers are trained on record amounts of data, such as several hundreds of millions of words and thousands of hours of speech. Pioneering research in robust speech processing, large-scale discriminative training, finite state automata, and statistical hidden Markov modeling has resulted in real-time recognizers that are able to transcribe spontaneous speech. The technology is now highly attractive for a variety of speech mining applications. Audio mining research includes many ways of applying machine learning, speech processing, and language processing algorithms [1]. It helps in the areas of prediction, search, explanation, learning, and language understanding. These basic challenges are becoming increasingly important in revolutionizing business processes by providing essential sales and marketing information about services, customers, and product offerings. A new class of learning systems can be created that infer knowledge and trends automatically from data, analyze and report application performance, and adapt and improve over time with minimal or zero human involvement. Effective techniques for mining speech, audio, and dialog data can impact numerous business and government applications. The technology for monitoring conversational audio to discover patterns, capture useful trends, and generate alarms is essential for intelligence and law enforcement organizations as well as for enhancing call center operation. It is useful for analyzing, monitoring, and tracking customer preferences and interactions to better establish customized sales and technical support strategies.
It is also an essential tool in media content management for searching through large volumes of audio warehouses to find information, documents, and news.

II. TECHNICAL WORK PREPARATION

A. Problem statement

Audio files are to be mined with high accuracy given partial audio information. This can be achieved effectively using an ANN. This work implements the supervised backpropagation algorithm (BPA). The BPA is trained with the features of the audio data for different numbers of nodes in the hidden layer, and the configuration with the optimal number of nodes is chosen for proper audio mining.

Manuscript received October 5, 2008; revised October 20, 2008
B. Overview of audio mining

Audio recognition is a classic example of something the human brain does well but digital computers do poorly. Digital computers can store and recall vast amounts of data, perform mathematical calculations at blazing speeds, and do repetitive tasks without becoming bored or inefficient, yet they perform very poorly when faced with raw sensory data. Teaching a computer to understand audio is a major undertaking. Digital signal processing generally approaches the problem of audio recognition in two steps: 1) feature extraction, 2) feature matching. Each word in the incoming audio signal is isolated and then analyzed to identify the type of excitation and the resonant frequencies [2]. These parameters are then compared with previous examples of spoken words to identify the closest match. Often, these systems are limited to a few hundred words, can only accept signals with distinct pauses between words, and must be retrained. While this is adequate for many commercial applications, these limitations are humbling when compared to the abilities of human hearing.

There are two main approaches to audio mining:

1. Text-based indexing: Text-based indexing, also known as large-vocabulary continuous speech recognition, converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries.

2. Phoneme-based indexing: Phoneme-based indexing does not convert speech to text but instead works only with sounds. The system first analyzes and identifies sounds in a piece of audio content to create a phonetic-based index. It then uses a dictionary of several dozen phonemes to convert a user's search term to the correct phoneme string. (Phonemes are the smallest units of speech in a language, such as the long "a" sound, that distinguish one utterance from another. All words are sets of phonemes.) Finally, the system looks for the search terms in the index.
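The phoneme-based lookup just described can be sketched in a few lines. The phoneme dictionary, the index format, and the substring search below are invented simplifications for illustration only, not the actual system discussed in the text:

```python
# Toy phoneme dictionary; a real system uses several dozen phonemes and a
# dictionary covering the whole vocabulary (these entries are illustrative).
PHONEMES = {"data": "D EY T AH", "mining": "M AY N IH NG"}

def phoneticize(term):
    # Convert a user's search term into a phoneme string via the dictionary.
    return " ".join(PHONEMES[w] for w in term.lower().split() if w in PHONEMES)

def search(index, term):
    # index: list of (offset_seconds, phoneme_string) pairs built from audio.
    # Return the offsets whose phonetic string contains the phoneticized query.
    query = phoneticize(term)
    return [t for t, p in index if query and query in p]

index = [(0.0, "D EY T AH M AY N IH NG"), (4.2, "N EH T W ER K")]
print(search(index, "data mining"))  # -> [0.0]
```

Unknown query words simply drop out of the phoneme string, so a term with no dictionary entry matches nothing.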
A phonetic system requires a more proprietary search tool because it must phoneticize the query term and then try to match it with the existing phonetic string output. Although audio mining developers have overcome numerous challenges, several important hurdles remain. Precision is improving, but it is still a key issue impeding the technology's widespread adoption, particularly in accuracy-critical applications such as court reporting and medical dictation. Audio mining error rates vary widely depending on factors such as background noise and cross talk. Processing conversational speech can be particularly difficult because of factors such as overlapping words and background noise [3][4]. Breakthroughs in natural language understanding will eventually lead to big improvements. Audio mining is an area with many different applications. Audio identification techniques include:

- Channel vocoder
- Linear prediction
- Formant vocoding
- Cepstral analysis

There are many current and future applications for audio mining. Examples include telephone speech recognition systems and voice dialers on car phones.

C. Schematic diagram

The sequence of audio mining is shown schematically in Fig. 1.

Fig. 1. Sequence of audio processing

D. Artificial neural network

A neural network is constructed from highly interconnected processing units (nodes or neurons) which perform simple mathematical operations [5]. Neural networks are characterized by their topologies, weight vectors, and the activation function used in the hidden layers and output layer [6]. The topology refers to the number of hidden layers and the connections between nodes in the hidden layers. The activation functions that can be used are the sigmoid, hyperbolic tangent, and sine [7]. A very good account of neural networks can be found in [11]. Network models can be static or dynamic [8]. Static networks include single-layer perceptrons and multilayer perceptrons. A perceptron, or adaptive linear element (ADALINE) [9], refers to a computing unit. This forms the basic building block for neural networks.
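As a concrete illustration of such a computing unit, the following minimal sketch implements a single perceptron with a hard threshold. The weights and threshold are hand-picked for illustration (here encoding a logical AND of two binary inputs), not learned:

```python
def perceptron(inputs, weights, threshold):
    # Weighted summation of the input pattern vector by the weight vector,
    # followed by a hard threshold decision.
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0

# Hand-chosen weights make this unit compute the logical AND of two inputs.
print(perceptron([1, 1], [1.0, 1.0], 1.5))  # -> 1
print(perceptron([1, 0], [1.0, 1.0], 1.5))  # -> 0
```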
The input to a perceptron is the summation of the input pattern vector weighted by the weight vector. In Figure 2, the basic function of a single-layer perceptron is shown.
Fig. 2. Operation of a neuron

In Figure 3, a multilayer perceptron is shown schematically. Information flows in a feed-forward manner from the input layer to the output layer through the hidden layers. The numbers of nodes in the input layer and the output layer are fixed; they depend upon the number of input variables and the number of output variables in a pattern. In this work, there are six input variables and one output variable. The number of nodes in a hidden layer and the number of hidden layers are variable. Depending upon the type of application, network parameters such as the number of nodes in the hidden layers and the number of hidden layers are found by trial and error [10].

Fig. 3. Multilayer perceptron

In most applications, one hidden layer is sufficient. The activation function used to train the ANN is the sigmoid function, given by:

f(x) = 1 / (1 + exp(-x))    (1)

where f(x) is a non-linear differentiable function and

x = sum_{i=1}^{N_n} W_ij(p) x_i^n(p) + theta_j(p),

where N_n is the total number of nodes in the nth layer, W_ij is the weight connecting the ith neuron of one layer with the jth neuron of the next layer, theta is the threshold applied to the nodes in the hidden layers and output layer, and p is the pattern number. In the first hidden layer, x_i is treated as an input pattern vector; for the successive layers, x_i is the output of the ith neuron of the preceding layer. The output of a neuron in the hidden layers and in the output layer is calculated by:

x_i^{n+1}(p) = 1 / (1 + exp(-(x + theta(p))))    (2)

For each pattern, the error E(p) at the output layer is calculated by:

E(p) = (1 / (2 N_M)) sum_{i=1}^{N_M} (d_i(p) - x_i^M(p))^2    (3)

where M is the total number of layers, including the input layer and the output layer, N_M is the number of nodes in the output layer, d_i(p) is the desired output for a pattern, and x_i^M(p) is the calculated output of the network for the same pattern at the output layer.
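Equations (1)-(3) can be sketched in code as a single forward pass of the 6-6-1 network described above. The weights and thresholds below are arbitrary placeholders (real values come from BPA training), so this only illustrates the arithmetic, not the trained network:

```python
import math

def sigmoid(x):
    # Eq. (1): f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, thetas):
    # One layer: each neuron j computes sum_i W_ij * x_i plus its threshold
    # theta_j, squashed by the sigmoid (Eqs. (1)-(2)).
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + t)
            for row, t in zip(weights, thetas)]

def forward(pattern, w_hid, t_hid, w_out, t_out):
    # 6-6-1 feed-forward pass: input -> hidden layer -> output layer.
    return layer(layer(pattern, w_hid, t_hid), w_out, t_out)

def pattern_error(desired, actual):
    # Eq. (3): E(p) = (1 / (2 N_M)) * sum_i (d_i - x_i)^2
    n_m = len(desired)
    return sum((d - x) ** 2 for d, x in zip(desired, actual)) / (2.0 * n_m)

# Placeholder weights: 6 hidden neurons with 6 inputs each, 1 output neuron.
w_hid = [[0.1] * 6 for _ in range(6)]
t_hid = [0.0] * 6
w_out = [[0.2] * 6]
t_out = [0.0]
out = forward([1.05, 0.29, 1.40, 1.00, 0.47, 1.44], w_hid, t_hid, w_out, t_out)
print(out, pattern_error([1.0], out))
```

Backpropagation then adjusts W_ij and theta to reduce E(p); only the forward arithmetic is shown here.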
The total error E over all patterns is calculated by:

E = sum_{p=1}^{L} E(p)    (4)

where L is the total number of patterns.

E. Implementation

The flowchart in Fig. 4 explains the sequence of implementation of audio mining. Fifty audio files were chosen, and the following feature extraction procedure is applied.

Pre-emphasizing and windowing: Audio is intrinsically a highly non-stationary signal. Signal analysis, whether FFT-based or based on Linear Predictor Coefficients (LPC), must be carried out on short segments across which the audio signal can be assumed to be stationary. Feature extraction is performed on 20 to 30 ms windows with a 10 to 15 ms shift between two consecutive windows. To avoid problems due to the truncation of the signal, a weighting window with appropriate spectral properties must be applied to the analyzed chunk of signal. Common choices are the Hamming, Hanning, and Blackman windows.
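The framing and windowing step above can be sketched as follows, using a 25 ms Hamming window with a 10 ms shift (one choice within the 20-30 ms and 10-15 ms ranges given); the sine input and the 8 kHz sampling rate are stand-ins for a real recording:

```python
import math

def hamming(n):
    # Hamming window of length n: 0.54 - 0.46*cos(2*pi*k/(n-1)).
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frames(signal, rate, win_ms=25, shift_ms=10):
    # Split the signal into overlapping short segments and apply a Hamming
    # window to each, so every frame can be treated as quasi-stationary.
    win = int(rate * win_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    w = hamming(win)
    out, start = [], 0
    while start + win <= len(signal):
        seg = signal[start:start + win]
        out.append([s * c for s, c in zip(seg, w)])
        start += shift
    return out

# 100 ms of a 440 Hz tone sampled at 8 kHz as a stand-in signal.
sig = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
fs = frames(sig, 8000)
print(len(fs), len(fs[0]))  # -> 8 200
```

The tapered window suppresses the spectral leakage that abrupt truncation of each segment would otherwise introduce.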
Normalization: Feature normalization can be used to reduce the mismatch between signals recorded in different conditions. Normalization consists of mean removal and possibly variance normalization. Cepstral mean subtraction (CMS) is a good compensation technique for convolutive distortions. Variance normalization consists of normalizing the feature variance to one; in signal recognition it helps deal with noise and channel mismatch. Normalization can be global or local: in the first case, the mean and standard deviation are computed globally, while in the second case they are computed on a window centered on the current time.

Feature extraction method: LPC starts with the assumption that an audio signal is produced by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation to the reality of signal production. LPC analyzes the signal by estimating the formants, removing their effects from the signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the signal remaining after subtraction of the filtered modeled signal is called the residue. The numbers which describe the intensity and frequency of the buzz, the formants, and the residue signal can be stored or transmitted elsewhere. LPC synthesizes the signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter, and run the source through the filter, resulting in audio.

Steps:
1. Audio files, in mono or stereo, are recorded naturally or inside the lab, or taken from a standard database.
2. Features are extracted after removing noise, provided it is fresh audio; for existing audio, noise removal is not required.
3. Two phases are adopted: a training phase and a testing phase.
4. Training phase: in this phase, a set of representative numbers is obtained from an initial set of numbers.
The BPA is used for learning the audio files.
5. Testing phase: in this phase, the representative numbers obtained in step 4 are used along with the features obtained from a test audio file to compute an activation value. This value is compared with a threshold, and a final decision is taken to retrieve an audio file or to take further action, such as activating a system in a mobile phone.

Fig. 4. Flowchart for LPC analysis

F. Results and Discussion

Cepstrum analysis is a nonlinear signal processing technique with a variety of applications in areas such as speech and image processing. The complex cepstrum for a sequence x is calculated by finding the complex
natural logarithm of the Fourier transform of x, then taking the inverse Fourier transform of the resulting sequence. The complex cepstrum transformation is central to the theory and application of homomorphic systems, that is, systems that obey certain general rules of superposition. The real cepstrum of a signal x, sometimes called simply the cepstrum, is calculated by determining the natural logarithm of the magnitude of the Fourier transform of x, then obtaining the inverse Fourier transform of the resulting sequence. It is difficult to reconstruct the original sequence from its real cepstrum, as the real cepstrum is based only on the magnitude of the Fourier transform of the sequence.

Table 1 gives the cepstral coefficients for 25 sample audio files. Each row is a pattern used for training the ANN with the BPA. The topology of the ANN used is 6-6-1: 6 nodes in the input layer, 6 nodes in the hidden layer, and 1 node in the output layer, for proper training of the ANN followed by audio mining.

TABLE 1
CEPSTRAL FEATURES OBTAINED FROM SAMPLE AUDIO FILES

S. No | F1     | F2     | F3     | F4     | F5     | F6     | Target labeling
1     | 1.0509 | 0.2882 | 1.3973 | 1.0009 | 0.4728 | 1.4385 | 1.01
2     | 0.5265 | 0.595  | 1.23   | 0.4965 | 0.6709 | 1.266  | 1.02
3     | 0.7240 | 0.477  | 1.208  | 0.624  | 0.489  | 1.080  | 1.03
4     | 0.5834 | 0.5237 | 1.0096 | 0.3820 | 0.5322 | 0.8734 | 1.04
5     | 1.890  | 0.2859 | 1.367  | 0.900  | 0.4903 | 1.2222 | 1.05
6     | 1.6472 | 0.2045 | 1.736  | 0.9678 | 0.734  | 1.530  | 1.06
7     | 0.5673 | 0.5900 | 1.4    | 0.553  | 0.6330 | 1.2625 | 1.07
8     | 0.5608 | 0.630  | 1.0655 | 0.573  | 0.5404 | 1.095  | 1.08
9     | 0.8946 | 0.3685 | 1.256  | 0.8690 | 0.4926 | 1.3202 | 1.09
10    | 3.2865 | 1.4042 | 2.8429 | 0.322  | 0.4480 | 0.0497 | 1.10
11    | 0.5844 | 0.5030 | 0.9480 | 0.5972 | 0.3445 | 1.0805 | 1.11
12    | 0.5526 | 0.5753 | 1.038  | 0.499  | 0.608  | 1.56   | 1.12
13    | 0.067  | 0.529  | 0.7697 | 0.270  | 0.5747 | 0.8090 | 1.13
14    | 0.4723 | 0.5623 | 0.982  | 0.3272 | 0.4989 | 0.794  | 1.14
15    | 1.3434 | 1.3344 | 0.708  | 1.7522 | 0.5459 | 1.0386 | 1.15
16    | 1.2278 | 0.2356 | 1.53   | 1.2207 | 0.4238 | 1.68   | 1.16
17    | 0.7858 | 0.4885 | 1.2054 | 0.6380 | 0.5747 | 1.776  | 1.17
18    | 0.2584 | 1.3039 | 0.095  | 0.4740 | 0.205  | 0.0947 | 1.18
19    | 0.6000 | 0.6270 | 1.0688 | 0.3905 | 0.6478 | 1.0870 | 1.19
20    | 0.8567 | 0.3006 | 1.244  | 0.733  | 0.546  | 1.29   | 1.20
21    | 1.84   | 0.3790 | 1.6468 | 0.9274 | 0.674  | 1.5006 | 1.21
22    | 3.073  | 1.03   | 1.4705 | 2.7696 | 0.4590 | 2.3783 | 1.22
23    | 1.2440 | 0.2540 | 1.3620 | 0.679  | 0.6099 | 0.9259 | 1.23
24    | 0.0384 | 0.534  | 0.4769 | 0.3072 | 0.552  | 0.7388 | 1.24
25    | 0.0384 | 0.534  | 0.4769 | 0.3072 | 0.552  | 0.7388 | 1.25

F1 to F6 are cepstral values; more than 6 values can be chosen for an audio file. Target labeling should be less than 1 and greater than zero. When the number of audio files increases, more decimal places have to be incorporated in the labels.

III. CONCLUSION

Audio of common birds and pet animals was recorded casually. Each audio file is suitably preprocessed, followed by cepstral analysis and training of the ANN using the BPA. A set of final weights for the 6-6-1 configuration is obtained after 7350 iterations, reaching a mean squared error of 1.025. Fifty patterns have been used for training the ANN, and thirty patterns were used for testing (audio mining). The results are close to 90% mining accuracy, as the audio was recorded in the open. The recognition percentage and audio mining accuracy have to be tested with a large number of new audio files from the same set of birds and pet animals.

IV. REFERENCES

[1] Lie Lu and Hong-Jiang Zhang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, 10:504-516, October 2002.
[2] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.
[3] Haleh Vafaie and Kenneth De Jong, "Feature space transformation using genetic algorithms," IEEE Intelligent Systems, 13(2):57-65, March/April 1998.
[4] Usama M. Fayyad, "Data Mining and Knowledge Discovery: Making Sense Out of Data," IEEE Expert, October 1996, pp. 20-25.
[5] Fortuna L, Graziani S, LoPresti M and Muscato G (1992), "Improving back-propagation learning using auxiliary neural networks," Int. J. of Control, 55(4), pp. 793-807.
[6] Lippmann R P (1987), "An introduction to computing with neural nets," IEEE Acoustics, Speech and Signal Processing Magazine, Vol. 4, No. 2, pp. 4-22.
[7] Yao Y L and Fang X D (1993), "Assessment of chip forming patterns with tool wear progression in machining via neural networks," Int. J. Mach. Tools & Mfg., 33(1), pp. 89-102.
[8] Hush D R and Horne B G (1993), "Progress in supervised neural networks," IEEE Signal Processing Magazine, pp. 8-39.
[9] Bernard Widrow (1990), "30 years of adaptive neural networks: Perceptron, madaline, and backpropagation," Proc. of the IEEE, 78(9), pp. 1415-1442.
[10] Hirose Y, Yamashita K and Hijiya S (1991), "Back-propagation algorithm which varies the number of hidden units," Neural Networks, 4, pp. 61-66.
[11] Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition.

V. ABOUT THE AUTHORS

Surendra Shetty is working as a Lecturer in the MCA Department at NMAMIT, Nitte, Karkala Taluk, Udupi District, Karnataka, INDIA. He received the MCA degree and is pursuing a Ph.D. in the area of Data Mining at Mangalore University. His main research interests include Data Mining, Artificial Intelligence, and Signal Processing.

Dr. K. K. Achary is a Professor of Statistics in the Dept. of Post-graduate Studies and Research in Statistics at Mangalore University, India. He holds an M.Sc. degree in Statistics and a Ph.D. in Applied Mathematics from the Indian Institute of Science, Bangalore, India. His current research interests include Stochastic Models, Inventory Theory, Face Recognition Techniques, and Audio Data Mining. His research papers have appeared in European Journal of Operational Research, Journal of the Operational Research Society, Opsearch, CCERO, International Journal of Information and Management Sciences, Statistical Methods, and American Journal of Mathematical and Management Sciences.