Design and Implementation of Speech Recognition System Based on Field Programmable Gate Array

Modern Applied Science, Vol. 3, No. 8, August 2009

Haitao Zhou
Information and Communication Department, Tianjin Polytechnic University
Tianjin 300160, China
E-mail: zhouann@26.com

Xiaojun Han
Information and Communication Department, Tianjin Polytechnic University
Tianjin 300160, China

The research is financed by the Applied Program of Basic Research of Tianjin (08JCYBJC4700).

Abstract

In this paper, a Hidden Markov Model (HMM) speech recognition system based on a Field Programmable Gate Array (FPGA) is designed. The paper introduces the principle of the speech recognition algorithm and derives the hardware framework accordingly. In the HMM recognition module, the conventional Viterbi algorithm has been improved and the recognition speed has been increased. The core part of the hardware is an EP2S60F1020C3 FPGA chip. The experimental results show that the speech recognition accuracy reaches 94% when ten numbers are recognized, and the average recognition time is 0.669 s.

Keywords: Field Programmable Gate Array, Hidden Markov Model, Speech Recognition, Viterbi Algorithm

1. Introduction

As a new and convenient means of human-machine interaction, speech recognition is widely applied in many portable embedded speech products. The ultimate aim of speech recognition is to make machines understand natural language. It is of great significance not only in practical applications but also in scientific research. Research on speech recognition technology mainly concentrates on two aspects: one is software running on computers, the other is embedded systems. The advantages of embedded systems are high performance, convenience and low cost, and they have huge potential for development. FPGAs have the advantages of a short development cycle, low-cost design and low risk. In recent years, the FPGA has become a key component of high-performance digital signal processing systems in the digital communication, network, video and image fields. In this paper, the design was implemented on an EP2S60F1020C3 FPGA sitting on a Stratix II development board.

2. Speech Recognition Basics

Fig. 1 shows the speech recognition algorithm flow.
A typical speech recognition system starts with the Mel Frequency Cepstrum Coefficient (MFCC) feature analysis stage, which is composed of the following items:
1) Pre-emphasis.
2) Divide the speech signal into frames.
3) Apply the Hamming window.
4) Compute the MFCC feature.

The second stage is the vector quantization stage. In this stage, a codebook is used to quantize the MFCC feature and obtain the MFCC feature vector. The codebook is generated on a computer via the LBG algorithm and is downloaded to ROM.

The last stage is recognition, which is performed by using a set of statistical models, i.e. hidden Markov models (HMM). In this stage, the probability that the MFCC feature vector was generated by each model is computed, and the result is the model which generated the largest probability.

2.1 MFCC Feature analysis

Figure 2 shows the process of creating MFCC features. The first step is to take the Discrete Fourier Transform (DFT) of each frame. A certain number of zeros is added to the end of the time-domain signal $s(n)$ of each frame in order to form a sequence of length $N$. Then the DFT of each frame is taken to get the linear spectrum $X(k)$.

In the second step, the linear spectrum $X(k)$ is multiplied by the Mel frequency filter banks and converted to the Mel spectrum. The Mel frequency filter banks are several band pass filters $H_m(k)$, and each band pass filter is defined as follows:

$$
H_m(k) =
\begin{cases}
0, & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\
0, & k > f(m+1)
\end{cases}
\qquad (0 \le m < M) \tag{1}
$$

where $0 \le m < M$, $M$ is the number of band pass filters, and $f(m)$ is the central frequency.

The third step is to take the logarithm of the Mel spectrum to get the logarithmic spectrum $S(m)$. Thus, the transfer function from the linear spectrum $X(k)$ to the logarithmic spectrum $S(m)$ is

$$
S(m) = \ln\left( \sum_{k=0}^{N-1} |X(k)|^2 H_m(k) \right) \qquad (0 \le m < M) \tag{2}
$$

In the last step, the logarithmic spectrum $S(m)$ is transformed into the cepstrum domain by the Discrete Cosine Transform (DCT) in order to yield the MFCC feature.

2.2 Vector Quantization

In this paper, because the discrete hidden Markov model is used, it is necessary to transform the continuous MFCC feature into a discrete MFCC feature. Vector quantization maps one $K$-dimensional vector $X \in \tilde{X} \subset R^K$ to another $K$-dimensional quantized vector $Y \in \tilde{Y}_N = \{ Y_1, Y_2, \ldots, Y_N \mid Y_i \in R^K \}$, where $X$ is the input vector, $Y$ is the quantized vector or codeword, $\tilde{X}$ is the source space, $\tilde{Y}_N$ is the output space, $N$ is the size of the codebook, and $R^K$ is the $K$-dimensional Euclidean space. The process of quantizing vector $X$ is to search the codebook $\tilde{Y}_N$ for the codeword which is nearest to $X$.
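In this paper, the squared distortion measure is applied to calculate distortion, which is defined as

$$
d(X, Y) = \lVert X - Y \rVert_2^2 \tag{3}
$$

2.3 HMM Recognition

The role of HMM recognition is to find, for the given feature vector, the HMM with the maximum probability of having generated it. In this paper, the Viterbi algorithm is used to solve the problem, and an improved algorithm is proposed based on the original algorithm.

Given the HMM parameters $\lambda = \{\pi, A, B\}$ ($\pi = \{\pi_i\}$, $A = \{a_{ij}\}$, $B = \{b_{jk}\}$) and the observation sequence $O = O_1, O_2, \ldots, O_T$, where $N$ is the number of HMM states, $\delta_t(i)$ is the highest probability along a single path at time $t$ which accounts for the first $t$ observations and ends in state $i$, and $\varphi_t(i)$ is the preceding HMM state (at time $t-1$) on that path. The detailed algorithm is defined as follows:

1) Initialization

$$
\delta_1(i) = \pi_i \, b_i(O_1), \qquad \varphi_1(i) = 0 \qquad (1 \le i \le N) \tag{4}
$$

2) Recursion

$$
\delta_{t+1}(j) = \left[ \max_{1 \le i \le N} \{ \delta_t(i) \, a_{ij} \} \right] b_j(O_{t+1}), \qquad
\varphi_{t+1}(j) = \arg\max_{1 \le i \le N} \{ \delta_t(i) \, a_{ij} \} \qquad (1 \le j \le N) \tag{5}
$$

3) Termination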
$$
P = \max_{1 \le i \le N} [\delta_T(i)], \qquad q_T = \arg\max_{1 \le i \le N} [\delta_T(i)] \tag{6}
$$

4) Path backtracking

$$
q_t = \varphi_{t+1}(q_{t+1}) \qquad (t = T-1, T-2, \ldots, 1) \tag{7}
$$

5) Algorithm improving

In practice, $\pi$, $A$ and $B$ are decimal fractions between 0 and 1. Decimal fraction operations are not well suited to FPGA implementation, because repeated multiplication of fractions may cause gross underflow when $T$ is larger than a threshold. So it is important to take the logarithm of $\pi$, $A$ and $B$ before operation. When $\pi$, $A$ and $B$ are transformed to the logarithmic probabilities $\pi'$, $A'$ and $B'$, floating point multiplication is transformed to integer addition. In addition, the sign bit is taken out before operation: the logarithm of a probability is never positive, so only its magnitude is kept, and the maximization over probabilities becomes a minimization over magnitudes. Equations (4) and (5) should therefore be changed to

$$
\delta'_1(i) = \pi'_i + b'_i(O_1), \qquad \varphi_1(i) = 0 \qquad (1 \le i \le N) \tag{8}
$$

$$
\delta'_{t+1}(j) = \left[ \min_{1 \le i \le N} \{ \delta'_t(i) + a'_{ij} \} \right] + b'_j(O_{t+1}), \qquad
\varphi_{t+1}(j) = \arg\min_{1 \le i \le N} \{ \delta'_t(i) + a'_{ij} \} \qquad (1 \le j \le N) \tag{9}
$$

Thus (8) and (9) are the expressions of the improved algorithm.

3. Design of Speech recognition hardware

3.1 Design of MFCC module hardware

As shown in Fig. 3, the MFCC module consists of a DFT module, Mel filter banks, an endpoint detection module, a logarithm operation module, a DCT module, an output control module and a control module. The speech signal is sampled at a sample rate of 8 kHz. Each speech frame is composed of 256 24-bit sample points. Data are sent to the Mel filter bank under the control of the control element, and the result of the calculation is reported back to the control element. The output of the Mel filter bank is passed to the logarithm computation unit and the DCT module to calculate the MFCC parameter. Meanwhile, endpoint detection is executed: the control element determines whether to output the MFCC parameter according to the output of the speech endpoint module. Data pass through the module in pipeline mode, which enhances the system processing speed.

3.2 Design of Vector quantization module hardware

The vector quantization module hardware is designed as shown in Fig. 4. The order number (the component index within a vector) is stored in counter1. The index of the current codeword is stored in counter2. The index of the nearest codeword is stored in register1, and the smallest distance found so far, against which the accumulator output is compared, is held in register2. The value of the distance between ROM (codebook) and RAM (MFCC) is stored in the address module. The work flow is as follows:

1) Under the control of the controller, counter1 starts counting.
The MFCC of each frame and the codebook are read, subtracted, and sent to the accumulator.
2) Compare the value of register2 with the output of the accumulator: if the output of the accumulator is larger than the value of register2, the controller stops the computation for the current codeword and moves on to the next codeword. Counter1 and the accumulator are cleared. The value of counter2 is increased by 1.
3) If the output of the accumulator is less than the value of register2 when the value of counter1 is 12, the output of the accumulator is stored in register2 and the current value of counter2 is stored in register1. In this way the index of the nearest codeword and its distance are renewed. Counter1 and the accumulator are cleared. The value of counter2 is increased by 1.
4) Repeat the above process until the value of counter2 is 256. Then the vector quantization of a speech frame is accomplished.

3.3 Design of HMM recognition module hardware

A 4-state left-to-right HMM without skipping is adopted in this paper. The design of the HMM recognition module hardware is shown in Fig. 5. FSM is the controller of the state machine. The observation sequence is stored in RAM O. The value of the initial probability is stored in RAM P. The state transition probability A is stored in RAM A. The output probability B is stored in RAM B. The addresses of RAM A and RAM B are generated by GENaddrA and GENaddrB respectively. CurrentMin is used for preserving the smallest probability among the recognition models up to the current model, and CounterIndex is used for saving the model label of the smallest
probability. The key point of the Viterbi algorithm is computing the log-domain score $\delta'_{t+1}(j)$ via equation (9); for this reason, the PE unit has been designed to calculate $\delta'_{t+1}(j)$ in this paper. As shown in Figure 6, the PE unit consists of three adders and two data selectors. First, it calculates the values of $\delta'_t(j) + a'_{j,j}$ and $\delta'_t(j-1) + a'_{j-1,j}$, the only two predecessors allowed by a left-to-right model without skipping. Second, it chooses the smaller value through the data selector and adds $b'_j(O_{t+1})$ to get $\delta'_{t+1}(j)$. In the initial state, when the index equals 1, it only needs to compare the value of $\delta'_t(j) + a'_{j,j}$ with $\pi'_j$ and take the smaller value as the smallest value.

4. Implementation and Results

The entire voice training and recognition process was carried out using the Stratix II EP2S60 DSP development board as the hardware platform of the voice processing module. Fig. 7 is the RTL view. The voice signal was acquired through microphones and a PC-in tape recorder. The sample rate was 11.025 kHz, and the sample precision was 16 bits. 50 samples were collected for each Mandarin digit from 1 to 10 as the experiment subjects. The experimental results are shown in Table 1. The average recognition accuracy for speaker-independent Mandarin digits reaches 94% and the average recognition time is 0.669 s in this system, which meets the recognition rate and real-time requirements.

5. Conclusions

In this paper, an FPGA-based Hidden Markov Model speech recognition system was designed. It completes the acquisition of voice by microphone and PC-in tape recorder and the generation of the codebook and training data. In the system, the MFCC feature vector is calculated, quantized and recognized by the Viterbi algorithm. In the HMM recognition, the traditional Viterbi algorithm was improved to enhance the recognition speed, so that the system is able to meet the needs of real-time voice recognition and the requirements on recognition accuracy.

Table 1. Experiment result

Number            1     2     3     4     5     6     7     8     9     10
Correct rate (%)  96    94    94    92    96    92    94    92    96    94
Time (s)          0.67  0.69  0.65  0.70  0.66  0.65  0.68  0.65  0.64  0.70
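The two averages quoted in Section 4 can be reproduced from the rows of Table 1. The short check below is only a sketch: the variable names are illustrative, and the time values are assumed to be in seconds, the unit used in the running text.

```python
# Values transcribed from Table 1, one entry per digit.
correct_rate_percent = [96, 94, 94, 92, 96, 92, 94, 92, 96, 94]
recognition_time_s = [0.67, 0.69, 0.65, 0.70, 0.66, 0.65, 0.68, 0.65, 0.64, 0.70]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


mean_rate = mean(correct_rate_percent)
mean_time = mean(recognition_time_s)
print(f"average correct rate: {mean_rate:.1f} %")
print(f"average recognition time: {mean_time:.3f} s")
```

Both results agree with the figures reported in the text (94% and 0.669 s).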

Figure 1. Speech recognition algorithm flow

Figure 2. MFCC feature analysis algorithm flow (signals: s(n), X(k), S(m), c(n))

Figure 3. MFCC feature analysis hardware structure

Figure 4. Vector quantization hardware structure (blocks: Controller, Counter1, Counter2, Register1, Register2, GEN addr, ROM (codebook), RAM (MFCC), SUB, MUL, ACCU, MIN?)
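The work flow of the vector quantization module in Section 3.2 amounts to a nearest-codeword search that abandons a codeword as soon as its partial distortion exceeds the best one found so far. The following Python sketch mirrors that flow in software. It is not the authors' hardware description: the function name `quantize_frame` and its variables are hypothetical, the running sum stands in for the accumulator, `best_distortion` for the value compared against the accumulator output, and the initial "infinite" distortion is an assumption, since the text does not state how the comparison register is initialized.

```python
def quantize_frame(feature, codebook):
    """Return (index, distortion) of the codeword nearest to `feature`.

    Squared-error distortion is accumulated component by component, and a
    codeword is abandoned once its partial sum exceeds the best distortion
    found so far. All vectors are assumed to have the same length.
    """
    best_index = -1
    best_distortion = float("inf")  # assumed initial value of the comparison register
    for index, codeword in enumerate(codebook):   # codeword counter
        accumulated = 0                            # accumulator cleared
        abandoned = False
        for x, y in zip(feature, codeword):        # component counter
            diff = x - y                           # SUB
            accumulated += diff * diff             # MUL and ACCU
            if accumulated > best_distortion:      # cannot beat the current best
                abandoned = True
                break
        if not abandoned and accumulated < best_distortion:
            best_distortion = accumulated          # renew the smallest distortion
            best_index = index                     # renew the nearest index
    return best_index, best_distortion


# The third codeword is abandoned after its first component (9 > 1).
print(quantize_frame([1, 2], [[0, 0], [1, 1], [4, 4]]))  # (1, 1)
```

The early exit does not change the result of an exhaustive search; it only skips additions that cannot lead to a smaller distortion, which is why the comparison sits inside the accumulation loop.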

Figure 5. HMM recognition hardware structure (blocks: FSM, Frame Counter, Model Counter, State Counter, Counter index, RAM O, GEN addrB, GEN addrA, Current Min, RAM P, RAM A, RAM B, Buffer, PE1, PE2, PE3, PE4, MIN?)

Figure 6. Processing element (two MUX blocks; inputs: accumulated scores, transition terms, initial term and output term)

Figure 7. RTL view
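The min-sum recursion of equations (8) and (9), which the processing elements of Figure 6 evaluate for a left-to-right HMM without skipping, can be illustrated in software. The Python sketch below is written under stated assumptions and is not the authors' implementation: the names `to_cost`, `min_sum_score`, `recognize`, the fixed-point `scale` and the saturation constant `INF_COST` are all hypothetical. Path backtracking (equation (7)) is omitted, because choosing among word models only needs the final score.

```python
import math

INF_COST = 10**9  # assumed stand-in for the cost of a zero probability


def to_cost(p, scale=256):
    """Map a probability to a non-negative integer cost, round(-scale * ln p)."""
    if p <= 0.0:
        return INF_COST
    return round(-scale * math.log(p))


def min_sum_score(pi_cost, stay_cost, move_cost, emit_cost, observations):
    """Smallest accumulated cost of one left-to-right model without skips.

    pi_cost[j]      cost of starting in state j
    stay_cost[j]    cost of the self transition j -> j
    move_cost[j]    cost of the transition j-1 -> j (move_cost[0] is unused)
    emit_cost[j][k] cost of emitting symbol k in state j
    """
    n = len(pi_cost)
    # Initialization: start cost plus output cost of the first symbol.
    delta = [pi_cost[j] + emit_cost[j][observations[0]] for j in range(n)]
    # Recursion over the two allowed predecessors, j and j-1.
    for symbol in observations[1:]:
        new_delta = []
        for j in range(n):
            best = delta[j] + stay_cost[j]
            if j > 0:
                best = min(best, delta[j - 1] + move_cost[j])
            new_delta.append(best + emit_cost[j][symbol])
        delta = new_delta
    # Termination: the smallest cost over all states.
    return min(delta)


def recognize(models, observations):
    """Return (model index, score) of the model with the smallest cost."""
    scores = [min_sum_score(*model, observations) for model in models]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Two toy 2-state models with integer costs chosen by hand.
model_a = ([0, INF_COST], [1, 0], [0, 2], [[1, 5], [4, 1]])
model_b = ([0, INF_COST], [1, 0], [0, 2], [[5, 1], [1, 4]])
print(recognize([model_a, model_b], [0, 1]))  # (0, 4)
```

In this sketch the loop in `recognize` plays the part that the text assigns to CurrentMin and CounterIndex: it keeps the smallest score seen so far together with the label of the model that produced it. Working with integer costs keeps every step an addition or a comparison, which is the motivation given for the logarithmic form.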