G2PIL: A GRAPHEME-TO-PHONEME CONVERSION TOOL FOR THE ITALIAN LANGUAGE Michele Salvatre Bindi 1, Vincenz Catania 2 Raffaele Di Natale 3 Ylenia Cilan 5 Antni Rsari Intilisan 4 Giuseppe Mntelene 6 Daniela Pann 7 1 DIEEI, University f Catania, Italy 2 DIEEI, University f Catania, Italy 3 DIEEI, University f Catania, Italy 4 DIEEI, University f Catania, Italy 5 DIEEI, Giuseppe Mntelene, Italy 6 A-Tn Crprate, Italy 7 DIEEI, University f Catania, Italy ABSTRACT This paper presents a knwledge-based apprach fr the grapheme t-phneme cnversin (G2P) f islated wrds f the Italian language. With mre than 7,000 languages in the wrld, the biggest challenge tday is t rapidly prt speech prcessing systems t new languages with lw human effrt and at reasnable cst. This includes the creatin f qualified prnunciatin dictinaries. The dictinaries prvide the mapping frm the rthgraphic frm f a wrd t its prnunciatin, which is useful in bth speech synthesis and autmatic speech recgnitin (ASR) systems. Fr training the acustic mdels we need an autmatic rutine that maps the spelling f training set t a string f phnetic symbls representing the prnunciatin. KEYWORDS ASR Acustic mdel, Phnetic Dictinary, Spken Dialg Systems 1.INTRODUCTION Autmatic Speech Recgnitin (ASR) has evlved significantly ver the past few years. Early systems typically discriminated islated digits, whereas current systems perfrm well at recgnizing spntaneus cntinuus speech. Huge effrt has been spent fr imprving wrd recgnitin rates, but the cre acustic mdelling has remained nt stable r nt available fr language like Italian, despite many attempts t develp better alternatives [1]. Handcrafted creatin f prnunciatin dictinaries fr speech prcessing systems can be time-cnsuming and expensive [2]. In ur previus wrk an Acustic Mdel fr Italian language frm speech crpra generated by Audibks has been created [3]. The new wrds btained thrugh speech crpra d nt cntain a crrespnding phnetic transcriptin. As prnunciatin dictinaries are s fundamental t speech prcessing systems, much care has t be taken t create a dictinary that is as free f errrs as pssible. Faulty prnunciatins in the dictinary may lead t incrrect training f the system and cnsequently t a system that des nt explit its full ptential. Flawed dictinary entries can riginate frm G2P cnverters with shrtcmings [4]. In this paper we DOI : 10.5121/ijnlc.2015.4103 31
present a graphemes t phnemes algrithm fr Italian language that autmatically generates crrect phnetic transcriptin. The paper is rganized as fllws. Sectin 2 presents the related Wrk. Sectin 3 presents the prpsed grapheme-t-phneme cnversin apprach fllwed by presentatin f experimental results in sectin 4. Finally, we cnclude in Sectin 5. 2. RELATED WORK Grapheme-t-Phneme (G2P) cnversin is an imprtant prblem related t Natural Language Prcessing (NLP), Speech Recgnitin and Spken Dialg Systems (SDS) develpment. The gal f G2P cnversin is t accurately predict the prnunciatin f a new input wrd. Fr example, t predict the wrd SPEECH the g2p generate the fllwing, SPEECH S P IY CH This prblem is straightfrward fr sme languages like Spanish r Italian, where prnunciatin rules are cnsistent. Fr languages like English and French hwever, incnsistent cnventins make the prblem much mre challenging [5]. There are many different appraches used fr the G2P cnversin prpsed by different researchers. Statistical mdels such as decisin trees, Jint- Multigram Mdel (JMM) [6], r Cnditinal Randm Field (CRF) [7] are used t learn prnunciatin rules. All these appraches invariantly assume access t prir linguistic resurces cnsisting f sequences f graphemes and their crrespnding sequences f phnemes [8]. Writing by hand, this rule is hard and very time cnsuming. The difficulty and apprpriateness f using G2P rules are very language dependent. Currently Sphinx-4 [9] uses a predefined dictinary fr mapping wrds t sequence f phnemes. Recently, Sphinx-4 is able t use trained mdels (based n machine learning algrithm) t map letters t phnemes and thus map wrds in a sequence f phnemes withut the need f a predefined dictinary. A predefined dictinary will be used t train the required mdels. Our apprach des nt use a training methd r initial dictinary, but grammatical rules nly. 3. G2P ALGORITHM This wrk prpses a nvel grapheme-t-phneme G2P cnversin apprach. One f the key issue when develping a G2P cnverter is hw t effectively learn/capture the relatin between phnemes and graphemes. Every step f this algrithm is based n a specific set f rules develped via linguistic engineering. The substring with which the wrd ends is cmpared with a list f available desinences. There is a list f desinences in which every desinence is written accrding t the fllwing frmat: vwel_r_cnsnant+desinence,categry,part_f_speech,phnetical_transcriptin vwel_r_cnsnant indicates if desinence must be preceded by a vwel, a cnsnant, r bth. This parameter is ptinal and can take 3 values: V (vwel) C (cnsnant) 32
VC (vwel + cnsnant) desinence is the analysed desinence, which is the string cmpared with the substring with which the wrd ends. The desinences has been btained frm [10]. Fr each desinence, inflected versins are btained by analysing the categry. categry is used t get the inflectins f each desinence. A letter represents each categry. There is a list f categries in which each categry is assciated with sme characteristics. Each characteristic is written accrding t the fllwing frmat (in which: s=substring, pt=phnetic transcriptin, i=inflected): categry: s 1,1 -pt 1,1 > si 1,1 -pti 1,1,, si 1,n -pti 1,n ; ; s k,1 -pt k,1 > si k,1 -pti k,1,, si k,n -pti k,n ; In general, if a desinence belngs t a certain categry and it ends in s k,1, t btain the inflected frm, s k,1 is remved frm the desinence and it is replaced with si k,1, pt k,1 is remved frm its phnetic transcriptin and it is replaced with pti k,1. This prceeding is applied fr all eventual n inflectins. Fr example, cnsider the D categry that is written as: D:gia-dZ i! a>gie-dz i! e;cia-ts i! a>cie-ts i! e; This means that if a desinence belngs t the categry D, if it ends in "-gia" and the crrespndent phnetic transcriptin ends in "-dz i! a ", the inflected frm is btained by remving "-gia" and adding "-gie " in the desinence and by eliminating " dz-i! a " and adding "- dz i! e " in the phnetic transcriptin. There is als the "I" categry fr desinences that d nt have inflected frms: I:; part_f_speech indicates the part f speech, is used because sme equal desinences endings have different accents emphasis depending n the part f speech. The used parts f speech are: V fr verbs N fr nuns and adjectives A fr ther phnetical_transcriptin is the phnetic transcriptin f the desinence. The desinences that can be recgnized can als cntain an accent, placed in a particular psitin (represented by the symbl "!"). The presence f an accent, resulting in the final transcriptin is ptinal. If n desinence has been identified, the wrd is analysed letter by letter t retrieve the transcriptin. In this case, if the wrd des nt include accents and require the presence f the accents in the transcript, the algrithm generates a set f pssible transcriptins with an emphasis n the different pssible psitins. The user will chse the crrect transcriptins. 33
Even when the presence f accents is nt required, the algrithm in sme cases can generate multiple pssible transcriptins fr a certain wrd and, in this case, the user chses the crrect transcriptins. In any case, a wrd can have multiple crrect transcriptins (example with an accent: Ankara - > a ng k! r a and a! ng k r a ; example withut accent: casale -> c a s a l e and c a z a l e ). The creatin f phnetic transcriptins analysing the wrd letter by letter is the main part f the algrithm. The Italian language uses a subset f the phnemes f the Internatinal Phnetic Alphabet (IPA). IPA is an alphabetic system used t represent the sunds in phnetic transcriptins. Fr urs phnetic transcriptin the Italian phnemes are used [11], except fr tw, in fact, there are tw ways t prnunce the vwel e and tw t prnunce the vwel. Nevertheless, they generate a very similar sund, s in this algrithm a unique way t prnunce the e and a unique way t prnunce the are used. Mrever, these sunds are s similar that different speakers can prnunce them in bth different ways accrding t their gegraphical rigin. The rules fr the algrithm are based n knwledge f the Italian language. In the Italian language, a specific sund crrespnds t each cmbinatin f characters. Fllwing, sme examples f sunds prduced by particular cmbinatins f letters are listed: c + a, +he, +hi,, u: a harsh sund is prduced (as in wrds cane, che, chil crda, culla ) that crrespnds t the phneme k. c + e, i: a sft sund is prduced (as in wrds ciel, cena ) that crrespnds t the phneme ts. g + a, +he, +hi,, u: a harsh sund is prduced (as in wrds gatt, ghett, ghir, gnna, guf ) that crrespnds t the phneme g. g + e, i: a sft sund is prduced (as in wrds gel, gir ) that crrespnds t the phneme dz gn + a, e, i,, u: as in the Italian wrd gnm, this sund des nt exist in English and crrespnds t the phneme J. h: n sund. gl + i, +e: as in the Italian wrd figli, this sund des nt exist in English and crrespnds t the phneme L. n +f, +v: as in the Italian wrds invece and infatti ; in English it crrespnds t the sund prduced in the wrd symphny. n +c, +g: as in the Italian wrd anche, in English it crrespnds t the sund prduced in the wrd sing. sc + e, i: as in the Italian wrds scena and scimmia ; in English it crrespnds t the sund prduced in sheep. The phneme is S. Anyway, it is pssible t cnsider the accents: fr this reasn, fr the stressed vwel the symbl! is added after the phneme. The algrithm prvides bth phnetic transcriptins, with and withut accent. It s up t the user t chse the crrect ne. All rules implemented in the tl are nt listed in this paper because they wuld exceed the page limit. The entire algrithm rules are implemented and listed in a c# VS2010 prject and is available at this link [12]. 34
PSEUDOCODE: Input: Output: Algrithm: wrd W = {c 1, c 2,, c i,, c n1 } where c is a character (example: CASALE = {c,a,s,a,l,e}) list f phnetic transcriptins LP = {P 1, P 2,, P i,, P n2 } (example: LP= {casale,cazale}) where P = {p 1, p 2,, p i,, p n3 } is a phnetic transcriptin and p i is a phneme (example P 1 ={c,a,s,a,l,e}, P 2 ={c,a,z,a,l,e}) FOR EACH c i W I(c i,r) = { c W : d(c,c i ) < r } IF( I(c i,r) generates (p x p y...) ) FOR EACH P i LP P x <- P i + p x LP_TEMP.add(P x ) P y <- P i + p y LP_TEMP.add(P y ) LP = LP_TEMP 4. EXPERIMENTAL RESULT The ptential f the prpsed apprach, we cnsidered the transcriptins cntained in the Lexicn f Festival Speech Synthesis System (TTS) [13]. A Lexicn in Festival is a subsystem that prvides prnunciatins fr wrds. It cnsists f three distinct parts: an addendum, cnsisting f manually added wrds; a cmpiled lexicn and a methd fr dealing with wrds nt present in any list [14]. The third field cntains the actual prnunciatin f the wrd. This is an arbitrary Lisp S expressin. In many f the lexicns distributed with Festival this entry has internal frmat, identifying syllable structure, stress markings and f curse the phnes themselves. In sme f ur ther lexicns we simply list the phnes with stress marking n each vwel. Festival, within its lexicn, includes mre than 440,000 wrds. Fr each wrd all the List Expressins are remved and nly the phnetic transcriptin is extracted. The cmparisn shwed that nly 24.634 ut f 440,000 wrds were transcribed in a different way. These results cnfirm the quality f the algrithm with an Errr Rate clse t 5,59%. 5. CONCLUSION This paper presents a knwledge-based apprach t G2P cnversin applied t highly inflected free-stress Italian Language. In the future, lexical stress extractin and filtering methds shuld imprve the g2p mdels. Furthermre, we may integrate a speech synthesis cmpnent int a dictinary building prcess fr accelerated and interactive editing f imprper phnemes. The surce cde and binary f this G2P tl are available at this link [12]. 35
ACKNOWLEDGEMENTS The authrs were supprted by the Sicilian Regin grant PROGETTO POR 4.1.1.1: "Rammar Sistema Cibernetic prgrammabile d interfacce a interazine verbale" (utilizzare la dizine degli altri paper). REFERENCES [1] A. Mhamed, G. E. Dahl, and g. E. Hintn, acustic mdeling usingdeep belief netwrks, IEEE trans. On audi, speech, and language prcessing, 2011. [2] Cheap, Fast and Gd Enugh: Autmatic Speech Recgnitin with Nn-Expert Transcriptin Sctt Nvtney and Chris Callisn-Burch Center fr Language and Speech Prcessing Jhns Hpkins University [3] V. Catania, S. M. Bindi, Y. Cilan, R. Di Natale, A.R. Intilisan, G. Mntelene. 2014."An Audibks-Based Apprach fr Creating a Speech Crpus fr Acustic Mdels".(in press) [4] Tim Schlippe, Wlf Quaschningk, Tanja Schultz, cmbining grapheme-t-phneme cnverter utputs fr enhanced prnunciatin generatin in lw-resurce scenaris. Cgnitive Systems Lab, Karlsruhe Institute f Technlgy (KIT), Germany [5] J. Nvak, N. Minematsu, and K. Hirse, WFSTbased Grapheme-t-Phneme Cnversin: Open Surce Tls fr Alignment, Mdel-Building and Decding, in Internatinal Wrkshp n Finite State Methds and Natural Language Prcessing, Dnstia-San Sebastian, Spain, July 2012. [6] V. Pagel, K. Lenz, and A. W. Black, Letter t sund rules fr accented lexicn cmpressin, in Prceedings f ICSLP, 1998. [7] M. Bisani and H. Ney, Jint-sequence mdels fr grapheme-t-phneme cnversin, Speech Cmmunicatin, vl. 50, n. 5, pp. 434 451, May 2008. [8] D. Wang and King. S, Letter-t-sund prnunciatin predictin using cnditinal randm fields, IEEE Signal Prcessing Letters, vl. 18, n. 2, pp. 122 125, 2011 [9] CMU SPHINX, G2P Cnversin, (last visited [23 July 2014]), http://cmusphinx.surcefrge.net/2012/08/gsc-1012-grapheme-t-phneme-cnversin-in-sphinx-4- %E2%80%93-prject-cnclusins/ [10] L. Canepari, Dizinari di prnuncia italiana, (last visited [23 July 2014]), http://www.dipinline.it/guida/dcs/dipi_terminazini_desinenze.pdf [11] Wikipedia, (last visited [23 July 2014]), http://it.wikipedia.rg/wiki/fnlgia_della_lingua_italiana [12] DIEEI Embedded System Lab, http://ctsds.dieei.unict.it/research/index.php/dwnlad [13] Festvx prject, (last visited [23 July 2014]), http://festvx.rg/ [14] Festival Lexicn tutrial, (last visited [23 July 2014]), http://festvx.rg/http://www.cstr.ed.ac.uk/prjects/festival/manual/festival_13.html Authrs Vincenz Catania received the Laurea degree in electrical engineering frm the Università di Catania, Italy, in 1982. Until 1984, he was respnsible fr testing micrprcessr system at STMicrelectrnics, Catania, Italy. Since 1985 he has cperated in research n cmputer netwrk with the Istitut di Infrmatica e Telecmunicazini at the Università di Catania, where he is a Full Prfessr f Cmputer Science. His research interests include perfrmance and reliability assessment in parallel and distribuited system, VLSI design, lw-pwer design, and fuzzy lgic. Daniela Pann received a Dr.Ing degree in Electrical Engineering and a Ph.D. in Telecmmunicatins Engineering frm the University f Catania, Italy, in 1989 and 1993, respectively. In 1989 she jined the Department f Infrmatics and Telecmmunicatins at 36
the University f Catania, where she is nw an Assciate Prfessr f Signal Thery. Her research interests are MAN architectures and prtcls, traffic management and perfrmance evaluatin in bradband netwrks, fuzzy lgic applicatin in the telecmmunicatins field. Raffaele Di Natale is a Cntract Researcher at the DIEEI at the University f Catania. He received his M.Sc. degree in cmputer science frm Catania University, Italy, in 1997 and the Ph.D. in biinfrmatics frm Catania University in 2012. His research interests include spken dialg systems, Ggle Andrid and data mining. Antni Rsari Intilisan is currently a PhD candidate with Telecm Italia Grant, University f Catania. His dctral wrk explres the Spken Language Understanding and Autmatic Speech Recgnitin techniques specifically fr Ne-Latin Languages. Giuseppe Mntelene received his Master Degree in Cmputer Engineering in 2013 frm University f Catania. Currently he is a Cntract Researcher at DIEEI, University f Catania. His research interests include Natural Language Prcessing, Machine Learning and Frecasting Mdels. Salvatre Michele Bindi is a Cntract researcher at the DIEEI at the University f Catania. His research interests include spken dialg systems and datamining and ggle andrid. Ylenia Cilan, is a sftware engineer at A-Tn. She cllabrated with the DIEEI at Univeristy f Catania as cntract researcher. Her research interests include spken dialg systems and mrphlgical engines. 37