FiliText: A Filipino Hands-Free Text Messaging Application

Jerrick Chua, Unisse Chua, Cesar de Padua, Janelle Isis Tan, Mr. Danny Cheng
College of Computer Studies, De La Salle University - Manila
1401 Taft Avenue, Manila
(63) 917 5120271
jerrick.chua10@gmail.com, unisse.chua@gmail.com, capidepadua@gmail.com, janellesytan@gmail.com, danny.cheng@delasalle.ph

ABSTRACT
This research aims to create a hands-free text messaging application that recognizes the Filipino language and allows users to send text messages using speech. Other developers may use this research as a basis for further study of Filipino speech recognition and its application to Filipino text messaging.

Keywords
Speech Recognition, Filipino Language, Text Messaging

1. INTRODUCTION
Texting while driving has been a problem in most countries since the late 2000s. The Philippine National Police (PNP) reported about 15,000 traffic accidents in 2006, an average of 41 accidents per day. Most of these accidents were attributed to error on the part of the driver. Additionally, traffic accidents caused by cellphone use while driving showed the highest increase among the causes of traffic accidents [2].

According to Bonabente (2008), "The Automobile Association Philippines (AAP) has called for an absolute ban on use of mobile phones while driving, saying it was the 12th most common cause of traffic accidents in the country in 2006." The AAP said that using cell phones while driving, even with hands-free sets, could impair the driver's attention and lead to accidents.

An existing software application that lets people control their phones by voice is Vlingo. It is an intelligent voice application that does much more than allow users to text while driving [3]. It is also a multiplatform application, available for Apple, Android, Nokia, BlackBerry, and Windows Mobile.
Another application, VoiceText, was developed at Clemson University. It allows drivers to send text messages while keeping their eyes on the road. Drivers using VoiceText put their mobile phones in Bluetooth mode and connect them to the car. Through the car's speaker system or a Bluetooth headset, drivers are able to give a voice command and deliver a text message.

StartTalking is another existing application, from AdelaVoice, that allows the user to initiate, compose, review, edit, and send a text message entirely by voice command. However, this application is only available for Android 2.0 and above [4].

Other similar applications exist, and they all share the same purpose: to help reduce the number of car accidents caused by distracted driving.

2. SIGNIFICANCE OF RESEARCH
Western countries have started to develop hands-free texting applications that have helped reduce the number of car accidents caused by texting while driving, and some of these applications can understand Chinese. However, the Philippines, which is considered the text capital of the world, still has no such application to keep drivers from texting while driving, mainly because existing applications do not support the Filipino language.

There are party-lists and organizations that support a ban on mobile phone use while driving. The Buhay party-list filed a bill seeking to penalize persons caught using their mobile phones on the road. The cities of Manila, Makati, and Cebu have banned the practice on paper, but the ban has not been properly enforced [1]. Even where a law bans Filipinos from using their mobile phones while driving, it has not been strictly implemented, and there are only a few ways of knowing whether a person is really following it.
The development of a local version of an existing hands-free text messaging application, Vlingo InCar, delivers an alternative for Filipinos. This service aims to keep the driver's hands on the wheel and the driver's attention solely on the surroundings. The Philippines, known as the Texting Capital of the World, may have fewer traffic accidents in the future if drivers are able to use hands-free mobile phones on the road rather than having to glance at and read a text message from a contact.

Hands-free text messaging is not only helpful in keeping drivers from using their hands to text while driving. It may also be used by physically disabled individuals with normal speech, by people who are used to multitasking, or in cases where a person needs both hands to perform an activity. One example is busy businessmen and women who need to do many things in a short amount of time because of their workload. Such an application would help in their daily routine because they would no longer need their hands to send an urgent message to a colleague while attending to other urgent matters.

Alongside this application, the research sheds light on speech recognition for the Filipino language, a topic that lacks research and in-depth analysis. Once deeper studies have been conducted, this work could serve as a stepping stone for future studies involving Filipino speech recognition.

Proceedings of the 8th National Natural Language Processing Research Symposium, pages 58-63, De La Salle University, Manila, 24-25 November 2011

3. RELATED LITERATURE

3.1 Filipino Text Messaging Language
Manila's lingua franca was used as the base for the national language of the Philippines, Filipino. It is commonly used in the urban areas of the country and is spreading fast across the rest of it [6]. Tagalog is the structural base of the Filipino language and was commonly spoken in Manila and the provinces of Rizal, Cavite, Laguna, and many more.

According to a study by Shane Snow, the Philippines is still considered the Text Capital of the World. Because the Short Message Service (SMS) limits a message to 160 characters, people learned to shorten what they wanted to say, a practice now referred to as text speak. One simple way of shortening a message is to take out all the vowels; however, this does not work for some words because words with similar consonant sequences become ambiguous. Phonetics, or how a word sounds, also plays a role in how Filipino texters shorten messages [7]. For example, "Dito na kami" becomes "D2 n kmi" and "Kumain na ako" becomes "Kumain n ko".

3.2 Speech to Text Libraries
Speech-to-text systems are already available as desktop applications, and some of these systems release their APIs and/or libraries for those who want to build new applications on top of them. Some of these systems are CMUSphinx, Android Speech Input, Java Speech API, and SpinVox Create.

Among the available APIs and libraries, CMUSphinx is the most appropriate. CMUSphinx provides the CMUSphinx Toolkit, which comes with various tools for building speech applications. These include a recognizer library, a support library, language model and acoustic model training tools, and a decoder for speech recognition research. It also has a library for mobile support called PocketSphinx.
CMUSphinx can also generate its own pronunciation dictionary using an existing dictionary as a basis, but the pronunciation generation code only supports English and Mandarin.

3.3 CMU Sphinx-4
Sphinx-4 is a Java-based, open source automatic speech recognition system [8]. Sphinx-4 is made up of three core modules: the FrontEnd, the Linguist, and the Decoder.

The Decoder module is the central module. It takes in the output of the FrontEnd module and the Linguist module, generates its results from them, and passes the results to the calling application. The Decoder module has a single sub-module, the SearchManager, which it uses to recognize a set of frames of features. The SearchManager is not limited to any single search algorithm, and its capabilities are further extended by the design of the FrontEnd module.

The FrontEnd module is responsible for the digital signal processing. It takes in one or more input signals and parameterizes them into features, which it then passes to the Decoder module.

Finally, the Linguist module is responsible for generating the SearchGraph. The Decoder module compares the features from the FrontEnd module against this SearchGraph to generate its results. The Linguist module is made up of three submodules: the AcousticModel, the Dictionary, and the LanguageModel. The AcousticModel is responsible for the mapping between units of speech and their respective hidden Markov models. The LanguageModel provides implementations that represent the word-level language structure. The Dictionary dictates how words in the LanguageModel are pronounced.

CMU Sphinx can also use a language model made by a user. With such a language model, users can create a grammar suitable for their own language, and together with an acoustic model they can fully utilize a language model patterned after their native language.

4. SYSTEM DESIGN

4.1 Overview
FiliText is an application for desktop computers designed especially for Filipinos. It serves as a stepping stone for future developers to create a hands-free texting application for mobile phones.

FiliText accepts audio files, specifically in the Waveform Audio File Format (WAV), as input and processes them through a speech recognition API to convert the spoken message into text. The conversion is performed after the user has indicated that the voice input is complete. The application produces two outputs. The first output is the converted message with the proper and complete spelling in Filipino. The second, optional output is a compressed version of the text suitable for SMS, since most cell phone carriers allow only up to 160 characters per message.

4.2 Architecture
The system begins by gathering input, a spoken message, through the input module. The input module passes the unprocessed spoken message to the Sphinx-4 module, configured for recognizing Filipino. Sphinx-4 then passes the resulting text-based message to the message shrinking module, which applies common methods of reducing word length. Finally, the shrunken, text-based message is passed to the output module, which displays it to the user.

Figure 2. Architectural Design of FiliText

The system relies on Sphinx-4 to convert the spoken message into its text form. Because Sphinx-4 is highly configurable, a speech recognition module does not need to be coded from scratch. Instead, the Sphinx-4 module will be trained and configured to recognize informal Filipino.

The input first passes through the FrontEnd module, which handles the cleaning and normalizing of the input. Little effort will be placed into configuring and optimizing the FrontEnd as it deals with digital signal processing.

The Linguist module creates the SearchGraph, which the Decoder module compares the input against in order to generate its results. The Linguist module will be the most heavily configured of the three as it contains the hidden Markov models, the list of possible words and their respective pronunciations, and the acoustic models of phonemes. Sphinx-4 does not ship with any of the files needed to understand Filipino, so the dictionary, the acoustic models, and the language models will be created by the proponents using the tools provided by the CMU Sphinx group.

The language model will be created using a compilation of Filipino text messages, newscast transcripts, Facebook posts, and Twitter feeds. These will be placed into a text file with the format <s> text n </s>. A vocabulary file, a text file listing all Filipino words used, will also be created and used to generate the language model. The vocabulary file will not include names and numbers. These two files will be used by the CMU-Cambridge Language Modeling Toolkit (CMUCLMTK) to create an n-gram language model. Besides being used in the application itself, the language model is also needed to create the acoustic model.

The AcousticModel submodule of the Linguist module will be trained using SphinxTrain, a tool also created by the CMU Sphinx group for generating the acoustic models a system will use. To train the acoustic model, recordings from different speakers will be compiled and each audio file will be transcribed. Each recording will be a WAV file sampled at 16 kHz, 16-bit mono, and segmented to be between 5 and 30 seconds long [5].
The set of speakers will include males and females 16 years of age or older.

The Decoder module compares the processed input against the SearchGraph produced by the Linguist to produce its results. The Decoder's SearchManager sub-module will be configured to use the SimpleBreadthFirstSearchManager, an implementation of the frame-synchronous Viterbi algorithm.

The message shrinker module will use word alignment, a statistical machine translation technique, to shorten the output of the Sphinx-4 module while keeping it understandable. The output of this module will be a text of at most 160 characters, unless the module reports that the text cannot be made shorter than 160 characters. The shortened message will then be sent to the output module, which displays it to the user.

4.3 Customization of Sphinx-4
Because the Sphinx-4 documentation specifies the steps for creating the language model and acoustic model of a new language, it was fairly easy to create a prototype. The challenge in customizing Sphinx-4 for a totally different language is gathering all the recordings and having them trained by SphinxTrain.

The initial task was to gather audio recordings from different speakers that together cover all the phonemes of Filipino. The audio files must be 16-bit mono at 16 kHz and must not be shorter than 5 seconds, to aid the accuracy of the acoustic model training.

After all the speech recordings are gathered, they are placed in a folder that is processed with CMUCLMTK to create the dictionary of used words and the list of phonemes it was able to detect. Once this has been run through the language modeling toolkit, the data is ready for training under SphinxTrain, which creates the acoustic model the application uses to understand the words uttered by the end user.
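The recording requirements above (16 kHz, 16-bit, mono, at least 5 seconds and at most 30 seconds) can be checked before training. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_training_wav` and its parameters are hypothetical and are not part of Sphinx-4 or SphinxTrain.

```python
import wave


def check_training_wav(path, rate=16000, sample_width_bytes=2, channels=1,
                       min_seconds=5.0, max_seconds=30.0):
    """Return a list of problems found in a WAV file; an empty list means it conforms.

    The defaults mirror the requirements stated in the text:
    16 kHz sample rate, 16-bit samples (2 bytes), mono, 5 to 30 seconds long.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, "
                            f"expected {sample_width_bytes * 8}-bit")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        # Duration in seconds is the frame count divided by frames per second.
        duration = wav.getnframes() / wav.getframerate()
    if not (min_seconds <= duration <= max_seconds):
        problems.append(f"duration is {duration:.2f} s, "
                        f"expected {min_seconds} to {max_seconds} s")
    return problems
```

Running such a check over every file in the recordings folder before training would surface format mismatches early, such as the sample-rate problem described later in Section 6.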
4.4 Data Collection
The corpus the proponents will use is the Filipino Speech Corpus (FSC) created by Guevarra, Co, Espina, Garcia, Tan, Ensomo, and Sagum of the University of the Philippines - Diliman Digital Signal Processing Laboratory in 2002. The corpus contains audio files and matching transcription files, which will be used to train the acoustic model and the language model for the Filipino language. These trained models will then be used for recognition. For the language model, contemporary sentences were also gathered from social networking sites such as Facebook and Twitter.

4.5 Training the Acoustic and Language Models
The CMU Sphinx project provides tools for creating a new acoustic model for Sphinx-4. The required files are placed in a single folder, which is treated as the speech database. Sound recordings are placed into the directory wav/speaker_n. The dictionary, the mapping of words to their respective pronunciations, used by SphinxTrain will be the same as the one used in the implementation of Sphinx-4.

A single text file will be created to hold the transcriptions of each recording. Each entry in the text file must follow this format: <s> transcription of file n </s> (file_n). A file with the filenames of all the sound recordings must also be present, as it is used to map the files to the transcriptions. The ARPA language model will also be used by SphinxTrain to generate the acoustic models. A phoneset file, a file listing each phoneme tag, will be needed as well. The filler dictionary will be created to include only silences. These will all be used by SphinxTrain to generate the acoustic model. All of these files, other than the audio recordings, will be placed inside a folder labeled etc.

Because the FSC recordings are 25 to 35 minutes long each, the proponents had to segment each file into the specified 5 to 30-second length.
They were able to automate the process by using SoX, an open source audio manipulation tool. This tool was able to segment the sound files according to the existing transcriptions that came with the speech corpus. After the sound files were segmented, the filenames were written to a fileids file and the transcriptions of the sound files were compiled into a single file, ready for training.
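The segmentation and bookkeeping steps described above can be sketched as follows. This is an illustration under stated assumptions, not the proponents' actual script: the segment tuples `(start_sec, end_sec, text)`, the file naming scheme, and the helper names are hypothetical. The only external behavior relied on is SoX's `trim` effect, which takes a start position followed by a length. The function builds the commands without executing them.

```python
def sox_trim_command(source_wav, segment_wav, start_sec, end_sec):
    """Build (but do not run) a SoX command that cuts one segment out of a recording."""
    length = end_sec - start_sec
    return ["sox", source_wav, segment_wav, "trim", f"{start_sec:g}", f"{length:g}"]


def build_training_lists(speaker, segments):
    """Create fileids entries, transcription entries, and SoX commands for one speaker.

    `segments` is a list of (start_sec, end_sec, text) tuples taken from the
    corpus transcriptions (hypothetical structure). The naming scheme
    <speaker>_<index> is an assumption made for this sketch.
    """
    fileids, transcripts, commands = [], [], []
    for index, (start, end, text) in enumerate(segments, start=1):
        file_id = f"{speaker}_{index:04d}"
        # fileids entries are paths relative to the wav folder, without extension.
        fileids.append(f"{speaker}/{file_id}")
        # Transcription format from the text: <s> transcription </s> (file_id)
        transcripts.append(f"<s> {text.strip()} </s> ({file_id})")
        commands.append(
            sox_trim_command(f"{speaker}.wav", f"wav/{speaker}/{file_id}.wav", start, end)
        )
    return fileids, transcripts, commands
```

Each command list could be passed to `subprocess.run` once SoX is installed, and the two text lists written to the fileids and transcription files inside the etc folder.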

The language model needed for the etc folder was created using the transcription file and CMUCLMTK. The phonetic dictionary was also created with the aid of the language modeling toolkit because, in the process of creating the language model, the toolkit first creates a dictionary file with all the words in the transcription file. In the phonetic dictionary that the proponents created, each letter of a Filipino word serves as its own phone. According to the acoustic model training documentation of SphinxTrain, this approach is used when no phoneme book is available, and it gives very good results [5].

Figure 3. Phonetic Dictionary Sample

The folder structure must be followed because the training process is controlled by Perl scripts that set up the rest of the training binaries and configuration files. Before training starts, the training configuration file (sphinx_train.cfg) must be edited according to the size of the speech database to be trained. The variables that must be considered before training are the model parameters: the number of tied states (senones) and the number of densities.

Figure 4. Approximation of Senones and Number of Densities

The training process includes, but is not limited to, computing the features of the audio files, training the context-independent models and the context-dependent untied models, building decision trees, pruning the decision trees, and finally training the context-dependent tied models.

4.6 Mobile Application
A desktop application is very different from a mobile application because of the limitations of mobile devices in storage capacity, processing speed, and more. When moving the application to a mobile device, it would be better if the whole application were smaller while still performing similarly to the desktop application. This is a challenge, since the application needs the acoustic model, the language model, and the other linguistic models required to recognize the spoken text.

Sphinx-4 also has a mobile counterpart called PocketSphinx, which is usually used for mobile applications that require speech recognition. It has been used to develop applications for the Apple iPhone before [5].

5. TEST RESULTS
The proponents trained three sets of acoustic and language models using 20 speakers, 40 speakers, and 60 speakers from the FSC. The training was split in this way to see whether accuracy would improve as the amount of training data increased.

The proponents conducted two types of testing: controlled and uncontrolled. Controlled testing made use of existing recordings from the speech corpus, while uncontrolled testing was done with random people who were not part of the corpus.

To determine the accuracy of the system, the generated text is compared to the correct transcription of the recording. The following formula is used to obtain the accuracy rate of the system:

\[ \text{Accuracy Rate} = \frac{\text{matching\_words}}{\text{total\_words\_in\_transcription}} \times 100 \]

Controlled testing was done for all three trained sets, and the accuracy for each speaker per set is shown in Figure 5.

Figure 5. Accuracy Comparison for 3 Sets

It can be seen that the accuracy for the 40-speaker set dropped, but this was because it lacked training. For the 60-speaker set, the variables were adjusted to fit the size of the training data, which in turn gave better results than the 20-speaker set.

Figure 6 shows the average accuracy rate of each set in controlled testing, and it is evident that the accuracy of the 60-speaker set increased compared to the 20-speaker set. The mean accuracy rate in controlled testing was 45% for the 20-speaker set, 43.25% for the 40-speaker set, and 58.32% for the 60-speaker set.

Figure 6. Average Accuracy Rate for Controlled Test
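The accuracy rate used in Section 5 can be computed with a short script. The paper does not state how matching words are counted, so the sketch below makes an assumption: words are compared case-insensitively and counted as matching only when they appear in the same relative order in both texts, using Python's standard `difflib.SequenceMatcher`. The function name `accuracy_rate` is hypothetical.

```python
from difflib import SequenceMatcher


def accuracy_rate(transcription, recognized):
    """Return matching_words / total_words_in_transcription * 100.

    Assumption: a word "matches" when it is part of an in-order alignment
    between the reference transcription and the recognizer output.
    """
    reference = transcription.lower().split()
    hypothesis = recognized.lower().split()
    if not reference:
        raise ValueError("the reference transcription must contain at least one word")
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    # Each matching block covers `size` consecutive words shared by both sequences.
    matching_words = sum(block.size for block in matcher.get_matching_blocks())
    return matching_words / len(reference) * 100
```

Under this assumption, an expected output of "Maganda ba yun palabas" recognized as "Maganda bayong palabas" scores 50, since two of the four expected words are matched. A different matching rule (for example, ignoring word order) would give different figures, so the numbers reported in the paper should not be assumed to come from this exact procedure.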

For the uncontrolled testing, the proponents built the language model from the transcriptions from UP Diliman and contemporary sentences gathered from different social networking sites. The proponents gathered 20 speakers, 10 male and 10 female, to test the system with sentences that are commonly used in daily conversation. Using the trained data with the 120-speaker set and the new language model, the system attained an average accuracy of 69.67% and an error rate of 30.33%.

Figure 7. Accuracy and Error Rate for Uncontrolled Test - Male

Figure 8. Accuracy and Error Rate for Uncontrolled Test - Female

For better results, speakers are advised to speak in a clear, loud voice and to avoid mispronouncing words. Speakers should also speak at a slower pace so that each word is distinct from the next and two different words are not run together. The accuracy of the system also drops when there is too much background noise. Table 1 shows actual results produced by the system.

Table 1. Sample Output

Expected Output             Actual Output
Nandito ako                 Nandito ako
Maganda ba yun palabas      Maganda bayong palabas
Tumatawag na naman siya     Tumatawag nanaman siya
Pauwi ka na ba              Pauwi kalaban

5.1 Creating the Mobile Application
Porting the existing desktop speech recognition application to an Android device proved to be a challenge. Since PocketSphinx, the mobile counterpart of Sphinx-4, is not yet well documented for Android, the proponents had a hard time installing the required software and creating the application for the mobile device. Some demo-level sample applications were available online, but they were tricky to install on the mobile device.

Another challenge was the limitations of the phones available to the proponents. The demo application that was downloaded and modified was too heavy for the HTC Hero: when the application was opened, it would close itself without any warning. Another mobile phone available for testing was a Samsung Galaxy Ace. However, the proponents have yet to test the application on that device.

5.2 Improving Performance
As mentioned above, the accuracy for sentences not found in the language model was very low. The proponents are continuing research on how to improve the performance of the system. They will pursue two approaches: building a new language model from everyday conversational Filipino sentences, and training the acoustic model to be phoneme dependent.

The new language model would be built with the help of the Department of Filipino of De La Salle University. The department would advise the researchers on which sentences are considered conversational Filipino so that these can be included in the language model. An additional source of sentences for the language model would be a collection of existing text messages sent in Filipino. After this language model is completed, the system would be retrained with it and tested to see whether performance improved.

Training the acoustic model to be phoneme dependent would allow the system to use letter-to-sound rules to guess the pronunciation of unknown words. These unknown words are not found in the dictionary or the transcription files, which means that the system was not trained to recognize them. The letter-to-sound rules attempt to guess the pronunciation of unknown words based on the existing phonemes and words in the dictionary. Again, testing would be conducted on the different sets of trained models to see whether performance improves as the training data grows. The test results would also be compared to the existing test results to see whether unknown words are really recognized.
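To illustrate the idea of producing a pronunciation for a word that is missing from the dictionary, the sketch below applies the simplest rule consistent with the letter-based phonetic dictionary described in Section 4.5: each letter becomes one phone. This is not the letter-to-sound implementation of CMU Sphinx. The upper-case phone names, the `word PHONE PHONE ...` line layout, and the function names are assumptions, since Figure 3 is not reproduced here, and digraphs such as "ng" are deliberately not treated specially.

```python
def letters_as_phones(word):
    """Spell a word out as a sequence of single-letter phones (assumed convention)."""
    letters = [ch.upper() for ch in word if ch.isalpha()]
    return " ".join(letters)


def add_unknown_words(dictionary, words):
    """Add a letter-based pronunciation for every word not already in the dictionary.

    `dictionary` maps a lower-case word to its space-separated phone string.
    Existing entries are never overwritten.
    """
    for word in words:
        key = word.lower()
        if key not in dictionary:
            dictionary[key] = letters_as_phones(key)
    return dictionary


def format_dictionary(dictionary):
    """Render the dictionary as one 'word PHONES' entry per line, sorted by word."""
    return "\n".join(f"{word} {phones}" for word, phones in sorted(dictionary.items()))
```

A fallback of this kind only gives the recognizer an entry to search for. Whether unknown words are then actually recognized would still need to be measured with the tests described above.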

6. CHALLENGES ENCOUNTERED
Sphinx-4 is a very flexible system capable of performing many types of recognition tasks, and a lot of documentation is available to the public. However, since the tool was not made specifically for the Filipino language, many modifications had to be made.

The demonstration programs provided with Sphinx-4 had low accuracy. This was due to the noise and echo picked up during testing. This challenge was remedied by switching off noise reduction and acoustic echo cancellation in the microphone's settings.

The Sphinx documentation also specifies that recorded wave files should be 16-bit, 16 kHz mono to be used for training. However, the first set of recordings did not follow these specifications. Changing the sample rate of a given sound file only resulted in the audio being slowed down. This issue was resolved by changing the sample rate of the project itself instead of the files.

Although Sphinx-4 has been built and tested on the Solaris Operating Environment, Mac OS X, Linux, and Win32 operating systems, CMUCLMTK requires UNIX. This was remedied by using Cygwin, as recommended by the Sphinx team. The tool also requires UNIX-format line breaks. This issue was resolved by switching to UNIX line breaks using Notepad++.

The recorded messages used in training the system had little background noise. When the application is used in a normal environment, which has more noise and echo, the system's accuracy could drop.

The issues in creating a hands-free texting application also involve the users' style of texting and speaking. The type of keypad on a phone, whether QWERTY or T9, is a factor in how users type their text messages. There is also a difference between how a person composes a text and how he or she speaks in a conversation.

An issue seen in the results is the lack of relevance of the trained language model to the messages sent in text messaging.
Because of the low accuracy rate in the uncontrolled tests, the proponents believe that the language model is the main contributor to the drop in accuracy. The language model was patterned after the speech uttered by the speakers in the FSC, which includes stories and words. These are not sentences that are often used in everyday texting, which is why the sentences uttered by the speakers in the uncontrolled test were barely recognized by the system.

7. CONCLUSION
CMU Sphinx-4 is an effective tool for developing a desktop application that recognizes speech in the Filipino language and produces its text equivalent. The system is also able to use simple rule-based text shortening with regular expressions to give users the text speak equivalent of the output.

There are several recommendations that future developers may follow to improve the system. First, the data in the language model should be increased. This data may include English words, since most Filipinos do not use plain Tagalog in texting but mix English and Tagalog. Developers may also allow the user to place punctuation marks in sentences so that the result is easier to understand. Commands such as starting and ending the recording for speech recognition may also be added as a feature enhancement. Lastly, it is recommended that the application be ported to mobile devices running different operating systems such as Android and iOS.

8. ACKNOWLEDGMENTS
The researchers would like to thank the following: (1) Mr. Danny Cheng, for being an adviser and guiding the group throughout the research; (2) the panelists, Mr. Allan Borra and Mr. Clement Ong, for the remarks and suggestions they gave to further improve the research; and (3) the EEE department of the University of the Philippines Diliman, for allowing us to use their Filipino Speech Corpus (FSC) for our research.

9. REFERENCES
[1] Bonabente, C. L. (2008). Total ban sought on cell phone use while driving. Retrieved from http://newsinfo.inquirer.net/breakingnews/metro/view/20080920-
[2] CarAccidents. (2010). Philippines Crash Accidents, Driving, Car, Manila Auto Crashes Pictures, Statistics, Info. Retrieved from http://www.car-accidents.com/country-caraccidents/philippines-car-accidents.html
[3] Vlingo Incorporated. (2010). Voice to text applications powered by intelligent voice recognition. Retrieved from http://www.vlingo.com
[4] Adela Voice Corporation. (2010). Starttalking. Retrieved from http://www.adelavoice.com/starttalking.php
[5] CMU Sphinx. (2010). CMU Sphinx Speech Recognition ToolKit. Retrieved from http://cmusphinx.sourceforge.net
[6] Gonzalez, A. (1998). The Language Planning Situation in the Philippines. Journal of Multilingual and Multicultural Development, 55, 5-6.
[7] BBC h2g2. (2002). Writing Text Messages. Retrieved from http://www.bbc.co.uk/dna/h2g2/a70091
[8] CMU Sphinx Group. (2011). CMU Sphinx. Retrieved from http://cmusphinx.sourceforge.net/sphinx4/
[9] Cu, J., Ilao, J., Ong, E. (2010). O-COCOSDA 2010 Philippine Country Report. Retrieved from http://desceco.org/o-COCOSDA2010/o-cocosda2010-abstract.pdf