High Speed Unknown Word Prediction Using Support Vector Machine For Chinese Text-to-Speech Systems

Juhong Ha, Yu Zheng, Gary Geunbae Lee
Department of CSE
POSTECH, Pohang
{miracle, zhengyu,

Abstract

One of the most significant problems in POS (Part-of-Speech) tagging of Chinese text is identifying the words in a sentence, since no blanks delimit them. Because it is impossible to pre-register every word in a dictionary, unknown words inevitably occur during this process, and they markedly affect the accuracy of the synthesized sound in a Chinese TTS (Text-to-Speech) system. In this paper, we present an SVM (support vector machine) based method that predicts unknown words from the result of word segmentation and tagging. To reach the processing speed a TTS system requires, we pre-detect the candidate boundaries of the unknown words before starting the actual prediction. We thus perform unknown word prediction in two phases, detection and prediction. Experimental results are very promising, showing high precision and high recall at high speed.

1 Introduction

In Chinese TTS, identifying the words and assigning correct POS (Part-of-Speech) tags for an input sentence are very important tasks. These steps have a considerable effect on Chinese-text-to-pinyin conversion. Correctly converted pinyin is essential because it provides important information for selecting a synthesis unit from a speech database. But since no blanks delimit the words in Chinese sentences, high-quality pinyin conversion requires high-quality word segmentation and POS tagging. However, a dictionary cannot include every new word needed for segmentation and tagging, even if it is generated from a very large corpus. Unknown word handling is therefore essential for correct pronunciation processing and for more accurate and natural TTS sound.

Various pronunciation conversion methods have been proposed for alphabetic languages, including English. These methods mainly use statistical patterns and rules to convert the pronunciation of unknown words. However, because most Chinese POS taggers split unknown words into individual characters, Chinese naturally requires an unknown word processing method that groups these individually split characters. Processing words that include Chinese polyphonic characters is also essential for correct pronunciation conversion. Pinyin conversion of Chinese polyphonic characters is a fairly complex problem, since there are no clearly distinguishable patterns.

We develop an unknown word processing method for Chinese person names, foreign transliterated names and location names among other proper nouns. To obtain the speed and performance that a TTS system needs, we present a two-phase unknown word prediction method. First, we pre-detect the candidate boundaries of the unknown words from the result of segmentation and tagging. Then we predict the unknown words using a support vector machine, which is among the best-performing machine learning methods.

This paper is organized as follows. Section 2 examines related methods for Chinese unknown word processing to compare with our method. Section 3 briefly introduces POSTAG/C¹ (Ha et al., 2002), a Chinese segmenter and POS tagger whose dictionary is trained automatically from a large corpus. In section 4, we explain a method that quickly chooses the candidate boundaries of unknown words in a sentence for high speed processing. In section 5, we propose a classification method that predicts the unknown words and assigns the correct POS tags using a Support Vector Machine. In section 6, we present experiments and analysis results for person names and location names. Finally, in section 7, we draw conclusions and propose future work.

2 Related Works

Word segmentation is the first step of Chinese text processing. However, a Chinese segmenter splits words that do not exist in its dictionary into individual characters, and this splitting can cause entirely wrong tags to be assigned in the POS tagging step. Research on the unknown word problem has therefore been essential in Chinese text processing, mainly to overcome the effects of this individual character splitting.

Chen and Ma proposed a statistical method for the problem (Chen and Ma, 2002). They automatically generated morphological rules and statistical rules from the Sinica corpus and used them to predict the unknown words. Their results show a good precision of 89% but a marginal recall of 68% for Chinese person names, foreign transliterated names and compound nouns.

Zhang et al. presented a Markov model based approach for Chinese unknown word recognition using role tagging (Zhang et al., 2002).
They defined a role set for every category of unknown words and recognized the unknown words by tagging with the role set using the Viterbi algorithm. They only provide recognition results for Chinese person names and foreign transliterated names: a precision of 69.88% and a recall of 91.65% for Chinese person names, and a precision of 77.52% and a recall of 93.97% for foreign transliterated names.

Goh et al. identified unknown words with a Markov model based POS tagger and an SVM based chunker using character features (Goh et al., 2003). Their experiments on a one-month news corpus from the People's Daily show a precision of 84.44% and a recall of 89.25% for Chinese person names and foreign transliterated names, a precision of 63.25% and a recall of 79.36% for organization names, and a precision of 58.43% and a recall of 63.82% for unknown words in general.

In this paper, we predict the unknown words using an SVM based method similar to Goh et al. (2003). However, a real-time voice synthesis system needs high-speed unknown word prediction. Therefore, we first extract the likely candidate boundaries where unknown words may occur in a sentence and then predict the words within these boundaries. Our method thus becomes a two-phase high speed processing method, as shown in figure 1.

Figure 1: Overall architecture of the proposed method

¹ POStech TAGger Chinese version

3 Word Segmentation and POS Tagging

In our research, we used a previously developed word segmentation and POS tagging system called POSTAG/C (Ha et al., 2002). POSTAG/C combines a word segmentation module based on rules and a dictionary with a POS tagging module based on an HMM (Hidden Markov Model). The word dictionary is acquired fully automatically from a POS-tagged corpus, and the system is highly portable, serving both GB and Big5 texts. The GB version achieves precision and recall above 95%. A detailed description is outside the scope of this paper.

4 Detection of the Candidate Boundary

Every module of a voice synthesis system should operate in real time. However, if we examine the whole input text from beginning to end to predict the unknown words, processing may become very slow. An efficient method matters even more given the slow speed of the SVM used in our research: SVM is among the best-performing machine learning methods, but slow training and prediction are its major shortcoming. To overcome the speed problem without losing accuracy, instead of examining the whole sentence, we detect only the candidate boundaries where unknown words may occur.

Like Chinese word segmentation systems in general, POSTAG/C outputs an unknown word as a string of contiguous single Chinese characters (hanzi). Therefore, we can use the boundaries where single Chinese characters appear consecutively as the first candidates for unknown words. Our approach is supported by studies showing that more than 90% of Chinese unknown words actually fall within such boundaries (Lv et al., 2000).
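
The first-candidate step described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `single_char_runs`, the `min_len` parameter, and the representation of the segmenter output as a plain list of token strings are all assumptions made for the example.

```python
from typing import List, Tuple


def single_char_runs(tokens: List[str], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) token index ranges, end exclusive, in which at
    least `min_len` single-character tokens appear consecutively.

    `tokens` is assumed to be the segmenter output for one sentence.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current run of single characters began
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            # A multi-character token closes the current run, if any.
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sentence.
    if start is not None and len(tokens) - start >= min_len:
        runs.append((start, len(tokens)))
    return runs


if __name__ == "__main__":
    tokens = ["我们", "请", "王", "小", "明", "发言", "了"]
    print(single_char_runs(tokens))  # [(1, 5)]
```

In this example the isolated single character at the end of the sentence is not reported, because a run must contain at least `min_len` consecutive single characters.
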
Beyond this, we extend the target boundary to increase the recall of boundary detection: we also include 2-character words that are adjacent to a single character and match the hanzi bigram patterns with more than a specified frequency. Our system can thus cover cases such as the one in figure 2.

Figure 2: Example of boundary detection including 2-character unknown words

We cannot use every sequence of single Chinese characters as a candidate boundary, because a single character is very frequently a word by itself. In our own statistics on Chinese newspaper text, 128,410 boundaries consisted of a series of single characters in the person name case, but only 16,955 of them actually included unknown words. To cope with these spurious cases, we select the candidate boundaries among the series of single characters by matching against pre-learned hanzi bigram patterns. These patterns are learned from the person names and location names extracted from the training data: we generate a pattern from every two characters that are adjacent in a person name or location name. Our system uses 34,662 person patterns and 15,958 location patterns. We select the boundaries that match more than one bigram pattern.

5 SVM-based Prediction of Unknown Word

We predict the unknown words from the output of the candidate boundary detection. For our experiments we use LIBSVM (Chang and Lin, 2003), a library for support vector machines. The kernel function is an RBF, whose parameters are tuned during training to generate the final model.

5.1 SVM Features

We use 10 features for SVM training, as listed in table 1.

Table 1: Features for support vector machine

location    features
i-2         character and position tag
i-1         character and position tag
i           character and position tag
i+1         character and position tag
i+2         character and position tag

i : current position
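The five-position window of Table 1 can be illustrated with a short Python sketch. It rests on assumptions that are not in the paper: the function name `window_features`, the feature key names, and the `<PAD>` symbol are hypothetical. The paper also does not say how position tags that are not yet known at prediction time are filled in, so the sketch simply takes the tag sequence as input and treats `None` as "not available".

```python
from typing import Dict, List, Optional

PAD = "<PAD>"  # hypothetical filler for positions outside the boundary or unknown tags


def window_features(chars: List[str], tags: List[Optional[str]], i: int) -> Dict[str, str]:
    """Collect the character and position tag at offsets -2..+2 around
    position i, giving the 10 features listed in Table 1."""
    feats: Dict[str, str] = {}
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < len(chars)
        feats[f"char[{offset:+d}]"] = chars[j] if inside else PAD
        tag = tags[j] if inside else None
        feats[f"tag[{offset:+d}]"] = tag if tag is not None else PAD
    return feats


if __name__ == "__main__":
    chars = list("王小明")
    tags = ["B-NR", None, None]  # only the first tag is known in this example
    for name, value in window_features(chars, tags, 1).items():
        print(name, value)
```

A real system would still have to map these symbolic features to the numeric vectors that an SVM library expects, for instance by one-hot encoding; that step is omitted here.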

Each character in the boundary predicts its own position tag (see section 5.2) using the lexical and position tag features of the previous two and next two characters. In addition, we use features that mark whether a character can occur in a Chinese family name or in a foreign transliteration, and whether it can be the last character of a location name. The family-name features are taken from the 200 family names most frequently used in China, and there are 520 foreign-transliteration features. We also use as features the 100 most frequent last characters of location names in our corpus.

Using individual characters as features is useful because the unknown words we must handle are contiguous single characters. Character-based features let the system predict the unknown words more effectively in this case, as shown in (Goh et al., 2003).

5.2 Candidate Boundary Prediction

We develop an SVM-based unknown word prediction method that operates on the output of the candidate boundary detection. We give each character a position tag and create the features used in training and testing. The prediction first assigns the presumed position tags to the characters in a candidate boundary. We then combine those characters according to their position tags and finally identify the whole unknown word.

During the unknown word prediction step, we use 4 classes of position tags to classify the characters: [B-POS], [I-POS], [E-POS] and [O], where POS is the POS tag of a word such as a person name or location name, and B, I and E indicate the position of the character in the word (B: Begin Position; I: Inside Position; E: End Position). O is the class of outside characters, which belong to none of the previous three classes. After the prediction step, we combine these characters into a single word.
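
The combination step can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' code: the function name `combine_by_position_tags` is hypothetical, tags are assumed to be strings such as "B-NR" or "O", and the choice to fall back to single untagged characters when a word never reaches its E tag is an assumption of the example.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, Optional[str]]  # (surface string, POS tag or None)


def combine_by_position_tags(chars: List[str], tags: List[str]) -> List[Word]:
    """Merge characters tagged B-POS, I-POS ..., E-POS into one word with tag POS."""
    words: List[Word] = []
    pending: List[str] = []  # characters of a word opened by a B tag
    pending_pos: Optional[str] = None

    def flush_unfinished() -> None:
        # A word that never reached its E tag falls back to single characters.
        words.extend((c, None) for c in pending)
        pending.clear()

    for ch, tag in zip(chars, tags):
        kind, _, pos = tag.partition("-")
        if kind == "B":
            flush_unfinished()
            pending.append(ch)
            pending_pos = pos
        elif kind in ("I", "E") and pending and pos == pending_pos:
            pending.append(ch)
            if kind == "E":
                words.append(("".join(pending), pos))
                pending.clear()
        else:
            # "O", or an I/E tag that does not continue an open word.
            flush_unfinished()
            words.append((ch, None))
    flush_unfinished()
    return words


if __name__ == "__main__":
    chars = list("请王小明发")
    tags = ["O", "B-NR", "I-NR", "E-NR", "O"]
    print(combine_by_position_tags(chars, tags))
```
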
Finally, we carry out some postprocessing using error correction rules such as the following:

\[ PT_i : [\text{O}],\quad PT_{i-1} : [\text{B-NR}],\quad PT_{i+1} : [\text{E-NR}] \;\rightarrow\; PT_i : [\text{I-NR}] \]

where $PT_i$ is the current position tag, $PT_{i-1}$ is the previous position tag, $PT_{i+1}$ is the next position tag, and NR is the POS for a person name. Figure 3 shows an example of the final result of our unknown word prediction.

Figure 3: Example of the SVM-based prediction

6 Experiments

6.1 Corpus and Preprocessing

In this section, we show the prediction results for Chinese person names, foreign person transliterations and Chinese location names. The corpus in our experiments is one month of news articles from the People's Daily. We divide the corpus into 5 parts and conduct 5-fold cross-validation. We delete all person names and location names from the dictionary to test the unknown word prediction performance. There are 17,620 person names and 24,992 location names. For more efficient experiments, we pre-processed the corpus: in the original People's Daily corpus, Chinese person names were split into the family name and the first name, and compound words were split into their component words, so we recombined those split words into single words. The dictionary was then generated from the pre-processed corpus.

6.2 Experiments and the Results

The experiments are divided into three parts. The first experiment shows how accurately our method selects the candidate boundaries of unknown words. Tables 2 and 3 show, for person and location names respectively, the reduction in the total number of boundaries to be recognized by the SVM and the possible loss of unknown word candidates after our boundary detection step.

Table 2: Reduction of the candidate boundaries (person)

                                                before     after    reduction rate
# of total boundary                             128,410    20,      %
# of boundary including actual unknown words    16,955     14,      %

Table 3: Reduction of the candidate boundaries (location)

                                                before     after    reduction rate
# of total boundary                             137,593    46,      %
# of boundary including actual unknown words    23,287     22,      %

As tables 2 and 3 show, even though a few real person names and location names are excluded from the candidates (13.23% and 3.05%), the number of total boundaries for the SVM to predict is drastically reduced, by 84.09% and 76.75% respectively. We confirmed through our experiments that those missing candidates do not affect the overall performance of the final SVM-based prediction.

Secondly, table 4 shows the speed gain from the candidate selection method. For the target test data of 160 sentences and 300 sentences, we obtain a speed improvement of more than 25 times.

Table 4: The gain of total prediction speed by using the candidate selection

                           candidate selection    prediction    total
                           time (ms)              time (ms)     time (ms)
160 sentences    before    -                      82,756        82,756
                 after     140                    2,930         3,
300 sentences    before    -                      171,          ,942
                 after     290                    6,980         7,270

Finally, we tested the overall performance of the SVM-based unknown word prediction on the result of the candidate boundary selection. We divided the test corpus into 5 parts and evaluated them by 5-fold cross-validation. The results are measured in terms of precision, recall and F-measure, which are defined in equations (1), (2) and (3) below:

\[ \text{precision} = \frac{\#\text{ of correctly predicted unknown words}}{\#\text{ of total predicted unknown words}} \tag{1} \]

\[ \text{recall} = \frac{\#\text{ of correctly predicted unknown words}}{\#\text{ of total unknown words}} \tag{2} \]

\[ \text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{3} \]

Table 5 shows the final results of the SVM based prediction for person names and location names.
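
The three measures in equations (1) to (3) can be computed from raw counts as in the following Python sketch. The function name `precision_recall_f` and the counts in the usage example are hypothetical and are not taken from the experiments; returning 0.0 when a denominator is zero is a convention chosen for the sketch.

```python
from typing import Tuple


def precision_recall_f(correct: int, predicted: int, actual: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure from counts.

    correct:   number of correctly predicted unknown words
    predicted: number of total predicted unknown words
    actual:    number of total unknown words in the test data
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r, f = precision_recall_f(correct=8, predicted=10, actual=16)
    print(f"precision={p:.4f} recall={r:.4f} F-measure={f:.4f}")
```
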
Table 5: Prediction performance for person names and location names

                 precision    recall    F-measure
person name      88.06%       90.96%    89.49%
location name    90.93%       91.34%    91.14%

The prediction results are quite promising: both recall and precision are very high compared with previous results in similar environments. This confirms that an SVM-based method using character features is a good approach to Chinese unknown word prediction. The additional features, namely Chinese family names, transliterated foreign names and the last characters of location names, help to increase the prediction performance. Since our SVM was trained on somewhat unbalanced data, there were some over-predicted results in the output, and here our postprocessing also plays a major role in increasing the final performance.

7 Conclusion

The unknown word problem has a marked effect on the accuracy and naturalness of the sound in Chinese TTS systems. In this paper, we present a two-phase method for high speed unknown word prediction that is usable in a TTS system. We first pre-detect the candidate boundaries of the unknown words from the result of Chinese segmentation and tagging. We then predict the unknown words using a support vector machine. Experimental results are very promising, showing high precision and high recall at high speed.

In the future, we plan to combine the proposed method with our automatic Text-to-Pinyin conversion module, which should yield more accurate conversion results. To further improve unknown word prediction, we also plan to apply our method to other classes such as organization names and more general compound nouns.

Acknowledgements

This research was supported by grant No. (R ) from the basic research program of the KOSEF (Korea Science and Engineering Foundation).

References

Chih-Chung Chang and Chih-Jen Lin. 2003. LIBSVM: a Library for Support Vector Machines. a guide of beginners, cjlin/libsvm.

Keh-Jiann Chen and Wei-Yun Ma. 2002. Unknown word extraction for Chinese documents. In Proceedings of COLING-2002, pages

Chooi-Ling Goh, Masayuki Asahara, and Yuji Matsumoto. 2003. Chinese unknown word identification using character-based tagging and chunking. In Proceedings of the 41st ACL Conference, pages

Ju-Hong Ha, Yu Zheng, and Gary G. Lee. 2002. Chinese segmentation and POS-tagging by automatic POS dictionary training. In Proceedings of the 14th Conference of Korean and Korean Information Processing, pages 33-39, (In Korean).

Ya-Jan Lv, Tie-Jun Zhao, Mu-Yun Yang, Hao Yu, and Sheng Li. 2000. Leveled unknown Chinese word recognition by dynamic programming. Journal of Chinese information, Vol.15 No.1 (In Chinese).

Kevin Zhang, Qun Liu, Hao Zhang, and Xue-Qi Cheng. 2002. Automatic recognition of Chinese unknown words based on roles tagging. In Proceedings of the 1st SIGHAN Workshop on Chinese Language Processing, COLING-2002.