An Unsupervised Method for Identifying Loanwords in Korean


Hahn Koo
San Jose State University

Manuscript to appear in Language Resources and Evaluation. The final publication is available at Springer.

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords, or transliterated foreign words, in Korean text. The classifier is trained on an unlabeled corpus using the Expectation-Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with apparent traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, the mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics, such as Japanese.

Keywords: Loanwords; Transliteration; Detection; N-gram; EM algorithm; Korean

1 Introduction

Loanwords are words whose meaning and pronunciation are borrowed from words in a foreign language. Their forms, both pronunciation and spelling, are often nativized. Their pronunciations adapt to conform to native sound patterns. Their spellings are transliterated using the native script and reflect the adapted pronunciations. For example, flask [flæsk] in English becomes 플라스크 [pʰɨl.ɾa.sɨ.kʰɨ] in Korean.

The present paper is concerned with building a system that scans Korean text and identifies loanwords¹ spelled in Hangul, the Korean alphabet. Such a system can be useful in many ways. First, one can use the system to collect data to study various aspects of loanwords (e.g. Haspelmath and Tadmor, 2009) or to develop machine transliteration systems (e.g. Knight and Graehl, 1998; Ravi and Knight, 2009). Loanwords or transliterations (e.g. 플라스크) can be extracted from monolingual corpora by running the system alone. Transliteration pairs (e.g. <flask, 플라스크>) can be extracted from parallel corpora by first identifying the output with the system and then matching input forms based on scoring heuristics such as phonetic similarity (e.g. Yoon et al., 2007). Second, the system allows one to use the etymological origin of a word as a feature and be more discriminating in text processing. For example, grapheme-to-phoneme conversion in Korean (Yoon and Brew, 2006) and stemming in Arabic (Nwesri, 2008) can be improved by keeping separate rules for native words and loanwords. The system can be used to classify a given word into either category and apply the proper set of rules.

The loanword identification system envisioned here is a binary, character-based n-gram classifier. Given a word (w) spelled in Hangul, the classifier decides whether the word is of native (N) or foreign (F) origin by Bayesian classification, i.e. solving the following equation:

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(w \mid c) \, P(c) \qquad (1)

The likelihood P(w | c) is calculated using a character n-gram model specific to that class.

¹ In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn, 1999).

The classifier is trained on a corpus in an unsupervised manner, building on seed words extracted from the corpus. The native seed consists of words with high token frequency in the corpus. The idea is that frequent words are more likely to be native words than foreign words. The foreign seed consists of words that contain what appear to be traces of vowel insertion. Korean does not have words that begin or end with consonant clusters. As in many other languages with similar phonotactics (e.g. Japanese), foreign words with consonant clusters are transliterated with vowels inserted to break up the clusters. So the presence of substrings that resemble traces of insertion suggests that a word may be of foreign origin. An obvious problem is deciding a priori what those traces look like. Here the problem is resolved by a heuristic based on phoneme co-occurrence statistics and rudimentary ideas and findings in phonology.

The rest of the paper is organized as follows. In Section 2, I discuss previous studies in foreign word identification as well as ideas and findings in phonology that the present study builds on. I describe the proposed method for developing the unsupervised classifier in detail in Section 3. I discuss experiments that evaluate the effectiveness of the method in Korean in Section 4 and pilot experiments in Japanese that explore its applicability to other languages in Section 5. I conclude the paper in Section 6.

2 Background

This work is motivated by previous studies on identifying loanwords or foreign words in monolingual data. Many of them rely on the assumption that the distribution of strings of sublexical units such as phonemes, letters, and syllables differs between words of different origins. Some write explicit, categorical rules stating which substrings are characteristic of foreign words (e.g. Bali et al., 2007; Khaltar and Fujii, 2009). Some train letter or syllable n-gram models separately for native words and foreign words and compare the two. It has been shown that the n-gram approach can be very effective in Korean (e.g. Jeong et al., 1999; Oh and Choi, 2001).

Training the n-gram models is straightforward with labeled data in which words are tagged either native or foreign. But creating labeled data can be expensive and tedious. In response, some have proposed methods for generating pseudo-annotated data: Baker and Brew (2008) for Korean and Goldberg and Elhadad (2008) for Hebrew.

In both studies, the authors suggest generating pseudo-loanwords by applying transliteration rules to a foreign lexicon such as the CMU Pronouncing Dictionary. They suggest different methods for generating pseudo-native words. Baker and Brew extract words with high token frequencies in a Korean newswire corpus, assuming that frequent words are more likely to be native than foreign. Goldberg and Elhadad extracted words from a collection of old Hebrew texts, assuming that old texts are much less likely to contain foreign words than recent texts. The approach is effective, and a classifier trained on the pseudo-labeled data can perform comparably to a classifier trained on manually labeled data. Baker and Brew trained a logistic regression classifier using letter trigrams on about 180,000 pseudo-words, half pseudo-Korean and half pseudo-English. Tested on a labeled set of 10,000 native Korean words and 10,000 English loanwords, the classifier showed 92.4% classification accuracy. In comparison, the corresponding classifier trained on manually labeled data showed 96.2% accuracy in a 10-fold cross-validation experiment.

The pseudo-annotation approach obviates the need to manually label data. But one has to write a separate set of transliteration rules for every pair of languages. In addition, the transliteration rules may not be available to begin with, if the very purpose of identifying loanwords is to collect training data for machine transliteration. The foreign seed extraction method proposed in the present study is an attempt to reduce the level of language-specificity and the demand for additional natural language processing capabilities. The method essentially equips one with a subset of transliteration rules by presupposing a generic pattern in pronunciation change, i.e. vowel insertion.

The method should be applicable to many language pairs. The need to repair consonant clusters arises for many language pairs, and vowel insertion is a repair strategy adopted in many languages. Foreign sound sequences that are phonotactically illegal in the native language are usually repaired rather than overlooked. A common source of phonotactic discrepancy involves consonant clusters: different languages allow consonant clusters of different complexity. Maddieson (2013) identifies 151 languages that allow a wide variety of consonant clusters, 274 languages that allow only a highly restricted set of clusters, and 61 languages that do not allow clusters at all. Illegal clusters are repaired by vowel insertion or consonant deletion, but vowel insertion appears to be cross-linguistically more common (Kang, 2011).

The vowel insertion pattern is initially characterized only generically as "insert vowel X in position Y to repair consonant cluster Z." The generic nature of the characterization ensures language-neutrality. But in order for the pattern to be of any use, one must eventually flesh out the details and provide instances of the pattern equivalent to specific transliteration rules: insert [u] between the consonants to repair [sm], or [sm] → [sum], for example. Here the language-specific details of vowel insertion are discovered from a corpus in a data-driven manner, but the search process is guided by findings and ideas in phonology. As will be described in detail below, possible values of which vowel is inserted where are constrained based on typological studies of loanword adaptation (e.g. Kang, 2011) and vowel insertion (e.g. Hall, 2011). Possible consonant sequences originating from a cluster are delimited by the idea of the sonority sequencing principle (e.g. Clements, 1990).

3 Proposal

The goal is to build a Bayesian classifier made of two character n-gram models: one for native words (N) and the other for foreign words (F). That is,

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(c) \, P(w \mid c) = \arg\max_{c \in \{N, F\}} P(c) \prod_i P(g_i \mid g_{i-n+1}^{i-1}, c) \qquad (2)

where g_i is the i-th character of w and g_{i-n+1}^{i-1} is the string of n − 1 characters preceding it. In this study, the n-gram models use Witten-Bell smoothing (Witten and Bell, 1991) for its ease of implementation. That is,

P(g_i \mid g_{i-n+1}^{i-1}, c) = (1 - \lambda_c(g_{i-n+1}^{i-1})) \, P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) + \lambda_c(g_{i-n+1}^{i-1}) \, P(g_i \mid g_{i-n+2}^{i-1}, c) \qquad (3)

So the parameters of the classifier consist of P(c), P_{mle}(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}).
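To make the model concrete, here is a minimal sketch in Python of a character bigram model with Witten-Bell smoothing, corresponding to equations (2) and (3). The class and method names are illustrative, not from the original implementation, and counts may be fractional to accommodate the soft assignments introduced below.

```python
import math
from collections import defaultdict

class BigramModel:
    """Character bigram model with Witten-Bell smoothing (eq. 3),
    interpolating the bigram MLE with a unigram lower-order model."""

    def __init__(self):
        self.bigram = defaultdict(float)    # count of (previous, current)
        self.context = defaultdict(float)   # count of previous character
        self.followers = defaultdict(set)   # distinct characters seen after previous
        self.unigram = defaultdict(float)   # count of current character
        self.total = 0.0

    def add(self, word, weight=1.0):
        """Accumulate (possibly fractional) counts for one word."""
        chars = ["<s>"] + list(word) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            self.bigram[(prev, cur)] += weight
            self.context[prev] += weight
            self.followers[prev].add(cur)
            self.unigram[cur] += weight
            self.total += weight

    def prob(self, prev, cur):
        """P(cur | prev), backing off to the unigram model (eqs. 3 and 6)."""
        p_uni = self.unigram[cur] / self.total if self.total else 0.0
        if self.context[prev] == 0:
            return p_uni
        n1plus = len(self.followers[prev])
        lam = n1plus / (n1plus + self.context[prev])  # Witten-Bell weight
        p_mle = self.bigram[(prev, cur)] / self.context[prev]
        return (1 - lam) * p_mle + lam * p_uni

    def logprob(self, word):
        """log P(w) under the model, used for seeding and classification."""
        chars = ["<s>"] + list(word) + ["</s>"]
        return sum(math.log(max(self.prob(p, c), 1e-12))
                   for p, c in zip(chars, chars[1:]))
```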

They can be estimated from data as follows:

P(c) = \frac{\sum_w z(w, c)}{\sum_w \sum_{c'} z(w, c')} \qquad (4)

P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) = \frac{\sum_w freq_w(g_{i-n+1}^{i}) \, z(w, c)}{\sum_w freq_w(g_{i-n+1}^{i-1}) \, z(w, c)} \qquad (5)

\lambda_c(g_{i-n+1}^{i-1}) = \frac{N_{1+}(g_{i-n+1}^{i-1})}{N_{1+}(g_{i-n+1}^{i-1}) + \sum_w freq_w(g_{i-n+1}^{i-1}) \, z(w, c)} \qquad (6)

Here z(w, c) indicates whether w is classified as c: z(w, c) = 1 if it is and z(w, c) = 0 otherwise. freq_w(x) is the number of times x occurs in w. N_{1+}(g_{i-n+1}^{i-1}) is the number of different n-grams with prefix g_{i-n+1}^{i-1} that occur at least once.

The challenge here is that the training corpus is unlabeled, i.e. z(w, c) is hidden. I use variants of the EM algorithm to iteratively guess z(w, c) and update the parameters. The n-gram models are initialized with seed words extracted from the corpus. For the native class, I use high-frequency words in the corpus as seed words: for example, all words whose token frequency is in the 95th percentile. For the foreign class, I first use sublexical statistics to list phoneme strings that would result from vowel insertion and then use words that contain those phoneme strings as seed words. Below I describe in detail how foreign seed words are extracted and how the seeded classifier is iteratively trained.
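The following sketch, built on the BigramModel above, shows how equations (4) and (5) turn (possibly fractional) assignments ẑ(w, c) into a class prior and two weighted-count models; the function and argument names are illustrative.

```python
def fit_classifier(words, z_hat):
    """Estimate P(c) (eq. 4) and the two n-gram models (eqs. 5-6) from
    assignments z_hat[(word, class)] in [0, 1]; a sketch, not the
    original implementation."""
    models = {"N": BigramModel(), "F": BigramModel()}
    mass = {"N": 0.0, "F": 0.0}
    for w in words:
        for c in ("N", "F"):
            weight = z_hat.get((w, c), 0.0)
            mass[c] += weight
            models[c].add(w, weight)  # weighted n-gram counts feed eqs. 5-6
    total = sum(mass.values())
    prior = {c: mass[c] / total for c in mass}
    return prior, models
```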

3.1 Foreign seed extraction

The method aims to identify loanwords whose original forms contain consonant clusters and use them as foreign seed words. This is done by string/pattern matching, where the pattern consists of phoneme strings that can result from vowel insertion. Consonant clusters do not begin or end syllables in Korean. When foreign words are borrowed, consonant clusters are repaired by inserting a vowel somewhere next to the consonants to break the cluster into separate syllables. Speakers usually insert the same vowel in the same position to repair a given consonant cluster. As a result, transliterations of different words with the same consonant cluster all share a common substring showing a trace of insertion. For example, 트라이 (try), 트레인 (train), 트리 (tree), 트롤 (troll), and 트루 (true) all have 트ㄹ, which is pronounced [tʰɨɾ]. The idea is to figure out in advance what those signature substrings are and look for words that have them. There is a risk of false positives, since such substrings may exist for reasons other than vowel insertion. But the hope is that the seeded classifier will gradually learn to be discriminating and use other substrings in words for further disambiguation.

The phoneme strings defining the pattern are specified below as tuples of the form <C₁C₂, V_id, V_loc> for ease of description. Each tuple characterizes a phoneme string made of two consonants and a vowel. C₁ and C₂ are the two consonants. V_id is the identity of the vowel. V_loc is the location of the vowel relative to the consonants, i.e. between, before, or after the consonants. For example, <s, n, ɨ, between> means [sɨn] as in [sɨnou] for 스노우 (snow), and <n, tʰ, ɨ, after> means [ntʰɨ] as in [hintʰɨ] for 힌트 (hint). The idea is to use C₁C₂ to specify consonants from a foreign cluster and V_id and V_loc to specify which vowel is inserted where to repair the cluster. Rather than manually listed using language expertise, the tuples are discovered from a corpus using the following heuristic:

1. List words that appear atypical compared with the native seed words.
2. Extract <C₁C₂, V_id, V_loc> tuples from the atypical words where
   (a) C₁C₂ respects the sonority sequencing principle, and
   (b) V_id and V_loc most strongly co-occur with C₁C₂ among all vowels.
3. Identify the most common V_id as the default vowel used for insertion. Keep tuples whose V_id matches the default vowel and throw away the rest.
4. Identify the most common V_loc of the default vowel as its site of insertion for clusters in each syllable position (onset or coda). Keep tuples whose V_loc matches the identified site of insertion and throw away the rest.

The basic idea is to find recurring combinations of a vowel and two consonants that potentially came from a foreign cluster. Step 1 defines the search space. It should be easier to see the target pattern if we zeroed in on loanwords. Native words have various morphological patterns that can obscure the target pattern. Of course, it is not yet known which words are loanwords. So instead the method avoids words similar to what are currently believed to be native words, i.e. the native seed words. Put differently, words dissimilar to the native seed words are tentatively loanwords. Here the similarity is measured by a word's length-normalized log probability, \frac{1}{|w|} \log P(w) for a word w of length |w|, according to a character n-gram model trained on the native seed words. A word is atypical if its probability ranks below a threshold percentile (e.g. 5%).
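A sketch of step 1, assuming the BigramModel above has been trained on the native seed; the function name and the percentile argument are illustrative.

```python
def atypical_words(words, native_model, percentile=5.0):
    """Step 1: rank words by length-normalized log probability under the
    native-seed model and return the bottom few percent as tentative
    loanwords for tuple extraction."""
    scored = sorted(words, key=lambda w: native_model.logprob(w) / len(w))
    cutoff = int(len(scored) * percentile / 100)
    return scored[:cutoff]
```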

Step 2 generates a first-pass list. Condition 2a delimits possible consonant sequences from a foreign cluster. According to the sonority sequencing principle, consonants in a syllable are ordered so that consonants of higher sonority appear closer to the vowel of the syllable. There are different proposals on what sonority is and how different classes of consonants rank on the sonority scale (e.g. Clements, 1990; Selkirk, 1984; Ladefoged, 2001). Here I simply classify consonants as either obstruents or sonorants (see Table 1) and stipulate that sonorants have higher sonority than obstruents. I also assume that the sonority of consonants does not change during transliteration, although their identities may change. For example, free changes from [fɹi] to [pʰɨɾi], but [pʰ] remains an obstruent and [ɾ] remains a sonorant. Accordingly, C₁C₂ must be obstruent-sonorant if it is from an onset cluster and sonorant-obstruent if it is from a coda cluster. To determine with certainty whether the consonants originally occupied onset or coda, I focus on phoneme strings found only at word boundaries. If C₁C₂ are the first two consonants of a word, they are from an onset. If they are the last two consonants of a word, they are from a coda.

<Insert Table 1 here>

Condition 2b is used to guess the vowel inserted to repair each cluster. Only one vowel is repeatedly used, so its co-occurrence with the consonants should be not only noticeable but the most noticeable among all vowels. Here the co-occurrence tendency is measured using pointwise mutual information:

PMI(C_1 C_2, V) = \log P(C_1 C_2, V) - \log P(C_1 C_2) \, P(V)

where V = <V_id, V_loc>.

The list is truncated to avoid false positives in steps 3 and 4. This is done by identifying the default vowel insertion strategy and keeping only the tuples consistent with it. Exactly which vowel is inserted where to repair a consonant cluster is context-specific. But a language that relies on vowel insertion for repair usually has a default vowel inserted in typical locations (cf. Uffmann, 2006). Here it is assumed that the default vowel is the one used to repair the most diverse set of consonant clusters. So it is the most frequent vowel in the list. Similarly, its default site of insertion is in principle its most frequent location in the list. But possible sites of insertion differ for onset clusters and coda clusters: before or between the consonants in an onset, but after or between the consonants in a coda (Hall, 2011). So the default site of insertion is identified separately for onset and coda.
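Condition 2b can be computed directly from co-occurrence counts. Below is a sketch that, given observed (C₁C₂, <V_id, V_loc>) pairs pulled from the atypical words, picks the vowel-and-location candidate with the highest PMI for each consonant pair; the names are illustrative and the pair-extraction step is assumed to have been done separately.

```python
import math
from collections import Counter

def best_vowel_per_cluster(observations):
    """observations: iterable of (c1c2, (v_id, v_loc)) pairs extracted
    from atypical words. Returns, for each consonant pair, the
    <V_id, V_loc> maximizing PMI(C1C2, V)."""
    observations = list(observations)
    joint = Counter(observations)
    cc_counts = Counter(cc for cc, _ in observations)
    v_counts = Counter(v for _, v in observations)
    n = len(observations)
    best = {}
    for (cc, v), k in joint.items():
        # PMI = log P(cc, v) - log P(cc) P(v) = log(k * n / (count(cc) * count(v)))
        pmi = math.log(k * n / (cc_counts[cc] * v_counts[v]))
        if cc not in best or pmi > best[cc][0]:
            best[cc] = (pmi, v)
    return {cc: v for cc, (_, v) in best.items()}
```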

3.2 Bootstrapping with EM

The parameters (θ) to estimate are P(c), P_{mle}(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}). The first parameter, P(c), is initialized according to some assumption about what proportion of words in the given corpus are loanwords. For example, if one assumes that 5% are loanwords, P(N) = 0.95 and P(F) = 0.05. The latter two parameters, which define the n-gram models, are initialized using the seed words as if they were labeled data: z(w, N) = 1 and z(w, F) = 0 for native seed words, and z(w, N) = 0 and z(w, F) = 1 for foreign seed words. Note that other words in the corpus are not used to initialize the n-gram models. The initial parameters are then updated on the whole corpus by iterating the following two steps until some stopping criterion is met.

E-step: Calculate the expected value of z(w, c) using the current parameters:

E[z(w, c)] = P(c \mid w; \theta^{(t)}) = \frac{P(w \mid c; \theta^{(t)}) \, P(c; \theta^{(t)})}{\sum_{c'} P(w \mid c'; \theta^{(t)}) \, P(c'; \theta^{(t)})} \qquad (7)

M-step: Transform the expected value to ẑ(w, c), i.e. some estimate of z(w, c), and plug it into equations (4-6) to update the parameters.

I experiment with three versions of the algorithm in the present study: soft EM, hard EM, and smoothstep EM. The three differ with respect to how E[z(w, c)] is transformed to ẑ(w, c). In soft EM, which is the same as the classic EM algorithm (Dempster et al., 1977), there is no transformation, i.e. ẑ(w, c) = E[z(w, c)]. In hard EM, ẑ(w, c) = 1 if c = arg max_{c'} E[z(w, c')] and ẑ(w, c) = 0 otherwise. Since there are only two classes here, this is equivalent to applying a threshold function at 0.5 to E[z(w, c)]. In smoothstep EM, a smooth step function is applied instead of the threshold function: ẑ(w, c) = f³(E[z(w, c)]), where f(x) = 3x² − 2x³ is applied three times. Figure 1 illustrates how E[z(w, c)] is transformed to ẑ(w, c) by the three variants of the EM algorithm.

<Insert Figure 1 here>
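The three transforms, and one full E/M sweep, can be sketched as follows, building on the helpers above; this is a sketch under the reading that smoothstep composes f with itself three times.

```python
import math

def transform(e, variant):
    """Map E[z(w, c)] to ẑ(w, c) for the three EM variants (cf. Figure 1).
    Smoothstep applies f(x) = 3x² - 2x³ three times, sharpening the
    curve toward a step while keeping it smooth."""
    if variant == "soft":
        return e
    if variant == "hard":
        return 1.0 if e >= 0.5 else 0.0
    f = lambda x: 3 * x ** 2 - 2 * x ** 3
    return f(f(f(e)))

def em_iteration(words, prior, models, variant="smoothstep"):
    """One E/M sweep over the corpus, a sketch built on fit_classifier."""
    z_hat = {}
    for w in words:
        # E-step (eq. 7): posterior over classes, computed from log scores.
        scores = {c: math.log(prior[c]) + models[c].logprob(w) for c in prior}
        m = max(scores.values())
        post = {c: math.exp(s - m) for c, s in scores.items()}
        total = sum(post.values())
        for c in post:
            z_hat[(w, c)] = transform(post[c] / total, variant)
    # M-step: re-estimate the prior and both models (eqs. 4-6).
    return fit_classifier(words, z_hat)
```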

As will be shown in the experiments below, soft EM is aggressive while hard EM is conservative in recruiting words to the foreign class. Soft EM gives partial credit even to words that are very unlikely to be foreign according to the current model. Over time, such words may manage to gain enough confidence and be considered foreign. Some of them may turn out to be false positives. On the other hand, hard EM does not give any credit even to words that are just barely below the threshold to be considered foreign. Some of them may turn out to be false negatives. Smoothstep EM is a compromise between the two extremes. It virtually ignores words that do not stand a chance but gives due credit to words that barely missed.

4 Experiments

Experiments show that the proposed approach can be effective in Korean despite its unsupervised nature. Classifiers built on a raw corpus with minor preprocessing (e.g. removing tokens with non-Hangul characters) identify loanwords in test lexicons well. The foreign seed extraction method correctly identifies the default vowel insertion strategy in Korean loanword phonology. The resulting classifier performs better when initialized with the proposed seeding method than with random seeding. Its performance is not far behind the corresponding supervised classifier either. Moreover, after exposure to the words (but not their labels) used to train the supervised classifier, the unsupervised classifier performs at a level comparable to the supervised classifier. I discuss the details of the experiments below.

4.1 Methods

I use four datasets called SEJONG, KAIST, NIKL-1, and NIKL-2 below. SEJONG and KAIST are unlabeled data used to initialize and train the unsupervised classifier. SEJONG consists of 1,019,853 types and 9,206,430 tokens of eojeols, which are character strings delimited by white space, equivalent to words or phrases. The eojeols are from a morphologically annotated corpus developed in the 21st Century Sejong Project under the auspices of the Ministry of Culture, Sports, and Tourism of South Korea, and the National Institute of the Korean Language (2011).

They were selected by extracting Hangul character strings delimited by white space after removing punctuation marks. Strings that contained non-Hangul characters (e.g. 12월의, Farrington으로부터) were excluded in the process. KAIST consists of 2,409,309 types and 31,642,833 tokens of eojeols from the KAIST corpus (Korea Advanced Institute of Science and Technology, 1997), extracted in the same way as SEJONG. NIKL-1 and NIKL-2 are labeled data used to test the classifier. They are made of words from various language resources released by the National Institute of the Korean Language (NIKL). NIKL-1 consists of 49,962 native words and 21,176 foreign words selected from two lexicons (NIKL, 2008, 2013). NIKL-2 consists of 44,214 native words and 18,943 foreign names selected from four reports released by NIKL (2000a,b,c,d) and a list of transliterated names of people and places originally spelled in Latin alphabets (NIKL, 2013). I examined the words manually and labeled them either native or foreign. Words of unknown or ambiguous etymological origin were excluded in the process. SEJONG and NIKL-1 are mainly used to examine the effectiveness of the proposed methods. KAIST and NIKL-2 are used to examine whether the methods are robust to varying data. See Table 2 for a summary of data sizes.

<Insert Table 2 here>

The proposed methods are implemented as follows. All n-gram models are trained on character bigrams, where each Hangul character represents a syllable. The high-frequency words defining the native seed are eojeols whose token frequency is above the 95th percentile in a given corpus. When extracting the foreign seed, the so-called atypical words are eojeols whose length-normalized n-gram probabilities lie in the bottom 5% according to the model trained on the native seed. Their phonetic transcriptions are generated by applying the simple rewrite rules in Appendix A. For bootstrapping, the prior probabilities are initialized to P(c = N) = 0.95 and P(c = F) = 0.05. The parameters of the classifier are iteratively updated until the average likelihood of the data improves by no more than 0.01% or the number of iterations reaches 100.

Classification performance is measured in terms of precision, recall, and F-score. Here, precision (p) is the percentage of words correctly classified as foreign out of all words classified as foreign. Recall (r) is the percentage of words correctly classified as foreign out of all words that should have been classified as foreign. F-score is the harmonic mean of the two with equal emphasis on both, i.e. F = 2pr/(p + r).
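For concreteness, the three scores can be computed as follows (a straightforward sketch; "F" marks the foreign class and "N" the native class):

```python
def scores(gold, predicted):
    """Precision, recall, and F-score on the foreign class, as defined above."""
    tp = sum(g == "F" and p == "F" for g, p in zip(gold, predicted))
    fp = sum(g == "N" and p == "F" for g, p in zip(gold, predicted))
    fn = sum(g == "F" and p == "N" for g, p in zip(gold, predicted))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```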

To put the numbers in perspective, scores of classifiers built using the proposed methods are compared with those of supervised classifiers and randomly seeded classifiers. Supervised classifiers are trained and tested on the labeled data (NIKL-1 or NIKL-2) using five-fold cross-validation. The labeled data is partitioned into five equal-sized subsets. The supervised classifier is trained on four subsets and tested on the remaining subset. This is repeated five times for the five different combinations of subsets. Randomly seeded classifiers are unsupervised classifiers with just a different seeding strategy: 5% of the words in the corpus are randomly chosen as foreign seed words and the rest are native seed words. For fair comparison, the unsupervised classifiers are also tested five separate times on the five subsets of labeled data that the supervised classifier is tested on. Accordingly, the classification scores reported below are the arithmetic means of the scores on the five subsets.

4.2 Results and discussion

The foreign seed extraction method correctly identifies the default vowel insertion strategy. Table 3 lists the number of different consonant clusters for which each vowel in Korean is selected as the top candidate. [ɨ] is predicted to be the default vowel, as it is chosen most often overall. Its predicted site of insertion for onset clusters is between the consonants of each cluster, as it is chosen more often there than before the consonants. Similarly, its predicted site of insertion for coda clusters is after the consonants of each cluster rather than between the consonants.

<Insert Table 3 here>

The 28 phoneme strings made of the default vowel and the consonant pairs it allegedly separates are listed in the row labeled SEJONG in Table 4. They specify what traces of vowel insertion would look like and define the pattern matched against the atypical words to extract the foreign seed. All but three of them indeed occur as traces of vowel insertion in one or more loanwords in the entire data used for the present study. The foreign seed consists of 2,500 eojeols (out of 50,992 atypical ones) that contain one or more of the phoneme strings. The foreign seed does contain false positives, but their proportion is not that big: 489/2,500 (19.56%). Since SEJONG is unlabeled and too large, it is hard to tell what percentage of loanwords the foreign seed represents.

But if one extracted all atypical words in NIKL-1 that contained the phoneme strings, it would return a foreign seed containing 458/21,176 = 2.16% of all the loanwords in the dataset. So the foreign seed is small in size and represents a tiny fraction of loanwords.

<Insert Table 4 here>

The seeded classifier can be trained effectively with smoothstep EM (see row 2 in Table 5 for scores). Despite the small seed, recall is high (85.51%) without compromise in precision (94.21%). The scores are, of course, lower than those of the supervised classifier (see row 1 in Table 5). Precision is lower by 2.67% points and recall is lower by 10.95% points. But considering the unsupervised nature of the approach, the scores are encouraging.

The classifier performs better when trained with smoothstep EM than with the other two variants of EM (see rows 4 and 5 in Table 5). Precision is just as high but recall is a bit lower (80.16%) when trained with hard EM. On the other hand, precision is miserable (47.81%) although recall is higher (91.46%) when trained with soft EM. Figure 2 illustrates how well the classifier performs on NIKL-1 over time as it is iteratively trained on SEJONG with the three variants of EM. Right after initialization, the scores of the classifier are precision = 93.82% and recall = 52.07%. All three variants boost recall significantly within the first several iterations. Soft EM is the most successful, followed by smoothstep EM and then hard EM. But while the other two not only maintain but also marginally improve precision, soft EM steadily loses precision throughout the whole training session.

<Insert Figure 2 here>

Bootstrapping is more effective with the proposed seeding method than with random seeding. Scores of three different randomly seeded classifiers trained with smoothstep EM are listed in rows 6-8 in Table 5. Compared to the proposed classifier, although their precision is higher by around 1% point, their recall is lower by around 14% points. But their performance is rather consistent as well as strong and deserves a closer look. The three randomly seeded classifiers all followed a similar trajectory as they evolved. To briefly describe the process using a clustering analogy: the foreign cluster, which started out as a random set of 50,992 eojeols (5% of the corpus), immediately shrank to a much smaller set including eojeols with hapax character bigrams, i.e. bigrams whose type frequency in the corpus is one. For one of the three classifiers, the foreign cluster shrank to a set of 5,421 eojeols as soon as training began, and 2,061 of them contained hapax bigrams.

It is likely that many words containing hapax bigrams were loanwords, and the foreign cluster eventually grew around them. In fact, among the 4,378 words in NIKL-1 containing character bigrams that appear only once in SEJONG, 1,601 are native words and 2,777 are loanwords. The process makes intuitive sense. At the beginning, the foreign cluster is overwhelmed in size by the native cluster and unlikely to have homogeneous subclusters due to random initialization. Eojeols in the foreign cluster will be absorbed by the native cluster unless they have bigrams that seem alien to the native cluster. Hapax bigrams are a prime example of such bigrams, and as a result they figure more prominently in the foreign cluster. Loanwords are alien to begin with, so it makes sense that they are more likely to have hapax bigrams than native words. The dynamics involving data size, randomness, hapax bigrams, and loanwords are indeed interesting and did lead to good classifiers. But at the moment, it is not clear whether they are reliable and predictable. More importantly, the proposed seeding method led to significantly better classifiers.

Robustness to noise: The proposed methods are effective despite some noise in the training data. There are two sources of noise in SEJONG: crude grapheme-to-phoneme conversion (G2P) and lack of morphological processing.

G2P generates the phonetic transcriptions required for foreign seed extraction. In the experiments above, the transcriptions were generated by applying a rather simple set of rules. Grapheme-phoneme correspondence in Hangul is quite regular, but there are phonological patterns such as coda neutralization and tensification (Sohn, 1999) that the rules do not capture. Accordingly, the resulting transcriptions are decent approximations but occasionally incorrect. In fact, when the rules are tested on 14,007 words randomly chosen from the Standard Korean Dictionary, word accuracy and phoneme accuracy are 67.92% and 94.67%. One could ask if the proposed methods would perform better with more accurate transcriptions. An experiment with a better G2P suggests that the approximate transcriptions are good enough. A joint 5-gram model (Bisani and Ney, 2008) was trained on 126,068 words from the Standard Korean Dictionary. The model transcribes words in SEJONG differently from the rules: by 36.62% in terms of words and 5.53% in terms of phonemes. The model's transcriptions are expected to be more accurate: its word accuracy and phoneme accuracy on the 14,007 words mentioned above are 95.30% and 99.35%. Building the classifier from scratch using the new transcriptions barely changes the results.

The foreign seed extraction method again correctly identifies the default vowel insertion strategy. It identifies [ɨ] as the default vowel, inserted between the consonants in onset and after the consonants in coda. It picks 31 phoneme strings including the vowel as potential traces of insertion (see SEJONG-g2p in Table 4). All but four of them have example loanwords in which they occur as traces of vowel insertion. The set of phoneme strings is similar to the one identified before, with a 73.53% overlap between the two. The resulting foreign seed is even more similar to the previous seed, with an 84.35% overlap between the two. The new seed is slightly larger than the previous seed (2,527 vs. 2,500 words) but has a higher proportion of false positives (20.66% vs. 19.56%). The two seeds lead to very similar classifiers trained with smoothstep EM. The two trained classifiers tag 99.39% of the words in NIKL-1 in the same way, and their scores differ by only 0.24%-0.48% points (see row 9 in Table 5 for the new classification scores).

The training data in the experiments above include eojeols containing both native and foreign morphemes. Loanwords can be suffixed with native morphemes, combine with native words to form compounds, or both. A good example is 투자펀드를 (investment-fund-ACC), where 투자 and 를 are native and 펀드 is foreign. Such items may mislead the classifier to recruit false positives during training. One could ask if the performance of the proposed methods can be improved by stemming or further morpheme segmentation. Experiments suggest that they improve precision but at the sacrifice of recall. Data for the experiments consist of a set of 250,844 stems and a set of 132,430 non-suffix morphemes in SEJONG. Eojeols in SEJONG are morphologically annotated in the original corpus. For example, 투자펀드를 is annotated 투자/NNG + 펀드/NNG + 를/JKO. Stems were extracted by removing substrings tagged as suffixes and particles (e.g. 투자펀드를 → 투자펀드). Non-suffix morphemes were extracted by splitting the derived stems at specified morpheme boundaries (e.g. 투자펀드 → 투자 and 펀드). Two classifiers were built from scratch with rule-based transcriptions: one using the stems and the other using the morphemes.

The foreign seed extraction method is as effective as when it was applied to eojeols. It correctly identifies the default vowel and its site of insertion in both data sets. The phoneme strings identified as potential traces of insertion are listed in the rows labeled SEJONG-stem and SEJONG-morph in Table 4. As before, many of them are indeed found in loanwords because of vowel insertion, while a few of them are not. The resulting seeds are much smaller but contain proportionally fewer false positives than before: 59/642 (9.20%) and 58/323 (17.96%) when using stems and morphemes, respectively, vs. 489/2,500 (19.56%) when using eojeols.

Scores of the seeded classifiers trained with smoothstep EM are listed in rows 10 and 11 in Table 5. Compared to the classifier trained on eojeols, precision improves by 1.55 and 2.14% points, respectively, but recall plummets, by as much as 23.81% points. The gain in precision is tiny compared to the loss in recall. Perhaps one could prevent the loss in recall by adding more data. But the current results suggest that the proposed methods are good enough, if not better off, without morphological processing.

Robustness to varying data: Experiments with different Korean data suggest that the proposed methods are effective in Korean in general rather than only on the particular data used above. A new classifier was built from scratch on KAIST using rule-based transcriptions and smoothstep EM and tested on NIKL-2. Its performance was compared with the unsupervised classifier trained on SEJONG and a new supervised classifier trained on subsets of NIKL-2. The foreign seed extraction method again correctly identifies the default vowel and its site of insertion. It picks 26 phoneme strings including the vowel as potential traces of insertion (see KAIST in Table 4). All but one of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 4,179 eojeols. The seed contains relatively more false positives (27.35%) than when using eojeols in SEJONG (19.56%). But the scores of the SEJONG classifier and the resulting KAIST classifier tested on NIKL-2 are barely different (see rows 13 and 15 in Table 5). The SEJONG classifier is behind the supervised classifier by 5.31% points in precision and 11.20% points in recall (see row 12 in Table 5 for scores of the supervised classifier). The difference is slightly larger than the difference observed with NIKL-1. This is most likely because SEJONG is more different from NIKL-2 than it is from NIKL-1. The perplexity of a character bigram model trained on SEJONG is higher on NIKL-2 (564.55) than on NIKL-1 (484.18).

Adaptation: Unlike for the supervised classifier, the training data and the test data for the unsupervised classifiers come from different sources. For example, one unsupervised classifier was trained on SEJONG and tested on NIKL-1, while the supervised classifier compared with it was both trained and tested on NIKL-1. So the comparison between the two was not entirely fair. Experiments show that a simple adaptation method such as linear interpolation can fix the problem. In sum, a baseline classifier is interpolated with a new classifier that inherits parameters from the baseline classifier and is iteratively trained on adaptation data.

The classifiers are interpolated and make predictions according to the following equation:

\hat{c}(w) = \arg\max_{c} \, (1 - \lambda) \, P_{base}(w, c) + \lambda \, P_{new}(w, c) \qquad (8)

Here the baseline classifier is the classifier trained on words from an unlabeled corpus (e.g. SEJONG), and the adaptation data is the portion of the labeled data (e.g. NIKL-1) used to train the comparable supervised classifier. Of course, the adaptation data does not include labels from the original data. The idea is not to provide feedback but to merely expose the classifier to the kinds of words it will be asked to classify later. In the experiments, the new classifier was trained on 90% of the adaptation data with smoothstep EM, just like the baseline classifier. The interpolation weights were estimated using the remaining 10% with the classic EM algorithm. Applying the method to adapt the SEJONG and KAIST classifiers to the NIKL data significantly improves their performance. F-scores of the unsupervised classifiers after adaptation are behind those of the comparable supervised classifiers by no more than 2.5% points. See rows 3, 14, and 16 in Table 5 for scores after adaptation.

<Insert Table 5 here>
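A sketch of the interpolated decision rule in equation (8); base_score and new_score are assumed to return the joint score P(w, c) = P(c) P(w | c) of the respective classifiers, and lam is the weight estimated on the held-out adaptation data.

```python
def classify_interpolated(word, base_score, new_score, lam):
    """Decision rule of eq. (8): mix the baseline and adapted classifiers."""
    def score(c):
        return (1 - lam) * base_score(word, c) + lam * new_score(word, c)
    return max(("N", "F"), key=score)
```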

5 Applicability to other languages: a pilot study in Japanese

Ideally, the proposed approach should work with any language that does not allow consonant clusters and relies on vowel insertion to repair foreign clusters. In this section, I demonstrate its potential applicability with a pilot study in Japanese. In addition to not allowing consonant clusters, Japanese does not allow consonants in coda except the moraic nasal (e.g. [san]) and the first part of a geminate obstruent that straddles two syllables (e.g. [kip.pu]). The vowel inserted for repair is usually [u] (e.g. フランス [huransu] for "France"), but [o] for the coronal stops [t] and [d] (e.g. トレンド [torendo] for "trend"). It is inserted between the consonants to repair onset clusters and after the consonants to repair coda clusters beginning with [n]. But for other coda clusters, it is inserted after each consonant of the cluster (e.g. ヘルス [herusu] for "health"). The patterns are similar to Korean, so the approach should work without much modification.

The data for the experiment consist of 108,816 words for training and 148,128 words for testing. The training data came from the JEITA corpus (Hagiwara, 2013). Word boundaries and pronunciations are not obvious in raw Japanese text: words are not delimited by white space and are sometimes spelled in kanji, which are logographic, rather than hiragana or katakana, which are phonographic. Fortunately, the corpus comes with the words segmented and additionally spelled in katakana. It is those katakana spellings that constitute the training data. The test data came from JMDict (Breen, 2004), a lexicon annotated with various information, including pronunciation transcribed in either hiragana or katakana and the source language if a word is a loanword. Since loanwords in Japanese are spelled in katakana, I labeled words spelled without any katakana characters as native and words that had language source information and were spelled only in katakana as foreign. This led to a test set of 130,237 native words and 17,891 foreign words.

Some of the words in the training and test data were respelled to make the classification task non-trivial. First, all words in hiragana were respelled in katakana (e.g. それ → ソレ). Otherwise, one could simply label any word in hiragana as native and avoid false positives. Second, all instances of choonpu were replaced with the proper vowel characters given the context (e.g. ハープーン [haapuun] "harpoon" → ハアプウン). The choonpu character ー in katakana indicates long vowels, which in hiragana are indicated by adding an extra vowel character. Without the correction, one could simply label words with choonpu as foreign and identify a significant portion of loanwords.

The n-gram models in the experiment were trained on katakana character bigrams. Phonetic transcriptions for foreign seed extraction were generated essentially by romanization. Katakana symbols were romanized following the Nihon-shiki system (e.g. シャツ → syatu) and each letter was mapped to the corresponding phonetic symbol (e.g. syatu → [sjatu]). All other aspects of the experiment were set up in the same way as in the experiments in Korean.
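The hiragana-to-katakana respelling can be done with plain Unicode arithmetic, since the two blocks are parallel; a sketch (the choonpu replacement, which needs context, is not shown):

```python
def hiragana_to_katakana(text):
    """Respell hiragana in katakana (e.g. それ → ソレ) so that script alone
    cannot give a word's class away; hiragana U+3041-U+3096 maps onto
    katakana by a fixed offset of 0x60 code points."""
    return "".join(chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
                   for ch in text)
```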

The results appear promising. The foreign seed extraction method identifies [u] as the default vowel and its site of insertion as between the consonants in onset and after the consonants in coda. It picks 14 phoneme strings including the vowel as potential traces of insertion (see JEITA in Table 4). Eight of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 173 words that include 68 false positives (46.26%). It is encouraging that the method correctly identifies the default vowel insertion strategy. But the resulting foreign seed is quite small, partly because the corpus is small to begin with, and less accurate than the seeds in the Korean experiments. Classification scores are listed in Table 5. Overall, the scores are lower than the scores achieved in Korean. Considering that the scores are lower even for the supervised classifier, it seems that character bigrams are less effective in Japanese than in Korean. As expected from the size of the foreign seed, recall of the unsupervised classifier is quite low. But after adaptation to the lexicon, recall improves significantly and the F-score is not far behind that of the supervised classifier.

6 Conclusion

I proposed an unsupervised method for developing a classifier that identifies loanwords in Korean text. As shown in the experiments discussed above, the method can yield an effective classifier that can be made to perform at a level comparable to that of a supervised classifier. The method is cost-efficient, as it does not require language resources other than a large monolingual corpus, a grapheme-to-phoneme converter, and perhaps a lexicon to supplement the corpus. The method is in principle applicable to a wide range of languages, i.e. those that rely on vowel insertion to repair illegal consonant clusters. Results from the pilot experiment in Japanese were encouraging. Future studies will further explore the applicability of the method to other languages, especially under-resourced languages.

References

Baker, K. and Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 08).

Bali, R.-M., Chong, C. C., and Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th International Symposium on Natural Language Processing.

Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5).

Breen, J. (2004). JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources. Association for Computational Linguistics.

Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In Kingston, J. and Beckman, M., editors, Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. Cambridge: Cambridge University Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).

Goldberg, Y. and Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational Linguistics and Intelligent Text Processing. Springer Berlin-Heidelberg.

Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in ChaSen format).

Hall, N. (2011). Vowel epenthesis. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology. Malden, MA & Oxford: Wiley-Blackwell.

Haspelmath, M. and Tadmor, U. (2009). Loanwords in the World's Languages: A Comparative Handbook. Walter de Gruyter.

Jeong, K. S., Myaeng, S. H., Lee, J. S., and Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35.

Kang, Y. (2011). Loanword phonology. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology. Malden, MA & Oxford: Wiley-Blackwell.

Khaltar, B.-O. and Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing & Management, 45(4).

Knight, K. and Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4).

Korea Advanced Institute of Science and Technology (1997). Automatically analyzed large scale KAIST corpus [Data file].

Ladefoged, P. (2001). A Course in Phonetics, 4th edition. Orlando: Harcourt Brace.

Maddieson, I. (2013). Syllable structure. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language (2011). The 21st century Sejong project [Data file].

NIKL (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document.

NIKL (2000b). pyojuneo geomtoyong jaryo. Resource document.

NIKL (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document.

NIKL (2000d). yongeon hwalyongpyo. Resource document.

NIKL (2008). Survey of the state of loanword usage [Data file].

NIKL (2013). oeraeeo pyogi yongrye jaryo: romaja inmyeonggwa jimyeong. Resource document.

Nwesri, A. F. A. (2008). Effective Retrieval Techniques for Arabic Text. PhD thesis, RMIT University, Melbourne, Australia.

Oh, J.-H. and Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the International Conference on Computer Processing of Oriental Languages.

Ravi, S. and Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Selkirk, E. (1984). On the major class features and syllable theory. In Aronoff, M. and Oerhle, R. T., editors, Language Sound Structure: Studies in Phonology Presented to Morris Halle by His Teachers and Students. Cambridge, MA: MIT Press.

Sohn, H.-M. (1999). The Korean Language. Cambridge: Cambridge University Press.

Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7).

Witten, I. H. and Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4).

Yoon, K. and Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4).

Yoon, S.-Y., Kim, K.-Y., and Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.

Appendix A. Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing the syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Letter       Phoneme(s)   Letter       Phoneme(s)   Letter    Phoneme(s)
ᄀ            k            ᄁ            k*           ᄂ         n
ᄃ            t            ᄄ            t*           ᄅ (onset)  ɾ
ᄅ (coda)     l            ᄆ            m            ᄇ         p
ᄈ            p*           ᄉ            s            ᄊ         s*
ᄋ (onset)    (null)       ᄋ (coda)     ŋ            ᄌ         tʃ
ᄍ            tʃ*          ᄎ            tʃʰ          ᄏ         kʰ
ᄐ            tʰ           ᄑ            pʰ           ᄒ         h
ㅏ            a            ㅑ            ja           ㅐ         æ
ᅤ            jæ           ᅥ            ʌ            ᅧ         jʌ
ᅦ            e            ᅨ            je           ᅩ         o
ᅭ            jo           ᅪ            wa           ᅫ         wæ
ᅬ            ø            ᅮ            u            ᅲ         ju
ᅯ            wʌ           ᅰ            we           ᅱ         wi
ᅳ            ɨ            ᅵ            i            ᅴ         ɨi
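The first step, decomposing a precomposed Hangul syllable into its letters, follows directly from the arithmetic layout of the Unicode Hangul Syllables block; a sketch:

```python
# Jamo inventories in the standard Unicode order for precomposed syllables.
LEADS = list("ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ")
VOWELS = list("ᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ")
TAILS = [""] + list("ᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ")

def decompose(syllable):
    """Split one precomposed syllable into lead, vowel, and (optional) tail
    letters, e.g. 한 → (ᄒ, ᅡ, ᆫ); the letters are then mapped to phonemes
    with the table above."""
    index = ord(syllable) - 0xAC00
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead], VOWELS[vowel], TAILS[tail]
```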

Table 1: Korean phonemes and their place in the proposed sonority hierarchy.

Class        Phonemes
Obstruents   p p* pʰ t t* tʰ k k* kʰ s s* h tʃ tʃ* tʃʰ
Sonorants    m n ŋ ɾ l w j
Vowels       a e i o u æ ʌ ø ɨ ɨi

Table 2: Data sizes in number of unique words or eojeols.

Class     SEJONG     KAIST      NIKL-1   NIKL-2   JEITA     JMDict
Native    unknown    unknown    49,962   44,214   unknown   130,237
Foreign   unknown    unknown    21,176   18,943   unknown   17,891
Total     1,019,863  2,409,309  71,138   63,157   108,816   148,128


Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Discovering suffixes: A Case Study for Marathi Language

Discovering suffixes: A Case Study for Marathi Language Discovering suffixes: A Case Study for Marathi Language Mudassar M. Majgaonker Comviva Technologies Limited Gurgaon, India Abstract Suffix stripping is a pre-processing step required in a number of natural

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Document Image Retrieval using Signatures as Queries

Document Image Retrieval using Signatures as Queries Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and

More information

Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus

Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus Jian Zhang Department of Computer Science and Technology of Tsinghua University, China ajian@s1000e.cs.tsinghua.edu.cn

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Reliable and Cost-Effective PoS-Tagging

Reliable and Cost-Effective PoS-Tagging Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Uwe D. Reichel Department of Phonetics and Speech Communication University of Munich reichelu@phonetik.uni-muenchen.de Abstract

More information

Framework for Joint Recognition of Pronounced and Spelled Proper Names

Framework for Joint Recognition of Pronounced and Spelled Proper Names Framework for Joint Recognition of Pronounced and Spelled Proper Names by Atiwong Suchato B.S. Electrical Engineering, (1998) Chulalongkorn University Submitted to the Department of Electrical Engineering

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Sentiment analysis: towards a tool for analysing real-time students feedback

Sentiment analysis: towards a tool for analysing real-time students feedback Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: nabeela.altrabsheh@port.ac.uk Mihaela Cocea Email: mihaela.cocea@port.ac.uk Sanaz Fallahkhair Email:

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Identifying Learning Styles in Learning Management Systems by Using Indications from Students Behaviour

Identifying Learning Styles in Learning Management Systems by Using Indications from Students Behaviour Identifying Learning Styles in Learning Management Systems by Using Indications from Students Behaviour Sabine Graf * Kinshuk Tzu-Chien Liu Athabasca University School of Computing and Information Systems,

More information

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,

More information

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013 ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Using News Articles to Predict Stock Price Movements

Using News Articles to Predict Stock Price Movements Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Things to remember when transcribing speech

Things to remember when transcribing speech Notes and discussion Things to remember when transcribing speech David Crystal University of Reading Until the day comes when this journal is available in an audio or video format, we shall have to rely

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

An Arabic Text-To-Speech System Based on Artificial Neural Networks

An Arabic Text-To-Speech System Based on Artificial Neural Networks Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department

More information

Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics? Historical Linguistics Diachronic Analysis What is Historical Linguistics? Historical linguistics is the study of how languages change over time and of their relationships with other languages. All languages

More information

Interpreting areading Scaled Scores for Instruction

Interpreting areading Scaled Scores for Instruction Interpreting areading Scaled Scores for Instruction Individual scaled scores do not have natural meaning associated to them. The descriptions below provide information for how each scaled score range should

More information

Nonparametric statistics and model selection

Nonparametric statistics and model selection Chapter 5 Nonparametric statistics and model selection In Chapter, we learned about the t-test and its variations. These were designed to compare sample means, and relied heavily on assumptions of normality.

More information

Computer Aided Document Indexing System

Computer Aided Document Indexing System Computer Aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić, Jan Šnajder Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 0000 Zagreb, Croatia

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Oracle Watchlist Screening

Oracle Watchlist Screening 1 Oracle Watchlist Screening Mike Matthews 3 rd party logo 2 Topics Screening trends & needs Increasing screening data accuracy Reducing false positives Screening international data

More information

CS 533: Natural Language. Word Prediction

CS 533: Natural Language. Word Prediction CS 533: Natural Language Processing Lecture 03 N-Gram Models and Algorithms CS 533: Natural Language Processing Lecture 01 1 Word Prediction Suppose you read the following sequence of words: Sue swallowed

More information

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de

More information

Evaluation of Bayesian Spam Filter and SVM Spam Filter

Evaluation of Bayesian Spam Filter and SVM Spam Filter Evaluation of Bayesian Spam Filter and SVM Spam Filter Ayahiko Niimi, Hirofumi Inomata, Masaki Miyamoto and Osamu Konishi School of Systems Information Science, Future University-Hakodate 116 2 Kamedanakano-cho,

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

Turker-Assisted Paraphrasing for English-Arabic Machine Translation

Turker-Assisted Paraphrasing for English-Arabic Machine Translation Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University

More information

Lab 11. Simulations. The Concept

Lab 11. Simulations. The Concept Lab 11 Simulations In this lab you ll learn how to create simulations to provide approximate answers to probability questions. We ll make use of a particular kind of structure, called a box model, that

More information

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-Universität zu Berlin Introduction Context of work: Error-based online failure prediction: error

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

ScreenMatch: Providing Context to Software Translators by Displaying Screenshots

ScreenMatch: Providing Context to Software Translators by Displaying Screenshots ScreenMatch: Providing Context to Software Translators by Displaying Screenshots Geza Kovacs MIT CSAIL 32 Vassar St, Cambridge MA 02139 USA gkovacs@mit.edu Abstract Translators often encounter ambiguous

More information

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke 1 Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models Alessandro Vinciarelli, Samy Bengio and Horst Bunke Abstract This paper presents a system for the offline

More information

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Tibetan For Windows - Software Development and Future Speculations. Marvin Moser, Tibetan for Windows & Lucent Technologies, USA

Tibetan For Windows - Software Development and Future Speculations. Marvin Moser, Tibetan for Windows & Lucent Technologies, USA Tibetan For Windows - Software Development and Future Speculations Marvin Moser, Tibetan for Windows & Lucent Technologies, USA Introduction This paper presents the basic functions of the Tibetan for Windows

More information

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base 32 Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base Brant N. Kay Brian C. Rineer SAS Institute Inc. SAS Institute Inc. 100 SAS Campus Drive 100 SAS Campus Drive

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012 Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about

More information

Text-To-Speech Technologies for Mobile Telephony Services

Text-To-Speech Technologies for Mobile Telephony Services Text-To-Speech Technologies for Mobile Telephony Services Paulseph-John Farrugia Department of Computer Science and AI, University of Malta Abstract. Text-To-Speech (TTS) systems aim to transform arbitrary

More information

TELT March 2014 Exa miners Report

TELT March 2014 Exa miners Report TELT March 2014 Exa miners Report 1. Introduction 101 candidates sat for the TELT March 2014 examination session. 53 candidates were awarded Pass grades or higher. This is the equivalent to 52.5 % pass

More information

Probabilistic topic models for sentiment analysis on the Web

Probabilistic topic models for sentiment analysis on the Web University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information