An Unsupervised Method for Identifying Loanwords in Korean


Hahn Koo
San Jose State University

Manuscript to appear in Language Resources and Evaluation. The final publication is available at Springer.

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords, or transliterated foreign words, in Korean text. The classifier is trained on an unlabeled corpus using the Expectation-Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with apparent traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, the mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics, such as Japanese.

Keywords: Loanwords; Transliteration; Detection; N-gram; EM algorithm; Korean

1 Introduction

Loanwords are words whose meaning and pronunciation are borrowed from words in a foreign language. Their forms, both pronunciation and spelling, are often nativized. Their pronunciations adapt to conform to native sound patterns. Their spellings are transliterated using the native script and reflect the adapted pronunciations. For example, flask [flæsk] in English becomes 플라스크 [pʰɨl.ɾa.sɨ.kʰɨ] in Korean.

The present paper is concerned with building a system that scans Korean text and identifies loanwords¹ spelled in Hangul, the Korean alphabet. Such a system can be useful in many ways. First, one can use the system to collect data to study various aspects of loanwords (e.g. Haspelmath and Tadmor, 2009) or to develop machine transliteration systems (e.g. Knight and Graehl, 1998; Ravi and Knight, 2009). Loanwords or transliterations (e.g. 플라스크) can be extracted from monolingual corpora by running the system alone. Transliteration pairs (e.g. <flask, 플라스크>) can be extracted from parallel corpora by first identifying the output with the system and then matching input forms based on scoring heuristics such as phonetic similarity (e.g. Yoon et al., 2007). Second, the system allows one to use the etymological origin of a word as a feature and be more discriminating in text processing. For example, grapheme-to-phoneme conversion in Korean (Yoon and Brew, 2006) and stemming in Arabic (Nwesri, 2008) can be improved by keeping separate rules for native words and loanwords. The system can be used to classify a given word into either category and apply the proper set of rules.

The loanword identification system envisioned here is a binary, character-based n-gram classifier. Given a word (w) spelled in Hangul, the classifier decides whether the word is of native (N) or foreign (F) origin by Bayesian classification, i.e. solving the following equation:

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(w \mid c) \, P(c) \qquad (1)

The likelihood P(w | c) is calculated using a character n-gram model specific to that class.

¹ In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn, 1999).

The classifier is trained on a corpus in an unsupervised manner, building on seed words extracted from the corpus. The native seed consists of words with high token frequency in the corpus. The idea is that frequent words are more likely to be native words than foreign words. The foreign seed consists of words that contain what appear to be traces of vowel insertion. Korean does not have words that begin or end with consonant clusters. As in many other languages with similar phonotactics (e.g. Japanese), foreign words with consonant clusters are transliterated with vowels inserted to break up the clusters. So the presence of substrings that resemble traces of insertion suggests that a word may be of foreign origin. An obvious problem is deciding a priori what those traces look like. Here the problem is resolved by a heuristic based on phoneme co-occurrence statistics and rudimentary ideas and findings in phonology.

The rest of the paper is organized as follows. In Section 2, I discuss previous studies in foreign word identification as well as ideas and findings in phonology that the present study builds on. I describe the proposed method for developing the unsupervised classifier in detail in Section 3. I discuss experiments that evaluate the effectiveness of the method in Korean in Section 4 and pilot experiments in Japanese that explore its applicability to other languages in Section 5. I conclude the paper in Section 6.

2 Background

This work is motivated by previous studies on identifying loanwords or foreign words in monolingual data. Many of them rely on the assumption that the distribution of strings of sublexical units such as phonemes, letters, and syllables differs between words of different origins. Some write explicit, categorical rules stating which substrings are characteristic of foreign words (e.g. Bali et al., 2007; Khaltar and Fujii, 2009). Some train letter or syllable n-gram models separately for native words and foreign words and compare the two. It has been shown that the n-gram approach can be very effective in Korean (e.g. Jeong et al., 1999; Oh and Choi, 2001).

Training the n-gram models is straightforward with labeled data in which words are tagged either native or foreign. But creating labeled data can be expensive and tedious. In response, some have proposed methods for generating pseudo-annotated data: Baker and Brew (2008) for Korean and Goldberg and Elhadad (2008) for Hebrew.

In both studies, the authors suggest generating pseudo-loanwords by applying transliteration rules to a foreign lexicon such as the CMU Pronouncing Dictionary. They suggest different methods for generating pseudo-native words. Baker and Brew extract words with high token frequencies in a Korean newswire corpus, assuming that frequent words are more likely to be native than foreign. Goldberg and Elhadad extracted words from a collection of old Hebrew texts, assuming that old texts are much less likely to contain foreign words than recent texts. The approach is effective, and a classifier trained on the pseudo-labeled data can perform comparably to a classifier trained on manually labeled data. Baker and Brew trained a logistic regression classifier using letter trigrams on about 180,000 pseudo-words, half pseudo-Korean and half pseudo-English. Tested on a labeled set of 10,000 native Korean words and 10,000 English loanwords, the classifier showed 92.4% classification accuracy. In comparison, the corresponding classifier trained on manually labeled data showed 96.2% accuracy in a 10-fold cross-validation experiment.

The pseudo-annotation approach obviates the need to manually label data. But one has to write a separate set of transliteration rules for every pair of languages. In addition, the transliteration rules may not be available to begin with, if the very purpose of identifying loanwords is to collect training data for machine transliteration. The foreign seed extraction method proposed in the present study is an attempt to reduce the level of language-specificity and the demand for additional natural language processing capabilities. The method essentially equips one with a subset of transliteration rules by presupposing a generic pattern in pronunciation change, i.e. vowel insertion.

The method should be applicable to many language pairs. The need to repair consonant clusters arises for many language pairs, and vowel insertion is a repair strategy adopted in many languages. Foreign sound sequences that are phonotactically illegal in the native language are usually repaired rather than overlooked. A common source of phonotactic discrepancy involves consonant clusters: different languages allow consonant clusters of different complexity. Maddieson (2013) identifies 151 languages that allow a wide variety of consonant clusters, 274 languages that allow only a highly restricted set of clusters, and 61 languages that do not allow clusters at all. Illegal clusters are repaired by vowel insertion or consonant deletion, but vowel insertion appears to be cross-linguistically more common (Kang, 2011).

The vowel insertion pattern is initially characterized only generically as "insert vowel X in position Y to repair consonant cluster Z." The generic nature of the characterization ensures language-neutrality. But in order for the pattern to be of any use, one must eventually flesh out the details and provide instances of the pattern equivalent to specific transliteration rules: insert [u] between the consonants to repair [sm], or [sm] → [sum], for example. Here the language-specific details of vowel insertion are discovered from a corpus in a data-driven manner, but the search process is guided by findings and ideas in phonology. As will be described in detail below, possible values of which vowel is inserted where are constrained based on typological studies of loanword adaptation (e.g. Kang, 2011) and vowel insertion (e.g. Hall, 2011). Possible consonant sequences originating from a cluster are delimited by the idea of the sonority sequencing principle (e.g. Clements, 1990).

3 Proposal

The goal is to build a Bayesian classifier made of two character n-gram models: one for native words (N) and the other for foreign words (F). That is,

\hat{c}(w) = \arg\max_{c \in \{N, F\}} P(c) \, P(w \mid c) = \arg\max_{c \in \{N, F\}} P(c) \prod_i P(g_i \mid g_{i-n+1}^{i-1}, c) \qquad (2)

where g_i is the i-th character of w and g_{i-n+1}^{i-1} is the string of n − 1 characters preceding it. In this study, the n-gram models use Witten-Bell smoothing (Witten and Bell, 1991) for its ease of implementation. That is,

P(g_i \mid g_{i-n+1}^{i-1}, c) = (1 - \lambda_c(g_{i-n+1}^{i-1})) \, P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) + \lambda_c(g_{i-n+1}^{i-1}) \, P(g_i \mid g_{i-n+2}^{i-1}, c) \qquad (3)

So the parameters of the classifier consist of P(c), P_{mle}(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}).
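To make the model concrete, here is a minimal sketch in Python of a character bigram model with Witten-Bell smoothing, corresponding to equations (2) and (3). The class and method names are illustrative, not from the original implementation, and counts may be fractional to accommodate the soft assignments introduced below.

```python
import math
from collections import defaultdict

class BigramModel:
    """Character bigram model with Witten-Bell smoothing (eq. 3),
    interpolating the bigram MLE with a unigram lower-order model."""

    def __init__(self):
        self.bigram = defaultdict(float)    # count of (previous, current)
        self.context = defaultdict(float)   # count of previous character
        self.followers = defaultdict(set)   # distinct characters seen after previous
        self.unigram = defaultdict(float)   # count of current character
        self.total = 0.0

    def add(self, word, weight=1.0):
        """Accumulate (possibly fractional) counts for one word."""
        chars = ["<s>"] + list(word) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            self.bigram[(prev, cur)] += weight
            self.context[prev] += weight
            self.followers[prev].add(cur)
            self.unigram[cur] += weight
            self.total += weight

    def prob(self, prev, cur):
        """P(cur | prev), backing off to the unigram model (eqs. 3 and 6)."""
        p_uni = self.unigram[cur] / self.total if self.total else 0.0
        if self.context[prev] == 0:
            return p_uni
        n1plus = len(self.followers[prev])
        lam = n1plus / (n1plus + self.context[prev])  # Witten-Bell weight
        p_mle = self.bigram[(prev, cur)] / self.context[prev]
        return (1 - lam) * p_mle + lam * p_uni

    def logprob(self, word):
        """log P(w) under the model, used for seeding and classification."""
        chars = ["<s>"] + list(word) + ["</s>"]
        return sum(math.log(max(self.prob(p, c), 1e-12))
                   for p, c in zip(chars, chars[1:]))
```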

They can be estimated from data as follows:

P(c) = \frac{\sum_w z(w, c)}{\sum_w \sum_{c'} z(w, c')} \qquad (4)

P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) = \frac{\sum_w freq_w(g_{i-n+1}^{i}) \, z(w, c)}{\sum_w freq_w(g_{i-n+1}^{i-1}) \, z(w, c)} \qquad (5)

\lambda_c(g_{i-n+1}^{i-1}) = \frac{N_{1+}(g_{i-n+1}^{i-1})}{N_{1+}(g_{i-n+1}^{i-1}) + \sum_w freq_w(g_{i-n+1}^{i-1}) \, z(w, c)} \qquad (6)

Here z(w, c) indicates whether w is classified as c: z(w, c) = 1 if it is and z(w, c) = 0 otherwise. freq_w(x) is the number of times x occurs in w. N_{1+}(g_{i-n+1}^{i-1}) is the number of different n-grams with prefix g_{i-n+1}^{i-1} that occur at least once.

The challenge here is that the training corpus is unlabeled, i.e. z(w, c) is hidden. I use variants of the EM algorithm to iteratively guess z(w, c) and update the parameters. The n-gram models are initialized with seed words extracted from the corpus. For the native class, I use high-frequency words in the corpus as seed words: for example, all words whose token frequency is in the 95th percentile. For the foreign class, I first use sublexical statistics to list phoneme strings that would result from vowel insertion and then use words that contain those phoneme strings as seed words. Below I describe in detail how foreign seed words are extracted and how the seeded classifier is iteratively trained.
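The following sketch, built on the BigramModel above, shows how equations (4) and (5) turn (possibly fractional) assignments ẑ(w, c) into a class prior and two weighted-count models; the function and argument names are illustrative.

```python
def fit_classifier(words, z_hat):
    """Estimate P(c) (eq. 4) and the two n-gram models (eqs. 5-6) from
    assignments z_hat[(word, class)] in [0, 1]; a sketch, not the
    original implementation."""
    models = {"N": BigramModel(), "F": BigramModel()}
    mass = {"N": 0.0, "F": 0.0}
    for w in words:
        for c in ("N", "F"):
            weight = z_hat.get((w, c), 0.0)
            mass[c] += weight
            models[c].add(w, weight)  # weighted n-gram counts feed eqs. 5-6
    total = sum(mass.values())
    prior = {c: mass[c] / total for c in mass}
    return prior, models
```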

3.1 Foreign seed extraction

The method aims to identify loanwords whose original forms contain consonant clusters and use them as foreign seed words. This is done by string/pattern matching, where the pattern consists of phoneme strings that can result from vowel insertion. Consonant clusters do not begin or end syllables in Korean. When foreign words are borrowed, consonant clusters are repaired by inserting a vowel somewhere next to the consonants to break the cluster into separate syllables. Speakers usually insert the same vowel in the same position to repair a given consonant cluster. As a result, transliterations of different words with the same consonant cluster all share a common substring showing a trace of insertion. For example, 트라이 (try), 트레인 (train), 트리 (tree), 트롤 (troll), and 트루 (true) all have 트ㄹ, which is pronounced [tʰɨɾ]. The idea is to figure out in advance what those signature substrings are and look for words that have them. There is a risk of false positives, since such substrings may exist for reasons other than vowel insertion. But the hope is that the seeded classifier will gradually learn to be discriminating and use other substrings in words for further disambiguation.

The phoneme strings defining the pattern are specified below as tuples of the form <C₁C₂, V_id, V_loc> for ease of description. Each tuple characterizes a phoneme string made of two consonants and a vowel. C₁ and C₂ are the two consonants. V_id is the identity of the vowel. V_loc is the location of the vowel relative to the consonants, i.e. between, before, or after the consonants. For example, <s, n, ɨ, between> means [sɨn] as in [sɨnou] for 스노우 (snow), and <n, tʰ, ɨ, after> means [ntʰɨ] as in [hintʰɨ] for 힌트 (hint). The idea is to use C₁C₂ to specify consonants from a foreign cluster and V_id and V_loc to specify which vowel is inserted where to repair the cluster. Rather than manually listed using language expertise, the tuples are discovered from a corpus using the following heuristic:

1. List words that appear atypical compared with the native seed words.
2. Extract <C₁C₂, V_id, V_loc> tuples from the atypical words where
   (a) C₁C₂ respects the sonority sequencing principle, and
   (b) V_id and V_loc most strongly co-occur with C₁C₂ among all vowels.
3. Identify the most common V_id as the default vowel used for insertion. Keep tuples whose V_id matches the default vowel and throw away the rest.
4. Identify the most common V_loc of the default vowel as its site of insertion for clusters in each syllable position (onset or coda). Keep tuples whose V_loc matches the identified site of insertion and throw away the rest.

The basic idea is to find recurring combinations of a vowel and two consonants that potentially came from a foreign cluster. Step 1 defines the search space. It should be easier to see the target pattern if we zeroed in on loanwords. Native words have various morphological patterns that can obscure the target pattern. Of course, it is not yet known which words are loanwords. So instead the method avoids words similar to what are currently believed to be native words, i.e. the native seed words. Put differently, words dissimilar to the native seed words are tentatively loanwords. Here the similarity is measured by a word's length-normalized log probability, \frac{1}{|w|} \log P(w) for a word w of length |w|, according to a character n-gram model trained on the native seed words. A word is atypical if its probability ranks below a threshold percentile (e.g. 5%).
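A sketch of step 1, assuming the BigramModel above has been trained on the native seed; the function name and the percentile argument are illustrative.

```python
def atypical_words(words, native_model, percentile=5.0):
    """Step 1: rank words by length-normalized log probability under the
    native-seed model and return the bottom few percent as tentative
    loanwords for tuple extraction."""
    scored = sorted(words, key=lambda w: native_model.logprob(w) / len(w))
    cutoff = int(len(scored) * percentile / 100)
    return scored[:cutoff]
```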

Step 2 generates a first-pass list. Condition 2a delimits possible consonant sequences from a foreign cluster. According to the sonority sequencing principle, consonants in a syllable are ordered so that consonants of higher sonority appear closer to the vowel of the syllable. There are different proposals on what sonority is and how different classes of consonants rank on the sonority scale (e.g. Clements, 1990; Selkirk, 1984; Ladefoged, 2001). Here I simply classify consonants as either obstruents or sonorants (see Table 1) and stipulate that sonorants have higher sonority than obstruents. I also assume that the sonority of consonants does not change during transliteration, although their identities may change. For example, free changes from [fɹi] to [pʰɨɾi], but [pʰ] remains an obstruent and [ɾ] remains a sonorant. Accordingly, C₁C₂ must be obstruent-sonorant if it is from an onset cluster and sonorant-obstruent if it is from a coda cluster. To determine with certainty whether the consonants originally occupied onset or coda, I focus on phoneme strings found only at word boundaries. If C₁C₂ are the first two consonants of a word, they are from an onset. If they are the last two consonants of a word, they are from a coda.

<Insert Table 1 here>

Condition 2b is used to guess the vowel inserted to repair each cluster. Only one vowel is repeatedly used, so its co-occurrence with the consonants should be not only noticeable but the most noticeable among all vowels. Here the co-occurrence tendency is measured using pointwise mutual information:

PMI(C_1 C_2, V) = \log P(C_1 C_2, V) - \log P(C_1 C_2) \, P(V)

where V = <V_id, V_loc>.

The list is truncated to avoid false positives in steps 3 and 4. This is done by identifying the default vowel insertion strategy and keeping only the tuples consistent with it. Exactly which vowel is inserted where to repair a consonant cluster is context-specific. But a language that relies on vowel insertion for repair usually has a default vowel inserted in typical locations (cf. Uffmann, 2006). Here it is assumed that the default vowel is the one used to repair the most diverse set of consonant clusters. So it is the most frequent vowel in the list. Similarly, its default site of insertion is in principle its most frequent location in the list. But possible sites of insertion differ for onset clusters and coda clusters: before or between the consonants in an onset, but after or between the consonants in a coda (Hall, 2011). So the default site of insertion is identified separately for onset and coda.
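Condition 2b can be computed directly from co-occurrence counts. Below is a sketch that, given observed (C₁C₂, <V_id, V_loc>) pairs pulled from the atypical words, picks the vowel-and-location candidate with the highest PMI for each consonant pair; the names are illustrative and the pair-extraction step is assumed to have been done separately.

```python
import math
from collections import Counter

def best_vowel_per_cluster(observations):
    """observations: iterable of (c1c2, (v_id, v_loc)) pairs extracted
    from atypical words. Returns, for each consonant pair, the
    <V_id, V_loc> maximizing PMI(C1C2, V)."""
    observations = list(observations)
    joint = Counter(observations)
    cc_counts = Counter(cc for cc, _ in observations)
    v_counts = Counter(v for _, v in observations)
    n = len(observations)
    best = {}
    for (cc, v), k in joint.items():
        # PMI = log P(cc, v) - log P(cc) P(v) = log(k * n / (count(cc) * count(v)))
        pmi = math.log(k * n / (cc_counts[cc] * v_counts[v]))
        if cc not in best or pmi > best[cc][0]:
            best[cc] = (pmi, v)
    return {cc: v for cc, (_, v) in best.items()}
```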

3.2 Bootstrapping with EM

The parameters (θ) to estimate are P(c), P_{mle}(g_i | g_{i-n+1}^{i-1}, c), and λ_c(g_{i-n+1}^{i-1}). The first parameter, P(c), is initialized according to some assumption about what proportion of words in the given corpus are loanwords. For example, if one assumes that 5% are loanwords, P(N) = 0.95 and P(F) = 0.05. The latter two parameters, which define the n-gram models, are initialized using the seed words as if they were labeled data: z(w, N) = 1 and z(w, F) = 0 for native seed words, and z(w, N) = 0 and z(w, F) = 1 for foreign seed words. Note that other words in the corpus are not used to initialize the n-gram models. The initial parameters are then updated on the whole corpus by iterating the following two steps until some stopping criterion is met.

E-step: Calculate the expected value of z(w, c) using the current parameters:

E[z(w, c)] = P(c \mid w; \theta^{(t)}) = \frac{P(w \mid c; \theta^{(t)}) \, P(c; \theta^{(t)})}{\sum_{c'} P(w \mid c'; \theta^{(t)}) \, P(c'; \theta^{(t)})} \qquad (7)

M-step: Transform the expected value to ẑ(w, c), i.e. some estimate of z(w, c), and plug it into equations (4-6) to update the parameters.

I experiment with three versions of the algorithm in the present study: soft EM, hard EM, and smoothstep EM. The three differ with respect to how E[z(w, c)] is transformed to ẑ(w, c). In soft EM, which is the same as the classic EM algorithm (Dempster et al., 1977), there is no transformation, i.e. ẑ(w, c) = E[z(w, c)]. In hard EM, ẑ(w, c) = 1 if c = arg max_{c'} E[z(w, c')] and ẑ(w, c) = 0 otherwise. Since there are only two classes here, this is equivalent to applying a threshold function at 0.5 to E[z(w, c)]. In smoothstep EM, a smooth step function is applied instead of the threshold function: ẑ(w, c) = f³(E[z(w, c)]), where f(x) = 3x² − 2x³ is applied three times. Figure 1 illustrates how E[z(w, c)] is transformed to ẑ(w, c) by the three variants of the EM algorithm.

<Insert Figure 1 here>
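The three transforms, and one full E/M sweep, can be sketched as follows, building on the helpers above; this is a sketch under the reading that smoothstep composes f with itself three times.

```python
import math

def transform(e, variant):
    """Map E[z(w, c)] to ẑ(w, c) for the three EM variants (cf. Figure 1).
    Smoothstep applies f(x) = 3x² - 2x³ three times, sharpening the
    curve toward a step while keeping it smooth."""
    if variant == "soft":
        return e
    if variant == "hard":
        return 1.0 if e >= 0.5 else 0.0
    f = lambda x: 3 * x ** 2 - 2 * x ** 3
    return f(f(f(e)))

def em_iteration(words, prior, models, variant="smoothstep"):
    """One E/M sweep over the corpus, a sketch built on fit_classifier."""
    z_hat = {}
    for w in words:
        # E-step (eq. 7): posterior over classes, computed from log scores.
        scores = {c: math.log(prior[c]) + models[c].logprob(w) for c in prior}
        m = max(scores.values())
        post = {c: math.exp(s - m) for c, s in scores.items()}
        total = sum(post.values())
        for c in post:
            z_hat[(w, c)] = transform(post[c] / total, variant)
    # M-step: re-estimate the prior and both models (eqs. 4-6).
    return fit_classifier(words, z_hat)
```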

As will be shown in the experiments below, soft EM is aggressive while hard EM is conservative in recruiting words to the foreign class. Soft EM gives partial credit even to words that are very unlikely to be foreign according to the current model. Over time, such words may manage to gain enough confidence and be considered foreign. Some of them may turn out to be false positives. On the other hand, hard EM does not give any credit even to words that are just barely below the threshold to be considered foreign. Some of them may turn out to be false negatives. Smoothstep EM is a compromise between the two extremes. It virtually ignores words that do not stand a chance but gives due credit to words that barely missed.

4 Experiments

Experiments show that the proposed approach can be effective in Korean despite its unsupervised nature. Classifiers built on a raw corpus with minor preprocessing (e.g. removing tokens with non-Hangul characters) identify loanwords in test lexicons well. The foreign seed extraction method correctly identifies the default vowel insertion strategy in Korean loanword phonology. The resulting classifier performs better when initialized with the proposed seeding method than with random seeding. Its performance is not far behind the corresponding supervised classifier either. Moreover, after exposure to the words (but not their labels) used to train the supervised classifier, the unsupervised classifier performs at a level comparable to the supervised classifier. I discuss the details of the experiments below.

4.1 Methods

I use four datasets called SEJONG, KAIST, NIKL-1, and NIKL-2 below. SEJONG and KAIST are unlabeled data used to initialize and train the unsupervised classifier. SEJONG consists of 1,019,853 types and 9,206,430 tokens of eojeols, which are character strings delimited by white space, equivalent to words or phrases. The eojeols are from a morphologically annotated corpus developed in the 21st Century Sejong Project under the auspices of the Ministry of Culture, Sports, and Tourism of South Korea, and the National Institute of the Korean Language (2011).

They were selected by extracting Hangul character strings delimited by white space after removing punctuation marks. Strings that contained non-Hangul characters (e.g. 12월의, Farrington으로부터) were excluded in the process. KAIST consists of 2,409,309 types and 31,642,833 tokens of eojeols from the KAIST corpus (Korea Advanced Institute of Science and Technology, 1997), extracted in the same way as SEJONG. NIKL-1 and NIKL-2 are labeled data used to test the classifier. They are made of words from various language resources released by the National Institute of the Korean Language (NIKL). NIKL-1 consists of 49,962 native words and 21,176 foreign words selected from two lexicons (NIKL, 2008, 2013). NIKL-2 consists of 44,214 native words and 18,943 foreign names selected from four reports released by NIKL (2000a,b,c,d) and a list of transliterated names of people and places originally spelled in Latin alphabets (NIKL, 2013). I examined the words manually and labeled them either native or foreign. Words of unknown or ambiguous etymological origin were excluded in the process. SEJONG and NIKL-1 are mainly used to examine the effectiveness of the proposed methods. KAIST and NIKL-2 are used to examine whether the methods are robust to varying data. See Table 2 for a summary of data sizes.

<Insert Table 2 here>

The proposed methods are implemented as follows. All n-gram models are trained on character bigrams, where each Hangul character represents a syllable. The high-frequency words defining the native seed are eojeols whose token frequency is above the 95th percentile in a given corpus. When extracting the foreign seed, the so-called atypical words are eojeols whose length-normalized n-gram probabilities lie in the bottom 5% according to the model trained on the native seed. Their phonetic transcriptions are generated by applying the simple rewrite rules in Appendix A. For bootstrapping, the prior probabilities are initialized to P(c = N) = 0.95 and P(c = F) = 0.05. The parameters of the classifier are iteratively updated until the average likelihood of the data improves by no more than 0.01% or the number of iterations reaches 100.

Classification performance is measured in terms of precision, recall, and F-score. Here, precision (p) is the percentage of words correctly classified as foreign out of all words classified as foreign. Recall (r) is the percentage of words correctly classified as foreign out of all words that should have been classified as foreign. F-score is the harmonic mean of the two with equal emphasis on both, i.e. F = 2pr/(p + r).
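For concreteness, the three scores can be computed as follows (a straightforward sketch; "F" marks the foreign class and "N" the native class):

```python
def scores(gold, predicted):
    """Precision, recall, and F-score on the foreign class, as defined above."""
    tp = sum(g == "F" and p == "F" for g, p in zip(gold, predicted))
    fp = sum(g == "N" and p == "F" for g, p in zip(gold, predicted))
    fn = sum(g == "F" and p == "N" for g, p in zip(gold, predicted))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```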

To put the numbers in perspective, scores of classifiers built using the proposed methods are compared with those of supervised classifiers and randomly seeded classifiers. Supervised classifiers are trained and tested on the labeled data (NIKL-1 or NIKL-2) using five-fold cross-validation. The labeled data is partitioned into five equal-sized subsets. The supervised classifier is trained on four subsets and tested on the remaining subset. This is repeated five times for the five different combinations of subsets. Randomly seeded classifiers are unsupervised classifiers with just a different seeding strategy: 5% of the words in the corpus are randomly chosen as foreign seed words and the rest are native seed words. For fair comparison, the unsupervised classifiers are also tested five separate times on the five subsets of labeled data that the supervised classifier is tested on. Accordingly, the classification scores reported below are the arithmetic means of the scores on the five subsets.

4.2 Results and discussion

The foreign seed extraction method correctly identifies the default vowel insertion strategy. Table 3 lists the number of different consonant clusters for which each vowel in Korean is selected as the top candidate. [ɨ] is predicted to be the default vowel, as it is chosen most often overall. Its predicted site of insertion for onset clusters is between the consonants of each cluster, as it is chosen more often there than before the consonants. Similarly, its predicted site of insertion for coda clusters is after the consonants of each cluster rather than between the consonants.

<Insert Table 3 here>

The 28 phoneme strings made of the default vowel and the consonant pairs it allegedly separates are listed in the row labeled SEJONG in Table 4. They specify what traces of vowel insertion would look like and define the pattern matched against the atypical words to extract the foreign seed. All but three of them indeed occur as traces of vowel insertion in one or more loanwords in the entire data used for the present study. The foreign seed consists of 2,500 eojeols (out of 50,992 atypical ones) that contain one or more of the phoneme strings. The foreign seed does contain false positives, but their proportion is not that big: 489/2,500 (19.56%). Since SEJONG is unlabeled and too large, it is hard to tell what percentage of loanwords the foreign seed represents.

But if one extracted all atypical words in NIKL-1 that contained the phoneme strings, it would return a foreign seed containing 458/21,176 = 2.16% of all the loanwords in the dataset. So the foreign seed is small in size and represents a tiny fraction of loanwords.

<Insert Table 4 here>

The seeded classifier can be trained effectively with smoothstep EM (see row 2 in Table 5 for scores). Despite the small seed, recall is high (85.51%) without compromise in precision (94.21%). The scores are, of course, lower than those of the supervised classifier (see row 1 in Table 5). Precision is lower by 2.67% points and recall is lower by 10.95% points. But considering the unsupervised nature of the approach, the scores are encouraging.

The classifier performs better when trained with smoothstep EM than with the other two variants of EM (see rows 4 and 5 in Table 5). Precision is just as high but recall is a bit lower (80.16%) when trained with hard EM. On the other hand, precision is miserable (47.81%) although recall is higher (91.46%) when trained with soft EM. Figure 2 illustrates how well the classifier performs on NIKL-1 over time as it is iteratively trained on SEJONG with the three variants of EM. Right after initialization, the scores of the classifier are precision = 93.82% and recall = 52.07%. All three variants boost recall significantly within the first several iterations. Soft EM is the most successful, followed by smoothstep EM and then hard EM. But while the other two not only maintain but also marginally improve precision, soft EM steadily loses precision throughout the whole training session.

<Insert Figure 2 here>

Bootstrapping is more effective with the proposed seeding method than with random seeding. Scores of three different randomly seeded classifiers trained with smoothstep EM are listed in rows 6-8 in Table 5. Compared to the proposed classifier, although their precision is higher by around 1% point, their recall is lower by around 14% points. But their performance is rather consistent as well as strong and deserves a closer look. The three randomly seeded classifiers all followed a similar trajectory as they evolved. To briefly describe the process using a clustering analogy: the foreign cluster, which started out as a random set of 50,992 eojeols (5% of the corpus), immediately shrank to a much smaller set including eojeols with hapax character bigrams, i.e. bigrams whose type frequency in the corpus is one. For one of the three classifiers, the foreign cluster shrank to a set of 5,421 eojeols as soon as training began, and 2,061 of them contained hapax bigrams.

It is likely that many words containing hapax bigrams were loanwords, and the foreign cluster eventually grew around them. In fact, among the 4,378 words in NIKL-1 containing character bigrams that appear only once in SEJONG, 1,601 are native words and 2,777 are loanwords. The process makes intuitive sense. At the beginning, the foreign cluster is overwhelmed in size by the native cluster and unlikely to have homogeneous subclusters due to random initialization. Eojeols in the foreign cluster will be absorbed by the native cluster unless they have bigrams that seem alien to the native cluster. Hapax bigrams are a prime example of such bigrams, and as a result they figure more prominently in the foreign cluster. Loanwords are alien to begin with, so it makes sense that they are more likely to have hapax bigrams than native words. The dynamics involving data size, randomness, hapax bigrams, and loanwords are indeed interesting and did lead to good classifiers. But at the moment, it is not clear whether they are reliable and predictable. More importantly, the proposed seeding method led to significantly better classifiers.

Robustness to noise: The proposed methods are effective despite some noise in the training data. There are two sources of noise in SEJONG: crude grapheme-to-phoneme conversion (G2P) and lack of morphological processing.

G2P generates the phonetic transcriptions required for foreign seed extraction. In the experiments above, the transcriptions were generated by applying a rather simple set of rules. Grapheme-phoneme correspondence in Hangul is quite regular, but there are phonological patterns such as coda neutralization and tensification (Sohn, 1999) that the rules do not capture. Accordingly, the resulting transcriptions are decent approximations but occasionally incorrect. In fact, when the rules are tested on 14,007 words randomly chosen from the Standard Korean Dictionary, word accuracy and phoneme accuracy are 67.92% and 94.67%. One could ask if the proposed methods would perform better with more accurate transcriptions. An experiment with a better G2P suggests that the approximate transcriptions are good enough. A joint 5-gram model (Bisani and Ney, 2008) was trained on 126,068 words from the Standard Korean Dictionary. The model transcribes words in SEJONG differently from the rules: by 36.62% in terms of words and 5.53% in terms of phonemes. The model's transcriptions are expected to be more accurate: its word accuracy and phoneme accuracy on the 14,007 words mentioned above are 95.30% and 99.35%. Building the classifier from scratch using the new transcriptions barely changes the results.

The foreign seed extraction method again correctly identifies the default vowel insertion strategy. It identifies [ɨ] as the default vowel, inserted between the consonants in onset and after the consonants in coda. It picks 31 phoneme strings including the vowel as potential traces of insertion (see SEJONG-g2p in Table 4). All but four of them have example loanwords in which they occur as traces of vowel insertion. The set of phoneme strings is similar to the one identified before, with a 73.53% overlap between the two. The resulting foreign seed is even more similar to the previous seed, with an 84.35% overlap between the two. The new seed is slightly larger than the previous seed (2,527 vs. 2,500 words) but has a higher proportion of false positives (20.66% vs. 19.56%). The two seeds lead to very similar classifiers trained with smoothstep EM. The two trained classifiers tag 99.39% of the words in NIKL-1 in the same way, and their scores differ by only 0.24%-0.48% points (see row 9 in Table 5 for the new classification scores).

The training data in the experiments above include eojeols containing both native and foreign morphemes. Loanwords can be suffixed with native morphemes, combine with native words to form compounds, or both. A good example is 투자펀드를 (investment-fund-ACC), where 투자 and 를 are native and 펀드 is foreign. Such items may mislead the classifier to recruit false positives during training. One could ask if the performance of the proposed methods can be improved by stemming or further morpheme segmentation. Experiments suggest that they improve precision but at the sacrifice of recall. Data for the experiments consist of a set of 250,844 stems and a set of 132,430 non-suffix morphemes in SEJONG. Eojeols in SEJONG are morphologically annotated in the original corpus. For example, 투자펀드를 is annotated 투자/NNG + 펀드/NNG + 를/JKO. Stems were extracted by removing substrings tagged as suffixes and particles (e.g. 투자펀드를 → 투자펀드). Non-suffix morphemes were extracted by splitting the derived stems at specified morpheme boundaries (e.g. 투자펀드 → 투자 and 펀드). Two classifiers were built from scratch with rule-based transcriptions: one using the stems and the other using the morphemes.

The foreign seed extraction method is as effective as when it was applied to eojeols. It correctly identifies the default vowel and its site of insertion in both data sets. The phoneme strings identified as potential traces of insertion are listed in the rows labeled SEJONG-stem and SEJONG-morph in Table 4. As before, many of them are indeed found in loanwords because of vowel insertion, while a few of them are not. The resulting seeds are much smaller but contain proportionally fewer false positives than before: 59/642 (9.20%) and 58/323 (17.96%) when using stems and morphemes, respectively, vs. 489/2,500 (19.56%) when using eojeols.

Scores of the seeded classifiers trained with smoothstep EM are listed in rows 10 and 11 in Table 5. Compared to the classifier trained on eojeols, precision improves by 1.55 and 2.14% points, respectively, but recall plummets, by as much as 23.81% points. The gain in precision is tiny compared to the loss in recall. Perhaps one could prevent the loss in recall by adding more data. But the current results suggest that the proposed methods are good enough, if not better off, without morphological processing.

Robustness to varying data: Experiments with different Korean data suggest that the proposed methods are effective in Korean in general rather than only on the particular data used above. A new classifier was built from scratch on KAIST using rule-based transcriptions and smoothstep EM and tested on NIKL-2. Its performance was compared with the unsupervised classifier trained on SEJONG and a new supervised classifier trained on subsets of NIKL-2. The foreign seed extraction method again correctly identifies the default vowel and its site of insertion. It picks 26 phoneme strings including the vowel as potential traces of insertion (see KAIST in Table 4). All but one of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 4,179 eojeols. The seed contains relatively more false positives (27.35%) than when using eojeols in SEJONG (19.56%). But the scores of the SEJONG classifier and the resulting KAIST classifier tested on NIKL-2 are barely different (see rows 13 and 15 in Table 5). The SEJONG classifier is behind the supervised classifier by 5.31% points in precision and 11.20% points in recall (see row 12 in Table 5 for scores of the supervised classifier). The difference is slightly larger than the difference observed with NIKL-1. This is most likely because SEJONG is more different from NIKL-2 than it is from NIKL-1. The perplexity of a character bigram model trained on SEJONG is higher on NIKL-2 (564.55) than on NIKL-1 (484.18).

Adaptation: Unlike for the supervised classifier, the training data and the test data for the unsupervised classifiers come from different sources. For example, one unsupervised classifier was trained on SEJONG and tested on NIKL-1, while the supervised classifier compared with it was both trained and tested on NIKL-1. So the comparison between the two was not entirely fair. Experiments show that a simple adaptation method such as linear interpolation can fix the problem. In sum, a baseline classifier is interpolated with a new classifier that inherits parameters from the baseline classifier and is iteratively trained on adaptation data.

The classifiers are interpolated and make predictions according to the following equation:

\hat{c}(w) = \arg\max_{c} \, (1 - \lambda) \, P_{base}(w, c) + \lambda \, P_{new}(w, c) \qquad (8)

Here the baseline classifier is the classifier trained on words from an unlabeled corpus (e.g. SEJONG), and the adaptation data is the portion of the labeled data (e.g. NIKL-1) used to train the comparable supervised classifier. Of course, the adaptation data does not include labels from the original data. The idea is not to provide feedback but to merely expose the classifier to the kinds of words it will be asked to classify later. In the experiments, the new classifier was trained on 90% of the adaptation data with smoothstep EM, just like the baseline classifier. The interpolation weights were estimated using the remaining 10% with the classic EM algorithm. Applying the method to adapt the SEJONG and KAIST classifiers to the NIKL data significantly improves their performance. F-scores of the unsupervised classifiers after adaptation are behind those of the comparable supervised classifiers by no more than 2.5% points. See rows 3, 14, and 16 in Table 5 for scores after adaptation.

<Insert Table 5 here>
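A sketch of the interpolated decision rule in equation (8); base_score and new_score are assumed to return the joint score P(w, c) = P(c) P(w | c) of the respective classifiers, and lam is the weight estimated on the held-out adaptation data.

```python
def classify_interpolated(word, base_score, new_score, lam):
    """Decision rule of eq. (8): mix the baseline and adapted classifiers."""
    def score(c):
        return (1 - lam) * base_score(word, c) + lam * new_score(word, c)
    return max(("N", "F"), key=score)
```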

5 Applicability to other languages: a pilot study in Japanese

Ideally, the proposed approach should work with any language that does not allow consonant clusters and relies on vowel insertion to repair foreign clusters. In this section, I demonstrate its potential applicability with a pilot study in Japanese. In addition to not allowing consonant clusters, Japanese does not allow consonants in coda except the moraic nasal (e.g. [san]) and the first part of a geminate obstruent that straddles two syllables (e.g. [kip.pu]). The vowel inserted for repair is usually [u] (e.g. フランス [huransu] for "France"), but [o] for the coronal stops [t] and [d] (e.g. トレンド [torendo] for "trend"). It is inserted between the consonants to repair onset clusters and after the consonants to repair coda clusters beginning with [n]. But for other coda clusters, it is inserted after each consonant of the cluster (e.g. ヘルス [herusu] for "health"). The patterns are similar to Korean, so the approach should work without much modification.

The data for the experiment consist of 108,816 words for training and 148,128 words for testing. The training data came from the JEITA corpus (Hagiwara, 2013). Word boundaries and pronunciations are not obvious in raw Japanese text: words are not delimited by white space and are sometimes spelled in kanji, which are logographic, rather than hiragana or katakana, which are phonographic. Fortunately, the corpus comes with the words segmented and additionally spelled in katakana. It is those katakana spellings that constitute the training data. The test data came from JMDict (Breen, 2004), a lexicon annotated with various information, including pronunciation transcribed in either hiragana or katakana and the source language if a word is a loanword. Since loanwords in Japanese are spelled in katakana, I labeled words spelled without any katakana characters as native and words that had language source information and were spelled only in katakana as foreign. This led to a test set of 130,237 native words and 17,891 foreign words.

Some of the words in the training and test data were respelled to make the classification task non-trivial. First, all words in hiragana were respelled in katakana (e.g. それ → ソレ). Otherwise, one could simply label any word in hiragana as native and avoid false positives. Second, all instances of choonpu were replaced with the proper vowel characters given the context (e.g. ハープーン [haapuun] "harpoon" → ハアプウン). The choonpu character ー in katakana indicates long vowels, which in hiragana are indicated by adding an extra vowel character. Without the correction, one could simply label words with choonpu as foreign and identify a significant portion of loanwords.

The n-gram models in the experiment were trained on katakana character bigrams. Phonetic transcriptions for foreign seed extraction were generated essentially by romanization. Katakana symbols were romanized following the Nihon-shiki system (e.g. シャツ → syatu) and each letter was mapped to the corresponding phonetic symbol (e.g. syatu → [sjatu]). All other aspects of the experiment were set up in the same way as in the experiments in Korean.
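The hiragana-to-katakana respelling can be done with plain Unicode arithmetic, since the two blocks are parallel; a sketch (the choonpu replacement, which needs context, is not shown):

```python
def hiragana_to_katakana(text):
    """Respell hiragana in katakana (e.g. それ → ソレ) so that script alone
    cannot give a word's class away; hiragana U+3041-U+3096 maps onto
    katakana by a fixed offset of 0x60 code points."""
    return "".join(chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
                   for ch in text)
```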

The results appear promising. The foreign seed extraction method identifies [u] as the default vowel and its site of insertion as between the consonants in onset and after the consonants in coda. It picks 14 phoneme strings including the vowel as potential traces of insertion (see JEITA in Table 4). Eight of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 173 words that include 68 false positives (46.26%). It is encouraging that the method correctly identifies the default vowel insertion strategy. But the resulting foreign seed is quite small, partly because the corpus is small to begin with, and less accurate than the seeds in the Korean experiments. Classification scores are listed in Table 5. Overall, the scores are lower than the scores achieved in Korean. Considering that the scores are lower even for the supervised classifier, it seems that character bigrams are less effective in Japanese than in Korean. As expected from the size of the foreign seed, recall of the unsupervised classifier is quite low. But after adaptation to the lexicon, recall improves significantly and the F-score is not far behind that of the supervised classifier.

6 Conclusion

I proposed an unsupervised method for developing a classifier that identifies loanwords in Korean text. As shown in the experiments discussed above, the method can yield an effective classifier that can be made to perform at a level comparable to that of a supervised classifier. The method is cost-efficient, as it does not require language resources other than a large monolingual corpus, a grapheme-to-phoneme converter, and perhaps a lexicon to supplement the corpus. The method is in principle applicable to a wide range of languages, i.e. those that rely on vowel insertion to repair illegal consonant clusters. Results from the pilot experiment in Japanese were encouraging. Future studies will further explore the applicability of the method to other languages, especially under-resourced languages.

References

Baker, K. and Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 08).

Bali, R.-M., Chong, C. C., and Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th International Symposium on Natural Language Processing.

Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5).

Breen, J. (2004). JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources. Association for Computational Linguistics.

Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In Kingston, J. and Beckman, M., editors, Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. Cambridge: Cambridge University Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).

Goldberg, Y. and Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational Linguistics and Intelligent Text Processing. Springer Berlin-Heidelberg.

Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in ChaSen format).

Hall, N. (2011). Vowel epenthesis. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology. Malden, MA & Oxford: Wiley-Blackwell.

Haspelmath, M. and Tadmor, U. (2009). Loanwords in the World's Languages: A Comparative Handbook. Walter de Gruyter.

Jeong, K. S., Myaeng, S. H., Lee, J. S., and Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35.

Kang, Y. (2011). Loanword phonology. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology. Malden, MA & Oxford: Wiley-Blackwell.

Khaltar, B.-O. and Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing & Management, 45(4).

Knight, K. and Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4).

Korea Advanced Institute of Science and Technology (1997). Automatically analyzed large scale KAIST corpus [Data file].

Ladefoged, P. (2001). A Course in Phonetics, 4th edition. Orlando: Harcourt Brace.

Maddieson, I. (2013). Syllable structure. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language (2011). The 21st century Sejong project [Data file].

NIKL (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document.

NIKL (2000b). pyojuneo geomtoyong jaryo. Resource document.

NIKL (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document.

NIKL (2000d). yongeon hwalyongpyo. Resource document.

NIKL (2008). Survey of the state of loanword usage [Data file].

NIKL (2013). oeraeeo pyogi yongrye jaryo: romaja inmyeonggwa jimyeong. Resource document.

Nwesri, A. F. A. (2008). Effective Retrieval Techniques for Arabic Text. PhD thesis, RMIT University, Melbourne, Australia.

Oh, J.-H. and Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the International Conference on Computer Processing of Oriental Languages.

Ravi, S. and Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Selkirk, E. (1984). On the major class features and syllable theory. In Aronoff, M. and Oerhle, R. T., editors, Language Sound Structure: Studies in Phonology Presented to Morris Halle by His Teachers and Students. Cambridge, MA: MIT Press.

Sohn, H.-M. (1999). The Korean Language. Cambridge: Cambridge University Press.

Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7).

Witten, I. H. and Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4).

Yoon, K. and Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4).

Yoon, S.-Y., Kim, K.-Y., and Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.

Appendix A. Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing the syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Letter       Phoneme(s)   Letter       Phoneme(s)   Letter    Phoneme(s)
ᄀ            k            ᄁ            k*           ᄂ         n
ᄃ            t            ᄄ            t*           ᄅ (onset)  ɾ
ᄅ (coda)     l            ᄆ            m            ᄇ         p
ᄈ            p*           ᄉ            s            ᄊ         s*
ᄋ (onset)    (null)       ᄋ (coda)     ŋ            ᄌ         tʃ
ᄍ            tʃ*          ᄎ            tʃʰ          ᄏ         kʰ
ᄐ            tʰ           ᄑ            pʰ           ᄒ         h
ㅏ            a            ㅑ            ja           ㅐ         æ
ᅤ            jæ           ᅥ            ʌ            ᅧ         jʌ
ᅦ            e            ᅨ            je           ᅩ         o
ᅭ            jo           ᅪ            wa           ᅫ         wæ
ᅬ            ø            ᅮ            u            ᅲ         ju
ᅯ            wʌ           ᅰ            we           ᅱ         wi
ᅳ            ɨ            ᅵ            i            ᅴ         ɨi
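The first step, decomposing a precomposed Hangul syllable into its letters, follows directly from the arithmetic layout of the Unicode Hangul Syllables block; a sketch:

```python
# Jamo inventories in the standard Unicode order for precomposed syllables.
LEADS = list("ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ")
VOWELS = list("ᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ")
TAILS = [""] + list("ᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ")

def decompose(syllable):
    """Split one precomposed syllable into lead, vowel, and (optional) tail
    letters, e.g. 한 → (ᄒ, ᅡ, ᆫ); the letters are then mapped to phonemes
    with the table above."""
    index = ord(syllable) - 0xAC00
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead], VOWELS[vowel], TAILS[tail]
```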

Table 1: Korean phonemes and their place in the proposed sonority hierarchy.

Class        Phonemes
Obstruents   p p* pʰ t t* tʰ k k* kʰ s s* h tʃ tʃ* tʃʰ
Sonorants    m n ŋ ɾ l w j
Vowels       a e i o u æ ʌ ø ɨ ɨi

Table 2: Data sizes in number of unique words or eojeols.

Class     SEJONG     KAIST      NIKL-1   NIKL-2   JEITA     JMDict
Native    unknown    unknown    49,962   44,214   unknown   130,237
Foreign   unknown    unknown    21,176   18,943   unknown   17,891
Total     1,019,863  2,409,309  71,138   63,157   108,816   148,128


Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Discovering suffixes: A Case Study for Marathi Language

Discovering suffixes: A Case Study for Marathi Language Discovering suffixes: A Case Study for Marathi Language Mudassar M. Majgaonker Comviva Technologies Limited Gurgaon, India Abstract Suffix stripping is a pre-processing step required in a number of natural

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Document Image Retrieval using Signatures as Queries

Document Image Retrieval using Signatures as Queries Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and

More information

Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus

Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus Jian Zhang Department of Computer Science and Technology of Tsinghua University, China ajian@s1000e.cs.tsinghua.edu.cn

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Reliable and Cost-Effective PoS-Tagging

Reliable and Cost-Effective PoS-Tagging Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Uwe D. Reichel Department of Phonetics and Speech Communication University of Munich reichelu@phonetik.uni-muenchen.de Abstract

More information

Framework for Joint Recognition of Pronounced and Spelled Proper Names

Framework for Joint Recognition of Pronounced and Spelled Proper Names Framework for Joint Recognition of Pronounced and Spelled Proper Names by Atiwong Suchato B.S. Electrical Engineering, (1998) Chulalongkorn University Submitted to the Department of Electrical Engineering

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Sentiment analysis: towards a tool for analysing real-time students feedback

Sentiment analysis: towards a tool for analysing real-time students feedback Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: nabeela.altrabsheh@port.ac.uk Mihaela Cocea Email: mihaela.cocea@port.ac.uk Sanaz Fallahkhair Email:

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Identifying Learning Styles in Learning Management Systems by Using Indications from Students Behaviour

Identifying Learning Styles in Learning Management Systems by Using Indications from Students Behaviour Identifying Learning Styles in Learning Management Systems by Using Indications from Students Behaviour Sabine Graf * Kinshuk Tzu-Chien Liu Athabasca University School of Computing and Information Systems,

More information

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,

More information

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013 ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Using News Articles to Predict Stock Price Movements

Using News Articles to Predict Stock Price Movements Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Things to remember when transcribing speech

Things to remember when transcribing speech Notes and discussion Things to remember when transcribing speech David Crystal University of Reading Until the day comes when this journal is available in an audio or video format, we shall have to rely

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

An Arabic Text-To-Speech System Based on Artificial Neural Networks

An Arabic Text-To-Speech System Based on Artificial Neural Networks Journal of Computer Science 5 (3): 207-213, 2009 ISSN 1549-3636 2009 Science Publications An Arabic Text-To-Speech System Based on Artificial Neural Networks Ghadeer Al-Said and Moussa Abdallah Department

More information

Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics? Historical Linguistics Diachronic Analysis What is Historical Linguistics? Historical linguistics is the study of how languages change over time and of their relationships with other languages. All languages

More information

Interpreting areading Scaled Scores for Instruction

Interpreting areading Scaled Scores for Instruction Interpreting areading Scaled Scores for Instruction Individual scaled scores do not have natural meaning associated to them. The descriptions below provide information for how each scaled score range should

More information

Nonparametric statistics and model selection

Nonparametric statistics and model selection Chapter 5 Nonparametric statistics and model selection In Chapter, we learned about the t-test and its variations. These were designed to compare sample means, and relied heavily on assumptions of normality.

More information

Computer Aided Document Indexing System

Computer Aided Document Indexing System Computer Aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić, Jan Šnajder Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 0000 Zagreb, Croatia

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Oracle Watchlist Screening

Oracle Watchlist Screening 1 Oracle Watchlist Screening Mike Matthews 3 rd party logo 2 Topics Screening trends & needs Increasing screening data accuracy Reducing false positives Screening international data

More information

CS 533: Natural Language. Word Prediction

CS 533: Natural Language. Word Prediction CS 533: Natural Language Processing Lecture 03 N-Gram Models and Algorithms CS 533: Natural Language Processing Lecture 01 1 Word Prediction Suppose you read the following sequence of words: Sue swallowed

More information

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de

More information

Evaluation of Bayesian Spam Filter and SVM Spam Filter

Evaluation of Bayesian Spam Filter and SVM Spam Filter Evaluation of Bayesian Spam Filter and SVM Spam Filter Ayahiko Niimi, Hirofumi Inomata, Masaki Miyamoto and Osamu Konishi School of Systems Information Science, Future University-Hakodate 116 2 Kamedanakano-cho,

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

Turker-Assisted Paraphrasing for English-Arabic Machine Translation

Turker-Assisted Paraphrasing for English-Arabic Machine Translation Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University

More information

Lab 11. Simulations. The Concept

Lab 11. Simulations. The Concept Lab 11 Simulations In this lab you ll learn how to create simulations to provide approximate answers to probability questions. We ll make use of a particular kind of structure, called a box model, that

More information

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-Universität zu Berlin Introduction Context of work: Error-based online failure prediction: error

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

ScreenMatch: Providing Context to Software Translators by Displaying Screenshots

ScreenMatch: Providing Context to Software Translators by Displaying Screenshots ScreenMatch: Providing Context to Software Translators by Displaying Screenshots Geza Kovacs MIT CSAIL 32 Vassar St, Cambridge MA 02139 USA gkovacs@mit.edu Abstract Translators often encounter ambiguous

More information

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke 1 Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models Alessandro Vinciarelli, Samy Bengio and Horst Bunke Abstract This paper presents a system for the offline

More information

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Tibetan For Windows - Software Development and Future Speculations. Marvin Moser, Tibetan for Windows & Lucent Technologies, USA

Tibetan For Windows - Software Development and Future Speculations. Marvin Moser, Tibetan for Windows & Lucent Technologies, USA Tibetan For Windows - Software Development and Future Speculations Marvin Moser, Tibetan for Windows & Lucent Technologies, USA Introduction This paper presents the basic functions of the Tibetan for Windows

More information

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base 32 Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base Brant N. Kay Brian C. Rineer SAS Institute Inc. SAS Institute Inc. 100 SAS Campus Drive 100 SAS Campus Drive

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012 Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about

More information

Text-To-Speech Technologies for Mobile Telephony Services

Text-To-Speech Technologies for Mobile Telephony Services Text-To-Speech Technologies for Mobile Telephony Services Paulseph-John Farrugia Department of Computer Science and AI, University of Malta Abstract. Text-To-Speech (TTS) systems aim to transform arbitrary

More information

TELT March 2014 Exa miners Report

TELT March 2014 Exa miners Report TELT March 2014 Exa miners Report 1. Introduction 101 candidates sat for the TELT March 2014 examination session. 53 candidates were awarded Pass grades or higher. This is the equivalent to 52.5 % pass

More information

Probabilistic topic models for sentiment analysis on the Web

Probabilistic topic models for sentiment analysis on the Web University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information