A Suffix Stripping Algorithm for Odia Stemmer


Sampa Chaupattnaik, Sohag Sundar Nanda, Sanghamitra Mohanty
P.G. Department of Computer Science and Application, Utkal University

Abstract

Stemming is the process of reducing inflected words to their stem. It involves removing the suffix or prefix attached to a word. Because the process only finds the stem, it is not identical to morphological analysis. Stemming is used in information extraction systems to improve performance, and it reduces the number of terms in an information retrieval system. Various techniques are used for stemming. In this paper we present a suffix stripping algorithm for the Odia language.

Keywords: Suffix stripping, Odia, Stemmer

I. INTRODUCTION

Stemmers are used in information retrieval to reduce as many related words or word forms as possible to a common form, which need not be the base form. For example, the English word Organization has different forms such as Organiz, Organized, Organizing, Organizes etc. There are several types of stemming algorithms, which differ in performance and accuracy and in how they overcome certain stemming obstacles.

1. Brute-force algorithms: These stemmers employ a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned.

2. Suffix-stripping algorithms: Suffix stripping algorithms do not rely on a lookup table of inflected form and root form relations. Instead, a typically smaller list of "rules" is stored, which provides a path for the algorithm, given an input word form, to find its root form.

3. Lemmatisation algorithms: This process involves first determining the part of speech of a word and then applying different normalization rules for each part of speech. The part of speech is detected before attempting to find the root because, for some languages, the stemming rules change depending on a word's part of speech. This approach depends heavily on obtaining the correct lexical category (part of speech).

4. Stochastic algorithms: These algorithms use probability to identify the root form of a word. They are trained on a table of root form to inflected form relations to develop a probabilistic model.

5. Affix stemmers: In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. In the Odia language we find affixes of this type for nouns. For example: the words, here, are the prefixes used in the Odia language.

Apart from the above techniques, several other stemming techniques are in use. The design of a stemmer is language specific. A very simple stemming algorithm removes suffixes using a suffix list given by a suffix table.

II. RELATED WORK

Martin Porter developed the well-known Porter Stemmer algorithm for English. The Porter stemmer uses the fact that English suffixes are mostly combinations of smaller and simpler suffixes. Porter designed the algorithm as a rule-based procedure for English consisting of five steps. There are other stemming algorithms for English, such as Paice/Husk, Lovins, Dawson, and Krovetz. Stemmers have also been developed for Indian languages such as Hindi, Marathi, Bengali and Punjabi. To the best of the authors' knowledge, this work represents the first published effort to develop a stemmer for Odia.

III. STEMMING ALGORITHM FOR ODIA

The Odia language has a strong inflectional system, which can be classified into nominal inflection and verb inflection. Here we represent the rules using Panini Grammar.

Noun inflection: For example here stem and suffix is. The details of the nominal suffixes are given below.
(Table 1) Vol 1 Issue 1 Aug

Table 1: List of Nominal Suffixes in Odia
Column headings: Inflection; Singular; Plural; Case-Relationship; NonCase-Relationship
Rows: 1st Inflection (Subjective); 2nd Inflection (Objective); 3rd Inflection (Instrumental); 4th Inflection (Dative); 5th Inflection (Ablative); 6th Inflection (Genitive); 7th Inflection (Locative)

Verb inflection: For example ଖ ଉଛ here stem and suffix is. The details of the verbal suffixes are given below (Table II).

Table 2: Odia Verbal Suffix
Column headings: Tense; Person; Singular Suffix; Plural Suffix
Rows: Present Tense; Past Tense; Future Tense

The rules are as follows. For nominal suffixes: Rule 1a (v+c), Rule 1b, Rule 2 (honorific), Rule 3, Rule 4, Rule 5, Rule 6, Rule 7.

Similarly, for verbal suffix removal we can refer to Table II. In addition, we find some suffixes which are not in the list (Table II). For example: = +
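A table-driven suffix stripper of the kind these rules and tables describe might be sketched as follows. This is only a minimal illustration under stated assumptions: the romanized suffixes in `NOMINAL_SUFFIXES` are hypothetical placeholders and are not taken from Table 1 or Table II, and the function name `strip_suffix` and the `min_stem_len` guard are likewise assumptions introduced for the example.

```python
# Hypothetical romanized suffix list for illustration only;
# these entries are NOT the contents of the paper's Table 1 or Table II.
NOMINAL_SUFFIXES = ["mananku", "mane", "ku", "re", "ra"]


def strip_suffix(word, suffixes, min_stem_len=2):
    """Remove the longest matching suffix from word.

    Returns (stem, removed_suffix). If no suffix matches, or stripping
    would leave fewer than min_stem_len characters, the word is returned
    unchanged with an empty suffix.
    """
    # Try longer suffixes first so that "mananku" wins over "ku".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)], suffix
    return word, ""


if __name__ == "__main__":
    for w in ["pilamananku", "gharaku", "ku"]:
        print(w, "->", strip_suffix(w, NOMINAL_SUFFIXES))
```

Trying the longest suffix first is one common design choice for rule lists like this; it avoids removing a short suffix that is only the tail of a longer one. A guard on the remaining stem length is a simple, assumed safeguard against stripping a word that is itself identical to a suffix.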

The suffix stripping algorithm is as follows:

Step 1: Input a word.
Step 2: Remove the suffixes (listed in Tables I and II) and find the stem.

Consider the word, in this word the suffix is (objective and plural marker). When the word is parsed by the FSA, the last suffix is identified. It triggers a transition to the same state, and this suffix is stripped from the current word. The remaining word is. Whenever a transition is triggered by a suffix, that suffix is stripped from the word and the required orthographic corrections are made. By iterating these steps we obtain the stem after the removal of all suffixes.

In Odia we find some prefixes which attach only to nouns. There are 20 prefixes of this type in Odia. These are basically from Sanskrit. They are as follows:

Table 3: List of Odia Prefixes

Along with these we find some local prefixes used in Odia. They are,,,, etc.

Table 4: A sample state transition table for nominal suffix
Column headings: Current State (Q); Input Symbol; (Q, σ): Transition State

Figure 1: State transition (part) diagram for verbal suffix

Table 5: A sample state transition table for verbal suffix
Column headings: Current State (Q); Input Symbol; (Q, σ): Transition State

Figure 2: State transition (part) diagram for nominal suffix (states q0, q1, q2, q4)

IV. CONCLUSION

We have presented a stemmer for Odia, a morphologically rich language, using a Finite State Transducer (FST), as the word formation is strictly based on the rules of morphology. It performs with an accuracy of 88 %.
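The iterative procedure described in Section III (identify the last suffix, follow the transition it triggers, strip it, repeat) could be sketched roughly as below. This is an assumption-laden illustration rather than the authors' transducer: the state names follow the labels visible in the figures, but every entry in `TRANSITIONS`, the romanized suffixes, the function name `stem`, and the `min_stem_len` guard are hypothetical. The orthographic corrections mentioned in the text are not modelled.

```python
# Hypothetical transition table: (current state, suffix) -> next state.
# The entries are placeholders, NOT the paper's Table 4 or Table 5.
TRANSITIONS = {
    ("q0", "ku"): "q1",     # assumed case marker
    ("q0", "re"): "q1",     # assumed case marker
    ("q0", "mane"): "q2",   # assumed plural marker
    ("q1", "manan"): "q2",  # assumed plural marker preceding a case marker
}


def stem(word, transitions=TRANSITIONS, start="q0", min_stem_len=2):
    """Strip suffixes right to left while a transition exists.

    Returns (stem, stripped_suffixes) with suffixes in removal order.
    Orthographic correction after stripping is deliberately omitted.
    """
    state = start
    stripped = []
    while True:
        # Collect every suffix that can fire from the current state.
        candidates = [
            (suffix, nxt)
            for (src, suffix), nxt in transitions.items()
            if src == state
            and word.endswith(suffix)
            and len(word) - len(suffix) >= min_stem_len
        ]
        if not candidates:
            return word, stripped
        # Prefer the longest matching suffix, then take its transition.
        suffix, state = max(candidates, key=lambda c: len(c[0]))
        word = word[: -len(suffix)]
        stripped.append(suffix)


if __name__ == "__main__":
    print(stem("pilamananku"))
    print(stem("pilamane"))
```

Keeping the transitions in a dictionary keyed by state and suffix makes the allowed suffix order explicit: in this sketch a plural marker may follow a stripped case marker, but no further stripping happens once state `q2` is reached.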
