Data-Driven Spell Checking: The Synergy of Two Algorithms for Spelling Error Detection and Correction

Eranga Jayalatharachchi, Asanka Wasala*, Ruvan Weerasinghe
University of Colombo School of Computing, 35, Reid Avenue, Colombo 00700, Sri Lanka
*Localisation Research Centre, CSIS Department, University of Limerick, Limerick, Ireland
{dej,arw}@ucsc.cmb.ac.lk, *asanka.wasala@ul.ie

Contents
1. Introduction
2. Background
   Sinhala Language
   Work on Indian Languages
   Work on Sinhala
3. Methodology
   Subasa v1
   Subasa v2
4. Evaluation
5. Conclusions & Future Work
6. Demonstration

Introduction
Spell Checking: The task of identifying and flagging incorrectly spelled words in a document written in a natural language.
Spell Correcting: The process of replacing the misspelled words with the most likely intended ones.
Applications: Word processing, optical character recognition (OCR), character recognition, speech recognition, computer-aided language learning (CALL), etc.

Introduction
Misspelled Words
  Non-word errors: It was teh wind
  Real-word errors: My sun is a doctor
Automatic Spelling Error Detection and Correction (Kukich 1992):
  1. Non-word error detection
  2. Isolated word error correction
  3. Context-dependent error correction

Introduction
About 80% of all misspelled English words (non-word errors) in human typewritten text are due to single-error misspellings (Damerau 1964).
Single-error misspellings of the word: the
  ther: insertion
  teh: transposition
  th: deletion
  thw: substitution
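The four single-error types can be enumerated mechanically. The following is a minimal Python sketch, not part of Subasa: the function name single_edit_variants and the lowercase English alphabet are assumptions made for illustration only.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Return every string that is one insertion, deletion, substitution,
    or adjacent transposition away from `word` (hypothetical helper)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    insertions = {left + c + right for left, right in splits for c in alphabet}
    deletions = {left + right[1:] for left, right in splits if right}
    substitutions = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits if len(right) > 1
    }
    variants = insertions | deletions | substitutions | transpositions
    variants.discard(word)  # swapping two equal letters reproduces the word
    return variants


variants = single_edit_variants("the")
# The four misspellings from the slide are all single-error variants.
print(all(w in variants for w in ("ther", "teh", "th", "thw")))
```

A checker that only needs to cover single-error misspellings could intersect such a set with its dictionary, although the set grows with both the word length and the alphabet size.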

Introduction
Correction Techniques (Kukich 1992)
  1. Minimum edit distance techniques
  2. Similarity key techniques
  3. Rule-based techniques
  4. N-gram-based techniques
  5. Probabilistic techniques
  6. Neural nets

Introduction
Objective
To enhance Subasa, the only documented spell checker available to date for Sinhala (Wasala et al. 2010; Wasala et al. 2011)
  Subasa v1: n-gram
  Subasa v2: n-gram + edit distance

Introduction
N-grams
An n-gram is a sub-sequence of n items from a given sequence.
Word: intention
  Letter unigrams: i n t e n t i o n
  Letter bi-grams: in nt te en nt ti io on
  Letter tri-grams: int nte ten ent nti tio ion

Introduction
N-gram Generating Algorithm
function get_n_grams(word, n) returns n_grams_list
  l ← length(word) - n
  n_grams_list ← empty()
  for i from 0 to l do
    append substring(word, i, n) to n_grams_list
  return n_grams_list
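The n-gram generating algorithm translates almost directly into Python. This is a sketch rather than Subasa's actual code. It slices by Unicode code point, which is an assumption: a Sinhala letter with a vowel sign spans several code points, and how Subasa segments letters is not stated here.

```python
def get_n_grams(word, n):
    """Return the letter n-grams of `word` in order of occurrence."""
    # range() is empty when the word is shorter than n, so the result is [].
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(get_n_grams("intention", 2))
# ['in', 'nt', 'te', 'en', 'nt', 'ti', 'io', 'on']
print(get_n_grams("intention", 3))
# ['int', 'nte', 'ten', 'ent', 'nti', 'tio', 'ion']
```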

Introduction
Minimum Edit-Distance
Minimum number of editing operations required to transform one string into another (Wagner 1974):
  Insertions
  Deletions
  Substitutions

Introduction
Editing Operations
Cost of Edit Operations
  Insertion = 1
  Deletion = 1
  Substitution = Deletion + Insertion = 1 + 1 = 2
Two ways of transforming the source into the target:
  source: i n t e n t i o n
  target: e x e c u t i o n
  Letter-by-letter alignment: 5 Substitutions, Cost = 5 x 2 = 10
  Alignment with gaps: 1 Deletion, 3 Substitutions, 1 Insertion, Cost = 1 + (3 x 2) + 1 = 8

Introduction
Minimum Edit Distance Calculation Algorithm
A dynamic programming algorithm for minimum edit-distance computation creates an edit-distance matrix M with one column for each symbol in the target sequence and one row for each symbol in the source sequence.
function minimum_edit_distance(source, target) returns min_distance
  m ← length(source)
  n ← length(target)
  create distance matrix M[n+1, m+1]
  M[0,0] ← 0
  for each column i from 1 to n do
    M[i,0] ← M[i-1,0] + cost_insert(target_i)
  for each row j from 1 to m do
    M[0,j] ← M[0,j-1] + cost_delete(source_j)
  for each column i from 1 to n do
    for each row j from 1 to m do
      M[i,j] ← min( M[i-1,j] + cost_insert(target_i),
                    M[i-1,j-1] + cost_substitute(source_j, target_i),
                    M[i,j-1] + cost_delete(source_j) )
  min_distance ← M[n,m]
Here cost_substitute(source_j, target_i) is 0 when the two symbols are identical.
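The dynamic programming computation can be sketched in Python as follows. This is an illustration rather than the Subasa implementation. It uses the costs given on the slides (insertion 1, deletion 1, substitution 2), and for readability it indexes the matrix source-first, as dist[i][j], which is the transpose of the M[i,j] convention used on the slide.

```python
def minimum_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum cost of transforming `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j]: cost of turning the first i source characters
    # into the first j target characters.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + del_cost  # delete every source character
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost  # insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,
                dist[i][j - 1] + ins_cost,
                dist[i - 1][j - 1] + (0 if same else sub_cost),
            )
    return dist[m][n]


print(minimum_edit_distance("intention", "execution"))  # 8
```

With sub_cost=1 the same function computes the classic Levenshtein distance, which is a common alternative cost model.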

Introduction
Edit Distance Matrix
source
  n    9   8   9  10  11  12  11  10   9   8
  o    8   7   8   9  10  11  10   9   8   9
  i    7   6   7   8   9  10   9   8   9  10
  t    6   5   6   7   8   9   8   9  10  11
  n    5   4   5   6   7   8   9  10  11  10
  e    4   3   4   5   6   7   8   9  10   9
  t    3   4   5   6   7   8   7   8   9   8
  n    2   3   4   5   6   7   8   7   8   7
  i    1   2   3   4   5   6   7   6   7   8
  #    0   1   2   3   4   5   6   7   8   9
       #   e   x   e   c   u   t   i   o   n   target
Each cell M[i,j] contains the minimum edit distance between the first i characters of the target and the first j characters of the source.


Background
Sinhala Language & Script
  Majority language of Sri Lanka
  Sinhala script is a derivative of Brahmi script
  Sinhala script is a syllabic script
  5 pre-nasalized stops & 2 unique vowels (Nandasara, 2009)
  Sinhala is a phonetic language
  na-na-la-la dissention
  Conjunct letters

Background
Work on Indic Languages
  Non-word spelling correction for Assamese (Das et al. 2002)
    Uses similarity-key and minimum edit distance techniques
  Rule cum Dictionary based approach for spell checking Malayalam (Santhosh et al. 2002)
  Spelling correction for Tamil (Dhanabalan et al. 2003)
    Non-word error detection using simple dictionary lookups
  Spell checking for Bangla (Chaudhuri 2002)
    An adaptation of similarity key based technique

Background
Work on Sinhala Language
  Thibus
    Commercial-grade
  Mozilla Firefox Extension (addons.mozilla.org)
    Dictionary-based
  OpenOffice Extension (openoffice.org)
    Uses Hunspell
  Microsoft Office Word 2007 (microsoft.com)
    Via Language Interface Pack (LIP) for Sinhala
  Subasa (v1) (Wasala et al. 2009; Wasala et al. 2010)
    N-gram based
    Phonetic errors

Methodology: Subasa v1
The Process
  Input word: kat
  Phoneme class: (k, c)
  Candidate words: kat, cat

Methodology: Subasa v1
The Process (contd.)
  Candidate words: kat, cat
  Letter bi-grams: kat = ka, at; cat = ca, at
  Bi-gram frequencies: ka = 10, ca = 20, at = 5
  Scores: kat = 10 + 5; cat = 20 + 5
  Selected word: cat
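The kat/cat example suggests that candidates are generated by swapping letters within a phoneme class and then ranked by the summed corpus frequency of their letter bi-grams. The Python sketch below works under that assumption. The frequencies are the toy values from the example, and the names PHONEME_CLASSES, candidates, score and best_candidate are hypothetical, not Subasa's API.

```python
# Toy data taken from the kat/cat example; real counts would come from a corpus.
PHONEME_CLASSES = [{"k", "c"}]
BIGRAM_FREQ = {"ka": 10, "ca": 20, "at": 5}


def candidates(word, classes=PHONEME_CLASSES):
    """Words obtained by replacing one letter with a member of its phoneme class."""
    result = {word}
    for i, letter in enumerate(word):
        for cls in classes:
            if letter in cls:
                result.update(word[:i] + other + word[i + 1:] for other in cls)
    return result


def score(word, freq=BIGRAM_FREQ):
    """Sum of the frequencies of the word's letter bi-grams (unseen ones count 0)."""
    return sum(freq.get(word[i:i + 2], 0) for i in range(len(word) - 1))


def best_candidate(word):
    """Highest-scoring candidate for `word`."""
    return max(candidates(word), key=score)


print(score("kat"), score("cat"))  # 15 25
print(best_candidate("kat"))       # cat
```

Summing raw counts favours longer words; whether Subasa normalises the score is not stated in the slides.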

Methodology: Subasa v1 Phoneme Classes Graphemes Phoneme class, /k/, /g/, /tʃ/, /dʒ/, /ʈ/, /ɖ/, /t / Graphemes Phoneme class, /d /, /p/, /b/, /n/, /l/,, /s/ or /ʃ/, /ɲ/

Methodology: Subasa v1
Example
  UCSC Corpus: 10 Mn Words
    Word Unigrams (440,021)
    Letter bi-grams (46,878)
    Letter tri-grams (166,460)
  Dictionary of Sinhala Spelling (Koparahewa 2006)

http://subasa.ambitiouslemon.com/

Methodology: Subasa v2
The Process

Methodology: Subasa v2
The Process: Edit Distance Module

Methodology: Subasa v2
Data
  UCSC Corpus: 10 Mn Words
    Word Unigrams (440,021)
    Letter bi-grams (46,878)
    Letter tri-grams (166,460)
  Dictionary of Sinhala Spelling (Koparahewa 2006)
  Word Unigrams (spell checked by Subasa v1)

Methodology: Subasa v2
New Phoneme Classes

http://subasa.ambitiouslemon.com/subasa2/

Evaluation
Compared with:
  Microsoft Word 2007
    Sinhala Language Interface Pack 2007 for Microsoft Office
  OpenOffice.org 3.2 Writer
    based on Hunspell
  Subasa v1
    based on n-grams from UCSC Corpus
  Manual
    Inspection by a linguist
Test cases
  Test 1: Public Sinhala Newspaper
  Test 2: Sinhala Blog Syndicator

Evaluation
Results: Test 1
6155 words from a Public Sinhala Newspaper
http://www.divaina.com/2010/10/28/

             Incorrect Words Detected    Correct Words Detected
  Word       2830 (46%)                  3325 (54%)
  Writer     1592 (26%)                  4563 (74%)
  Subasa v1   255 (4%)                   5900 (96%)
  Subasa v2   808 (13%)                  5347 (87%)
  Manual     1055 (17%)                  5100 (83%)

Evaluation
Results: Test 2
4117 words extracted from a Sinhala blog syndicator
http://blogs.sinhalabloggers.com/

             Incorrect Words Detected    Correct Words Detected
  Word       1979 (48%)                  2138 (52%)
  Writer     1494 (36%)                  2623 (64%)
  Subasa v1   353 (9%)                   3764 (91%)
  Subasa v2   953 (23%)                  3164 (77%)
  Manual     1047 (25%)                  3070 (75%)

Conclusions and Future Work
Conclusions
  In both tests, Subasa v2's detection counts are the closest to manual inspection among the spell checkers compared
  N-gram + edit distance is better than the n-gram-only approach
  Data driven
  Good for languages with limited resources
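The closeness to manual inspection can be checked from the reported counts of words flagged as incorrect. The Python sketch below does that arithmetic. The names RESULTS, gap_to_manual and closest_to_manual are illustrative. The comparison covers only how many words each checker flagged, not whether they were the same words the linguist flagged, since the slides do not report that.

```python
# Words flagged as incorrect in each test, as reported in the Evaluation slides.
RESULTS = {
    "Test 1": {"total": 6155, "Word": 2830, "Writer": 1592,
               "Subasa v1": 255, "Subasa v2": 808, "Manual": 1055},
    "Test 2": {"total": 4117, "Word": 1979, "Writer": 1494,
               "Subasa v1": 353, "Subasa v2": 953, "Manual": 1047},
}


def gap_to_manual(test):
    """Percentage-point gap between each checker's flagged rate and the manual rate."""
    data = RESULTS[test]
    manual = data["Manual"]
    return {
        name: round(100 * abs(count - manual) / data["total"], 1)
        for name, count in data.items()
        if name not in ("total", "Manual")
    }


def closest_to_manual(test):
    """Name of the checker whose flagged count is nearest the manual count."""
    gaps = gap_to_manual(test)
    return min(gaps, key=gaps.get)


for test in RESULTS:
    print(test, gap_to_manual(test), closest_to_manual(test))
```

On both tests the smallest gap belongs to Subasa v2 (about 4 percentage points on Test 1 and about 2 on Test 2).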

Conclusions and Future Work
Future Work
  Larger dictionary
  Optimizations to Edit Distance module
  Candidate correction ranking
  Word boundary analysis
  Morphological analysis

Demonstration
  Subasa v1: http://subasa.ambitiouslemon.com/
  Subasa v2: http://subasa.ambitiouslemon.com/subasa2/

Improved Detections: Subasa v1, Subasa v2

Improved Corrections: Subasa v1, Subasa v2