Data-Driven Spell Checking: The Synergy of Two Algorithms for Spelling Error Detection and Correction

Eranga Jayalatharachchi, Asanka Wasala*, Ruvan Weerasinghe
University of Colombo School of Computing, 35, Reid Avenue, Colombo 00700, Sri Lanka
*Localisation Research Centre, CSIS Department, University of Limerick, Limerick, Ireland
{dej,arw}@ucsc.cmb.ac.lk, *asanka.wasala@ul.ie

Contents
1. Introduction
2. Background
   - Sinhala Language
   - Work on Indian Languages
   - Work on Sinhala
3. Methodology
   - Subasa v1
   - Subasa v2
4. Evaluation
5. Conclusions & Future Work
6. Demonstration

Introduction
- Spell checking: the task of identifying and flagging incorrectly spelled words in a document written in a natural language
- Spell correcting: the process of replacing the misspelled words with the most likely intended ones
- Applications: word processing, optical character recognition (OCR), character recognition, speech recognition, computer-aided language learning (CALL), etc.

Introduction: Misspelled Words
- Non-word errors: "It was teh wind"
- Real-word errors: "My sun is a doctor"
Automatic spelling error detection and correction (Kukich 1992):
1. Non-word error detection
2. Isolated-word error correction
3. Context-dependent error correction

Introduction
About 80% of all misspelled English words (non-word errors) in human typewritten text are due to single-error misspellings (Damerau 1964). For the intended word "the":
- ther: insertion
- teh: transposition
- th: deletion
- thw: substitution
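The four single-error types can be enumerated mechanically. The following Python sketch is illustrative only: the function name `single_edit_variants` and the lowercase ASCII alphabet are assumptions made for this example and do not come from the slides.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Group every string exactly one edit away from `word` by error type."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    # Insertion: one extra letter typed somewhere in the word.
    insertion = {left + ch + right for left, right in splits for ch in alphabet}
    # Deletion: one letter left out.
    deletion = {left + right[1:] for left, right in splits if right}
    # Transposition: two adjacent letters swapped.
    transposition = {left + right[1] + right[0] + right[2:]
                     for left, right in splits if len(right) > 1}
    transposition.discard(word)  # swapping two equal letters changes nothing
    # Substitution: one letter replaced by a different one.
    substitution = {left + ch + right[1:]
                    for left, right in splits if right
                    for ch in alphabet if ch != right[0]}

    return {
        "insertion": insertion,
        "deletion": deletion,
        "transposition": transposition,
        "substitution": substitution,
    }


variants = single_edit_variants("the")
print("ther" in variants["insertion"], "teh" in variants["transposition"])
```

Each misspelling of "the" listed on the slide falls into exactly the category the slide assigns to it.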

Introduction: Correction Techniques (Kukich 1992)
1. Minimum edit distance techniques
2. Similarity key techniques
3. Rule-based techniques
4. N-gram-based techniques
5. Probabilistic techniques
6. Neural nets

Introduction: Objective
To enhance Subasa, the only documented spell checker available to date for Sinhala (Wasala et al. 2010; Wasala et al. 2011).
- Subasa v1: n-gram
- Subasa v2: n-gram + edit distance

Introduction: N-grams
An n-gram is a sub-sequence of n items from a given sequence.
Word: intention
- Letter unigrams: i n t e n t i o n
- Letter bi-grams: in nt te en nt ti io on
- Letter tri-grams: int nte ten ent nti tio ion

Introduction: N-gram Generating Algorithm

function get_n_grams(word, n) returns n_grams_list
    l ← length(word) - n
    n_grams_list ← empty()
    for i from 0 to l do
        n_grams_list ← append(n_grams_list, substring(word, i, n))
    return n_grams_list
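A minimal Python rendering of the n-gram generating algorithm is sketched below. It treats every Python character as one item, which is an assumption: the slides do not say how Subasa segments Sinhala text, where combining vowel signs may call for a different unit.

```python
def get_n_grams(word, n):
    """Return all contiguous letter n-grams of `word`, in order of occurrence."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # The last n-gram starts at index len(word) - n; a word shorter than n
    # yields an empty list.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(get_n_grams("intention", 2))
print(get_n_grams("intention", 3))
```

For "intention" this reproduces the bi-grams and tri-grams listed on the N-grams slide.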

Introduction: Minimum Edit-Distance
The minimum number of editing operations required to transform one string into another (Wagner 1974):
- Insertions
- Deletions
- Substitutions

Introduction: Editing Operations
Cost of edit operations:
- Insertion = 1
- Deletion = 1
- Substitution = Deletion + Insertion = 1 + 1 = 2
Two ways of transforming "intention" into "execution":
- Letter-by-letter alignment: 5 substitutions; cost = 5 x 2 = 10
- Alignment with gaps: 1 deletion, 3 substitutions, 1 insertion; cost = 1 + (3 x 2) + 1 = 8

Introduction: Minimum Edit Distance Calculation Algorithm
A dynamic programming algorithm for minimum edit-distance computation creates an edit-distance matrix M with one column for each symbol in the target sequence and one row for each symbol in the source sequence.

function minimum_edit_distance(source, target) returns min_distance
    m ← length(source)
    n ← length(target)
    create distance matrix M[n+1, m+1]
    M[0,0] ← 0
    for each column i from 1 to n do
        M[i,0] ← M[i-1,0] + cost_insert(target_i)
    for each row j from 1 to m do
        M[0,j] ← M[0,j-1] + cost_delete(source_j)
    for each column i from 1 to n do
        for each row j from 1 to m do
            M[i,j] ← min( M[i-1,j] + cost_insert(target_i),
                          M[i-1,j-1] + cost_substitute(source_j, target_i),
                          M[i,j-1] + cost_delete(source_j) )
    min_distance ← M[n,m]
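The dynamic programming recurrence can be sketched in Python as follows. The cost defaults (insertion 1, deletion 1, substitution 2) follow the Editing Operations slide; treating a substitution of identical symbols as free is the usual convention and is assumed here, and the parameter names are chosen for this example.

```python
def minimum_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum cost of transforming `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j]: cost of turning the first j source symbols
    # into the first i target symbols.
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    # Building a target prefix from an empty source needs only insertions.
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + ins_cost
    # Reducing a source prefix to an empty target needs only deletions.
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + del_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[j - 1] == target[i - 1]
            dist[i][j] = min(
                dist[i - 1][j] + ins_cost,                        # insert target symbol
                dist[i - 1][j - 1] + (0 if same else sub_cost),   # substitute or copy
                dist[i][j - 1] + del_cost,                        # delete source symbol
            )
    return dist[n][m]


print(minimum_edit_distance("intention", "execution"))
```

With these costs the distance between "intention" and "execution" is 8, the cheaper of the two alignments shown on the Editing Operations slide.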

Introduction: Edit Distance Matrix
Rows: source "intention", read from bottom to top. Columns: target "execution".

source
  n    9   8   9  10  11  12  11  10   9   8
  o    8   7   8   9  10  11  10   9   8   9
  i    7   6   7   8   9  10   9   8   9  10
  t    6   5   6   7   8   9   8   9  10  11
  n    5   4   5   6   7   8   9  10  11  10
  e    4   3   4   5   6   7   8   9  10   9
  t    3   4   5   6   7   8   7   8   9   8
  n    2   3   4   5   6   7   8   7   8   7
  i    1   2   3   4   5   6   7   6   7   8
  #    0   1   2   3   4   5   6   7   8   9
       #   e   x   e   c   u   t   i   o   n   target

Each cell M[i,j] contains the minimum edit distance between the first i characters of the target and the first j characters of the source.

Background: Sinhala Language & Script
- Majority language of Sri Lanka
- Sinhala script is a derivative of Brahmi script
- Sinhala script is a syllabic script
- 5 pre-nasalized stops & 2 unique vowels (Nandasara, 2009)
- Sinhala is a phonetic language
- na-na-la-la dissension
- Conjunct letters

Background: Work on Indic Languages
- Non-word spelling correction for Assamese (Das et al. 2002)
  - Uses similarity-key and minimum edit distance techniques
- Rule-cum-dictionary-based approach for spell checking Malayalam (Santhosh et al. 2002)
- Spelling correction for Tamil (Dhanabalan et al. 2003)
  - Non-word error detection using simple dictionary lookups
- Spell checking for Bangla (Chaudhuri 2002)
  - An adaptation of a similarity-key-based technique

Background: Work on Sinhala Language
- Thibus
  - Commercial-grade
- Mozilla Firefox Extension (addons.mozilla.org)
  - Dictionary-based
- OpenOffice Extension (openoffice.org)
  - Uses Hunspell
- Microsoft Office Word 2007 (microsoft.com)
  - Via Language Interface Pack (LIP) for Sinhala
- Subasa (v1) (Wasala et al. 2009; Wasala et al. 2010)
  - N-gram based
  - Phonetic errors

Methodology: Subasa v1: The Process
[Diagram: with the phoneme class (k, c), the input word "kat" yields the candidates "kat" and "cat".]

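Methodology: Subasa v1: The Process (contd.)
[Diagram: the candidates are split into letter bi-grams and scored by bi-gram frequency.]
- Bi-gram frequencies: ka = 10, ca = 20, at = 5
- kat: ka, at = 10 + 5
- cat: ca, at = 20 + 5
- Selected word: cat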
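The worked example on this slide can be reproduced with a short Python sketch. It covers only what the diagram shows: candidates are generated by swapping letters within a phoneme class, each candidate is scored by the sum of its bi-gram frequencies, and the highest score wins. The function names are hypothetical, the frequencies are the toy values from the slide, and the actual Subasa implementation may differ (for example in how it uses tri-grams).

```python
from itertools import product

# Toy bi-gram counts and the single phoneme class from the slide.
BIGRAM_FREQ = {"ka": 10, "ca": 20, "at": 5}
PHONEME_CLASSES = [{"k", "c"}]


def candidates(word, classes=PHONEME_CLASSES):
    """All spellings obtained by swapping letters within their phoneme class."""
    options = []
    for ch in word:
        # Letters outside every class can only stand for themselves.
        cls = next((c for c in classes if ch in c), {ch})
        options.append(sorted(cls))
    return ["".join(combo) for combo in product(*options)]


def score(word, freq=BIGRAM_FREQ):
    """Sum of the frequencies of the word's letter bi-grams (unseen ones count 0)."""
    return sum(freq.get(word[i:i + 2], 0) for i in range(len(word) - 1))


def best_candidate(word):
    """Candidate spelling with the highest bi-gram score."""
    return max(candidates(word), key=score)


print(candidates("kat"), score("kat"), score("cat"), best_candidate("kat"))
```

With the slide's counts, "kat" scores 15 and "cat" scores 25, so "cat" is selected.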

Methodology: Subasa v1: Phoneme Classes
[Table of graphemes and their phoneme classes; the Sinhala graphemes are not recoverable.]
Phoneme classes: /k/, /g/, /tʃ/, /dʒ/, /ʈ/, /ɖ/, /t̪/, /d̪/, /p/, /b/, /n/, /l/, /s/ or /ʃ/, /ɲ/

Methodology: Subasa v1: Example
- UCSC Corpus, 10 Mn words
  - Word unigrams (440,021)
  - Letter bi-grams (46,878)
  - Letter tri-grams (166,460)
- Dictionary of Sinhala Spelling (Koparahewa 2006)

http://subasa.ambitiouslemon.com/

Methodology: Subasa v2: The Process

Methodology: Subasa v2: The Process: Edit Distance Module

Methodology: Subasa v2: Data
- UCSC Corpus, 10 Mn words
  - Word unigrams (440,021)
  - Letter bi-grams (46,878)
  - Letter tri-grams (166,460)
- Dictionary of Sinhala Spelling (Koparahewa 2006)
- Word unigrams (spell checked by Subasa v1)

Methodology: Subasa v2: New Phoneme Classes

http://subasa.ambitiouslemon.com/subasa2/

Evaluation
Compared with:
- Microsoft Word 2007
  - Sinhala Language Interface Pack 2007 for Microsoft Office
- OpenOffice.org 3.2 Writer
  - Based on Hunspell
- Subasa v1
  - Based on n-grams from the UCSC Corpus
- Manual inspection by a linguist
Test cases:
- Test 1: public Sinhala newspaper
- Test 2: Sinhala blog syndicator

Evaluation: Results, Test 1
6155 words from a public Sinhala newspaper (http://www.divaina.com/2010/10/28/)

Checker      Incorrect words detected   Correct words detected
Word         2830 (46%)                 3325 (54%)
Writer       1592 (26%)                 4563 (74%)
Subasa v1     255 (4%)                  5900 (96%)
Subasa v2     808 (13%)                 5347 (87%)
Manual       1055 (17%)                 5100 (83%)

Evaluation: Results, Test 2
4117 words extracted from a Sinhala blog syndicator (http://blogs.sinhalabloggers.com/)

Checker      Incorrect words detected   Correct words detected
Word         1979 (48%)                 2138 (52%)
Writer       1494 (36%)                 2623 (64%)
Subasa v1     353 (9%)                  3764 (91%)
Subasa v2     953 (23%)                 3164 (77%)
Manual       1047 (25%)                 3070 (74%)

Conclusions and Future Work: Conclusions
- Subasa v2 performs much closer to manual inspection than the other spell checkers tested
- N-gram + edit distance is better than the n-gram-only approach
- Data driven
- Good for languages with limited resources
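The first conclusion can be checked against the reported counts with a little arithmetic. The Python sketch below uses only the numbers of words flagged as incorrect in Test 1 and Test 2; the function names are chosen for this example. It compares aggregate counts only, so it does not show whether a checker flagged the same words as the linguist.

```python
# Words flagged as incorrect, as reported on the Test 1 and Test 2 result slides.
FLAGGED = {
    "Test 1": {"Word": 2830, "Writer": 1592, "Subasa v1": 255,
               "Subasa v2": 808, "Manual": 1055},
    "Test 2": {"Word": 1979, "Writer": 1494, "Subasa v1": 353,
               "Subasa v2": 953, "Manual": 1047},
}


def gap_to_manual(counts):
    """Absolute difference between each checker's count and the manual count."""
    manual = counts["Manual"]
    return {name: abs(n - manual) for name, n in counts.items() if name != "Manual"}


def closest_to_manual(counts):
    """Name of the checker whose flagged-word count is nearest the manual count."""
    gaps = gap_to_manual(counts)
    return min(gaps, key=gaps.get)


for test, counts in FLAGGED.items():
    print(test, gap_to_manual(counts), "closest:", closest_to_manual(counts))
```

In both tests Subasa v2 has the smallest gap (247 and 94 words), ahead of Writer, Subasa v1, and Word.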

Conclusions and Future Work: Future Work
- Larger dictionary
- Optimizations to the edit distance module
- Candidate correction ranking
- Word boundary analysis
- Morphological analysis

Demonstration
http://subasa.ambitiouslemon.com/
http://subasa.ambitiouslemon.com/subasa2/

Improved Detections: Subasa v1 vs. Subasa v2

Improved Corrections: Subasa v1 vs. Subasa v2