Database search improvement with Nota's Book Catalogue as a case study

Esther Sebastián Liso
IT & Cognition, University of Copenhagen
zxv330@alumni.ku.dk
February 11, 2015

Abstract

Nota is Denmark's oldest library for people with reading difficulties, especially dyslexics and blind people. However, the search engine on Nota's website is not adapted to the needs of its users: Nota's website statistics show that many queries do not lead users to the books they are looking for. The current project analyses the spelling errors of Nota's users and tries to find patterns in these errors, many of which arise because a large share of the users are dyslexic. Furthermore, it gives some suggestions for adapting Nota's search engine to the challenges dyslexics face.

1 Introduction

Nota is a governmental digital library financed by the Ministry of Culture in Denmark. Nota produces audio books and e-books, which can be fetched on-line. In Denmark there is no registry of disabled people, so it is not possible to know exactly how many blind, severely visually impaired and dyslexic people there are. It is estimated by Nota and the Ministry that there are about severely visually impaired people in Denmark, most of them older than 70 years [1]. In addition, it is assumed that Danes are dyslexic [2], mostly young people.

Nota has a website with the full catalogue of books, which can be downloaded or listened to on-line. This catalogue is called e17 [3], and there we can find a normal search engine and an advanced search engine to look for specific books. Although people with reading difficulties have problems spelling words, Nota's on-line search engine is not good at recognising misspelled words, so it is very difficult for users to find the books they are looking for [4]. The quality of users' searches varies a lot, from very specific queries (e.g. "Psykiske sygdomme og problemer hos børn og unge", "Gyldendal og Politikens Danmarkshistorie: Da Danmark blev Danmark (3)" or "Kultur og etnicitet på arbejde: professionelt arbejde i det flerkulturelle samfund") to nonsense strings (e.g. "zfb43qns", "rtf" or "Isleifsun"). What is very prominent, however, is the quantity of spelling mistakes (e.g. for "vild med dansk" we find "vil med dansk", "vild med dans", "vild med danske" and "vild men dansk").

In order to enable users to find the books they desire, it is necessary to process their searches and normalise the words that make the search noisy (ill-formed words) so that they correspond to words that can be found (canonical words). Since the ill-formed words vary a lot in their nature, it is not easy to normalise them. A great proportion of Nota's users are dyslexic; we will therefore focus on normalising words that are unintentionally misspelled, as a spell-checker would do [5]. Damerau (1964) suggested that most of the errors (80%) that dyslexics make are simple errors, in the sense that they are just one edit away from their canonical form. However, a more recent study suggests that the share of spelling mistakes that differ by just one letter may be lower (53%) (Pedler, 2007). My objective is to find the ill-formed words in users' searches and convert them into canonical forms so that the users can find the books they wanted. Since we are dealing with an on-line book catalogue, we will also find many out-of-vocabulary (OOV) words, such as proper names (e.g. "Harry Potter" or "Einar Már Gudmundsson") or made-up words (e.g. "Silmarillion").

[1] Nota's blind members are about
[2] Nota has more than dyslexic members
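Damerau's notion of a "simple error" corresponds to an edit distance of one, where the allowed operations are substitution, insertion, omission and transposition of adjacent letters. As a small illustrative sketch (not code from this project), the optimal-string-alignment variant of the Damerau-Levenshtein distance can be computed as follows:

```python
def dl_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, omissions and
    adjacent transpositions needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # omission
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(dl_distance("vil med dansk", "vild med dansk"))  # 1: one omitted letter
print(dl_distance("poilti", "politi"))                 # 1: one transposition
```

Under this measure, Damerau's "simple errors" are exactly the queries at distance 1 from a catalogue word.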
2 Related Work

Different approaches to text normalisation have been used to convert ill-formed words into their canonical forms. In 1948, Shannon developed the noisy channel model; more recently, Brill and Moore (2000), among others, generated a list of corrections for every misspelled word by slicing the strings into different parts, ranking the words in those lists by their posterior probabilities of the form P(T|S). Spell-checking has not only been used to recover misspelled words in general, as Brill and Moore (2000) did; it has also been used to recover specifically dyslexic misspellings (Pedler, 2007).

An aspect we need to take into consideration when finding misspelled words in our data set is that they will often be hapaxes [6], so it is not very probable that the same mistakes will be repeated in the future. Thus, having a human annotate a certain amount of words (even a large amount), or keeping a list of corrections with probabilities, might not help us find the correct word. According to Damerau (1964) and Pedler (2007), people who suffer from dyslexia often make the same types of spelling mistakes. These errors can be classified as follows: substitution (e.g. "puramide", "japannk"); insertion (e.g. "sidste timme" or "matrematik"); omission (e.g. "fisk ri" and "muslimske rel gion"); and transposition (e.g. "salg og servcie" or "poilti").

In recent years, several authors, including Cook and Stevenson (2009) and Choudhury et al. (2007), have applied hidden Markov model state transitions and emissions to the task of normalisation. However, since in most cases we are looking at single words rather than sentences, we will not apply a hidden Markov model for this purpose.

We will follow the experimental set-up of Han et al. (2011). We will conduct a normalisation task on single words and inspect the distribution of errors in the data set we have selected from Nota's search logs. We will detect all ill-formed words and generate a confusion set with different numbers of candidates for every OOV word. Finally, we will select one of these candidates as the word that the person most probably intended to spell. Han et al. (2011) found that word similarity achieved higher precision and recall than context support, which supports the idea of Damerau (1964) and Pedler (2007) that most of the errors stem from morphophonemic variations, namely substitution, insertion, omission and transposition, among others. Han et al. (2011) also found that the best combination for normalising words was dictionary lookup in the first place, followed by word-similarity transformations and context support. However, due to the nature of our data set, we will not take context support into account, since many words are proper names or have no context at all. Furthermore, Han et al. (2011) found many limitations when handling context features, since the context tended to be very noisy and it was not possible to extract information from it to normalise ill-formed words.

We will also apply the method of Han et al. (2012): using a lexical normalisation dictionary with canonical forms. This dictionary will be constructed from the titles and authors of Nota's book catalogue, which contains all the possible words users can find through Nota's search engine. The dictionary will be used as a reference to detect ill-formed words, by checking whether the words generated in the confusion set for every ill-formed word are in the normalisation dictionary. For this task, we will only select as possible candidates the words that are in the normalisation dictionary, not all generated words from the confusion set.

[4] Almost 75% of the searches do not lead to a book, according to Nota's statistics.
[5] Many authors have developed spell-checkers, such as Peterson (1980) or Jurafsky and Martin (2009).
[6] In other words, we will find these words just once in our corpus.

3 Method

The method we applied to normalise ill-formed words and find their canonical forms consists of the following steps. Firstly, we construct a lexical normalisation dictionary with canonical forms, using Nota's book catalogue (titles and authors) as a reference. Secondly, we extract all the keywords that users typed into Nota's search engine in order to construct a data set of words to be spell-checked. Then, we look for the OOV words by comparing the instances in our data set with those in the normalisation dictionary; the words that are not in the normalisation dictionary form the ill-formed group. For these words, we generate a confusion set by adding all possible combinations following the most common spelling mistakes made by people who suffer from dyslexia (transposition, insertion, omission and substitution of one or two letters in the same word). Finally, we select the best candidate by 1) taking into account only those candidates that are in the normalisation dictionary and 2) selecting the candidate that has the maximum probability of being the canonical form of the ill-formed word.
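The steps above can be sketched in a few lines of Python. The code below is an illustrative reconstruction, not the project's actual implementation: it generates one-edit candidates in the style of Norvig's spell-checker (cited under the online sources) over an assumed Danish alphabet, keeps only candidates found in the normalisation dictionary, and picks one with max():

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"  # assumed Danish alphabet

def edits1(word):
    """All strings one omission, transposition, substitution or
    insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    """Return the keyword itself if known, otherwise the best
    in-dictionary candidate from its confusion set."""
    if word in dictionary:
        return word
    candidates = edits1(word) & dictionary
    return max(candidates) if candidates else word

dictionary = {"vild", "med", "dansk", "politi"}
print(correct("poilti", dictionary))  # "politi", one transposition away
```

Note that max() over strings simply returns the lexicographically largest candidate, mirroring the paper's selection step; a frequency- or probability-based key would be a natural refinement.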
3.1 Construction of the lexical normalisation dictionary

Even though it has been argued that dictionary-lookup approaches to normalisation show high precision but low recall (Han and Baldwin, 2011; Han et al., 2012; among others), we used this method, since our dictionary already contains the full set of canonical words. The reason for this choice is that all the possible searches that will lead to a book should appear in the lexical normalisation dictionary (i.e. Nota's book catalogue), and words that do not appear in Nota's database, even correctly spelled ones (false negatives), will also lead to an error in users' searches. If we consider as ill-formed all words that do not appear in Nota's book catalogue (even those that are not actually ill-formed), precision will be 100%. In addition, since every token instance of a given type is always normalised into the same word, dictionary lookup is said to be a type-based approach to normalisation (Cook and Stevenson, 2009; Han et al., 2012). This means that an ill-formed word will always be corrected into the same word. If we take into account that most of the errors in the data set are dyslexic and, therefore, unintentional errors [7], it should not be problematic that a string like "sgole" is in all contexts normalised into the word "skole" and never into "sole".

The input for the lexical normalisation dictionary comprises the titles and authors of the books in Nota's book catalogue. These titles and authors have been tokenised and lower-cased, and stop words have been excluded. The output is the word list used to check whether users' keywords in Nota's search engine are correctly spelled: if a word is correctly spelled, it will appear in the normalisation dictionary; if not, it will not be found there.

3.2 Construction of the data set

We extracted a corpus of keywords from Google Analytics [8]. The corpus includes searches from January 2014 to September. All keywords have been lower-cased and tokenised. Searches that contained the same words, for example "En flænge i himlen", were grouped [9], so that it was not necessary to normalise the same words several times. Figure 1 shows the frequency distribution of the 200 most common searches in our data set [10]. Most searches were, therefore, unique. This is why training a probabilistic system on our data set would not be very useful in this context, since most of the words would not be seen during training. Given the limited time available to process all the information, we randomly selected of the keywords to work with in the following steps.

3.2.1 Ill-formed word detection

Once the data set of user keywords is created, we need to identify which words are ill-formed and need to be normalised.
All words that are not in the dictionary are considered ill-formed. Furthermore, following Han and Baldwin (2011), we normalised only single-token words. In practice, this means that keywords like "vampyr kongen" (split words) or "småkillinger" (run-ons) cannot be normalised, since the error lies outside the word boundary [11].

[7] In contrast with abbreviations, words expressing sentiment, etc., which tend to be intentional and more difficult to normalise.
[9] There were actually no misspelled words in the most common searches.
[10] These searches took place at least twice.

Figure 1: Frequency distribution of the 200 most common searches among the keywords typed into Nota's search engine. Searches not shown in the graph had just one or two instances.

Figures from Google Analytics show that, in the same period in which the keywords were extracted, only 28.94% of users found a book in Nota's book catalogue [12] after their first search, and a further 19.63% found a book after their second search. This means that most users had not found the book they desired after two searches in Nota's search engine [13].

We compared the keywords from our data set with the words in the normalisation dictionary. Words that matched an entry in the dictionary were considered correctly spelled. Words without a corresponding entry could belong to two different categories: either 1) the word is spelled incorrectly, or 2) the search words are correctly spelled, but the user looked for a book that does not exist in Nota's book catalogue. In both cases, the words were analysed and a confusion set was generated for them.

[11] Following Pedler (2007), infractions of this kind make up only 8% of dyslexic errors.
[12] We cannot be sure whether they found the book they desired or just another book.
[13] We must also take into account that Nota offers the possibility to search for books through Google's search engine, which already gives suggestions to users when they misspell a word. This fact does increase the success rate of finding a book in the statistics of Google Analytics.
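The detection step reduces to set membership against the normalisation dictionary. A minimal sketch, assuming a toy stop-word list and hypothetical function names (the project's real stop-word list and tokeniser are not given in the text):

```python
STOP_WORDS = {"og", "i", "en", "et", "på", "af"}  # stand-in Danish stop words

def build_normalisation_dictionary(catalogue_entries):
    """Tokenise and lower-case titles/authors, dropping stop words."""
    dictionary = set()
    for entry in catalogue_entries:
        for token in entry.lower().split():
            if token not in STOP_WORDS:
                dictionary.add(token)
    return dictionary

def split_keywords(keywords, dictionary):
    """Separate user keywords into in-dictionary and ill-formed groups."""
    known = [w for w in keywords if w in dictionary]
    ill_formed = [w for w in keywords if w not in dictionary]
    return known, ill_formed

catalogue = ["En flænge i himlen", "Vild med dansk"]
dictionary = build_normalisation_dictionary(catalogue)
known, ill_formed = split_keywords(["flænge", "vildd", "dansk"], dictionary)
print(ill_formed)  # ['vildd']
```

As the text notes, a flagged keyword may be a genuine misspelling or a correctly spelled word for a book Nota does not hold; the lookup alone cannot tell these apart.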

3.2.2 Confusion set generation

The confusion set is a list of strings generated from an ill-formed keyword (by transposition, insertion, omission and substitution of one or two letters in the same word), whose aim is to find the correct form of the keyword so that it can be corrected. The generated strings lie at an edit distance of one or two characters from the given word. All possible combinations [14] are taken into consideration; nevertheless, only those candidates that are found in the dictionary are added to the confusion set. After one edit, many words had no candidate at all, meaning that none of the possible combinations belonged to the dictionary, so the word remained unchanged. However, almost half of the words had exactly one candidate, and some words had several, up to 15 [15]. For words with no candidates after one edit, we looked at an edit distance of two. This did not help much: most of them still had no candidates, and many words, especially short ones, had too many candidates (up to 142 in one case). The probability of picking the correct canonical form for such words was therefore very low.

3.2.3 Candidate selection

From the confusion set we need to select the best candidate, i.e. the word that will replace the ill-formed keyword and should be suggested to the user in the first place. This candidate should be the word the user intended to spell. When there is only one candidate, that word is suggested as the replacement for the original ill-formed word. When there is more than one candidate, we use the max() function, which "returns the largest item in an iterable or the largest of two or more arguments" [16]; the candidate it returns is selected and presented to the user.

[14] These combinations are created taking into account all letters of the alphabet; no numbers or special characters are included.
[15] For example, "bood" had the following candidates, all of which appeared in Nota's book catalogue: wood, good, bold, boyd, food, blod, blood, bodo, boo, bord, bod, book, hood, bond.
[16] Retrieved from on February 1st.

4 Analysis and Results

From the keywords that were taken into consideration for the analysis, were found in Nota's book catalogue and had no equivalents. These words were considered incorrect keywords, since their searches did not lead the users to a book. Of these incorrect words, words (44.39%) consisting of three or more letters could be corrected by the spell-checker, while words (55.61%) were still not found after two edits. However, many three-letter words were not as reliable as longer words. There were many examples of three-letter keywords with no apparent meaning (e.g. "vbk", "flr" or "sdf"). These were corrected and thus changed into other words (e.g. "vik", "for" and "adf"). Such corrections were counted as correct, since the resulting words were found in Nota's book catalogue; however, there is a high probability that these were not the words the users intended to spell. If we look at words of four letters or more, the results are slightly worse: only 39.56% of the words were corrected after one edit and 40.17% after two edits.

We can see in Table 1 that the F-score is not very good. However, if we compare it with Han and Baldwin's (2011) word-similarity and context-support results, our results are somewhat better. In contrast, compared with their dictionary-lookup results (or dictionary lookup combined with word similarity and context support), our results are worse.

              3+ letters                       4+ letters
              after 1 edit    after 2 edits    after 1 edit    after 2 edits
precision
recall
F-score

Table 1: Precision, recall and F-score depending on the length of the words taken into account and the number of edits.

We can see in Table 1 that the results do not improve much after two edits compared with words that were edited only once. In addition, the two-edit correction produced such a large number of possible candidates that many of the chosen words might be false positives. As we can see in Table 2, most of the errors (79.2%) were corrected by a replacement rule, where one letter of the word was replaced with another (e.g. "puramide", "harre potter", "filisofi" or "japannk").
There were also some examples of transpositions (16.6%), where two letters were interchanged (e.g. "salg og servcie", "poilti" or "natruvidenskab"), and not many examples of insertions or omissions (2.1% each), where users inserted an extra letter (e.g. "sidste timme", "civile retspleje", "anderkende" or "matrematik") or forgot a letter (e.g. "fisk ri", "muslimske rel gion", "samfun sfag" or "forskning metode").

It was not possible to find a correct word for more than half of the words (55.61%) that were considered ill-formed. However, looking more deeply into the kinds of mistakes users made, it was very common [17] for them to write a compound word as two separate words (e.g. "sundheds sociologi", "vampyr kongen" or "skygge porten"). Other keywords that could not be corrected were actually numbers (probably book IDs, which were not included in Nota's book catalogue).

type of error     number of corrected words     percentage of errors of this kind
substitution                                    79.2%
transposition                                   16.6%
insertion                                       2.1%
omission                                        2.1%

Table 2: Number and percentage of the most common dyslexic errors made by users of e17 when typing the books they were looking for into Nota's search engine.

A problem with this method was how to decide which candidate was the best when there were many. For example, we found the ill-formed word "ote", which does not appear in Nota's book catalogue. This word was corrected to "ode" by substitution of one letter. However, it can be discussed whether the user meant "ode" or rather "otte" [18], which was another of the possible candidates. In these cases the max() function preferred words derived by substitution over those derived by the other operations and, as we can see in Table 2, there are also more corrections made by substitution than by the other derivations.

5 Conclusions and discussion

I have proposed a method for Nota to improve the user experience when searching for books in Nota's book catalogue. The method consists of comparing users' keywords with the words in Nota's book catalogue and checking whether they are found there. If they are not found, suggestions can be offered based on the words that are just one edit away from a word in the database. An advantage of this method is that it does not require explicit annotations [19].

The most interesting conclusion of this project is that, since people who suffer from dyslexia often make the same kinds of mistakes, four processes suffice: 1) transposition of two letters, 2) omission, 3) insertion or 4) substitution of just one letter, since there were not many ill-formed words that were more than one edit away from their canonical form [20]. In addition, it is also worth checking whether there is an extra space between two words, in order to supplement the other four processes, since there were more mistakes of this kind than expected.

A problem with our method, similar to one experienced by Han et al. (2011), is that if correct out-of-vocabulary words are identified as ill-formed, the candidate selection step can never be carried out correctly. In other words, words like "aviser" that were not found in the dictionary, although they are feasible words, had different candidates, such as "avisen" or "viser", and were therefore corrected wrongly in all cases [21].

[17] There were errors of this type, which is 15.88% of the words that were not corrected, or 8.85% of the total errors.
[18] Derived by omission of one letter.
[19] The method needs to take into account that most of the words that were out-of-vocabulary, and therefore considered ill-formed, were existing titles and authors of books that Nota's catalogue did not hold in audio-book form. In fact, there were many very specialised books, both in English and Danish, in the users' searches; many of these books were not in Nota's book catalogue.
[20] The processing time for finding all possible candidates at an edit distance of two was around 12 hours, while looking for words only one edit away took just a couple of minutes.
[21] Although it could be the case that the user really meant, for example, "avisen".

References

[1] Cook, P. and Stevenson, S. (2009): An unsupervised model for text message normalization. In CALC '09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity.
[2] Han, B. and Baldwin, T. (2011): Lexical normalisation of short text messages: makn sens a #twitter. In HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
[3] Han, B., Cook, P. and Baldwin, T. (2012): Automatically constructing a normalisation dictionary for microblogs. In EMNLP-CoNLL 2012.
[4] Jurafsky, D. and Martin, J. H. (2009): Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second edition.
[5] Liu, F., Weng, F. and Jiang, X. (2012): A broad-coverage normalization system for social media language. In ACL '12: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1.
[6] MacNeilage, P. F. (1964): Typing errors as clues to serial ordering mechanisms in language behaviour. Language and Speech, 7.

[7] Pedler, J. (2007): Computer Correction of Real-word Spelling Errors in Dyslexic Text. Birkbeck, University of London.
[8] Peterson, J. L. (1980): Computer programs for detecting and correcting spelling errors. Communications of the ACM, 23.

Online sources

Peter Norvig's spell-checker:


More information

On generating large-scale ground truth datasets for the deduplication of bibliographic records

On generating large-scale ground truth datasets for the deduplication of bibliographic records On generating large-scale ground truth datasets for the deduplication of bibliographic records James A. Hammerton j_hammerton@yahoo.co.uk Michael Granitzer mgrani@know-center.at Maya Hristakeva maya.hristakeva@mendeley.com

More information

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination.

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination. GCE Computing COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination 2510 Summer 2014 Version: 1.0 Further copies of this Report are available from aqa.org.uk

More information

Towards Unsupervised Word Error Correction in Textual Big Data

Towards Unsupervised Word Error Correction in Textual Big Data Towards Unsupervised Word Error Correction in Textual Big Data Joao Paulo Carvalho 1 and Sérgio Curto 1 1 INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, Portugal

More information

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

Charles Darwin University Library Client Survey Report

Charles Darwin University Library Client Survey Report Charles Darwin University Library Client Survey Report May 2010 Insync Surveys Pty Ltd Melbourne Phone: +61 3 9909 9209 Fax: +61 3 9614 4460 Sydney Phone: +61 2 8081 2000 Fax: +61 2 9955 8929 Perth Phone:

More information

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

2014/02/13 Sphinx Lunch

2014/02/13 Sphinx Lunch 2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue

More information

Programming Exercises

Programming Exercises s CMPS 5P (Professor Theresa Migler-VonDollen ): Assignment #8 Problem 6 Problem 1 Programming Exercises Modify the recursive Fibonacci program given in the chapter so that it prints tracing information.

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier

More information

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services 21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

A Survey on Product Aspect Ranking

A Survey on Product Aspect Ranking A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,

More information

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 Starting a Python and an NLTK Session Open a Python 2.7 IDLE (Python GUI) window or a Python interpreter

More information

ANALYZING DATA USING TRANSANA SOFTWARE FOR INTERACTION IN COMPUTER SUPPORT FACE-TO-FACE COLLABORATIVE LEARNING (COSOFL) AMONG ESL PRE-SERVIVE TEACHER

ANALYZING DATA USING TRANSANA SOFTWARE FOR INTERACTION IN COMPUTER SUPPORT FACE-TO-FACE COLLABORATIVE LEARNING (COSOFL) AMONG ESL PRE-SERVIVE TEACHER 11 ANALYZING DATA USING TRANSANA SOFTWARE FOR INTERACTION IN COMPUTER SUPPORT FACE-TO-FACE COLLABORATIVE LEARNING (COSOFL) AMONG ESL PRE-SERVIVE TEACHER Abdul Rahim Hj Salam 1 Assoc. Prof Dr Zaidatun Tasir

More information

Design guide. Design denmark. Design guide Version 1.0 November 2014

Design guide. Design denmark. Design guide Version 1.0 November 2014 Design guide Design guide Version 1.0 November 2014 is an alliance of designers, design thinkers and design businesses, working Design guide About the new Design denmark visual identity The new visual

More information

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting Inf1-DA 2010 2011 III: 1 / 89 Informatics 1 School of Informatics, University of Edinburgh Data and Analysis Part III Unstructured Data Ian Stark February 2011 Inf1-DA 2010 2011 III: 2 / 89 Part III Unstructured

More information

Office of History. Using Code ZH Document Management System

Office of History. Using Code ZH Document Management System Office of History Document Management System Using Code ZH Document The ZH Document (ZH DMS) uses a set of integrated tools to satisfy the requirements for managing its archive of electronic documents.

More information

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi me@shahroozfarahmand.com

More information

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share.

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share. LING115 Lecture Note Session #4 Python (1) 1. Introduction As we have seen in previous sessions, we can use Linux shell commands to do simple text processing. We now know, for example, how to count words.

More information

Reputation Management System

Reputation Management System Reputation Management System Mihai Damaschin Matthijs Dorst Maria Gerontini Cihat Imamoglu Caroline Queva May, 2012 A brief introduction to TEX and L A TEX Abstract Chapter 1 Introduction Word-of-mouth

More information

Abstract. Description

Abstract. Description Project title: Bloodhound: Dynamic client-side autocompletion features for the Apache Bloodhound ticket system Name: Sifa Sensay Student e-mail: sifasensay@gmail.com Student Major: Software Engineering

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Unit 5.1 The Database Concept

Unit 5.1 The Database Concept Unit 5.1 The Database Concept Candidates should be able to: What is a Database? A database is a persistent, organised store of related data. Persistent Data and structures are maintained when data handling

More information

Electronic Document Management Using Inverted Files System

Electronic Document Management Using Inverted Files System EPJ Web of Conferences 68, 0 00 04 (2014) DOI: 10.1051/ epjconf/ 20146800004 C Owned by the authors, published by EDP Sciences, 2014 Electronic Document Management Using Inverted Files System Derwin Suhartono,

More information

Resolving Common Analytical Tasks in Text Databases

Resolving Common Analytical Tasks in Text Databases Resolving Common Analytical Tasks in Text Databases The work is funded by the Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD15010B. Database Systems and Text-based Information

More information

Text Processing (Business Professional)

Text Processing (Business Professional) Text Processing (Business Professional) Unit Title: Medical Audio-Transcription OCR unit number: 06995 Level: 2 Credit value: 5 Guided learning hours: 50 Unit reference number: A/505/7087 Unit aim This

More information

USERV Auto Insurance Rule Model in Corticon

USERV Auto Insurance Rule Model in Corticon USERV Auto Insurance Rule Model in Corticon Mike Parish Progress Software Contents Introduction... 3 Vocabulary... 4 Database Connectivity... 4 Overall Structure of the Decision... 6 Preferred Clients...

More information

GDP11 Student User s Guide. V. 1.7 December 2011

GDP11 Student User s Guide. V. 1.7 December 2011 GDP11 Student User s Guide V. 1.7 December 2011 Contents Getting Started with GDP11... 4 Program Structure... 4 Lessons... 4 Lessons Menu... 4 Navigation Bar... 5 Student Portfolio... 5 GDP Technical Requirements...

More information

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Twitter Stock Bot John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Hassaan Markhiani The University of Texas at Austin hassaan@cs.utexas.edu Abstract The stock market is influenced

More information

Learning Disabilities. Strategies for the classroom

Learning Disabilities. Strategies for the classroom Learning Disabilities Strategies for the classroom A learning disability is a neurological condition that interferes with a person s ability to store, process or produce information. Common Disabilities

More information

Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes

Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes Presented By: Andrew McMurry & Britt Fitch (Apache ctakes committers) Co-authors: Guergana Savova, Ben Reis,

More information

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and

More information

Innovation with the hyper library

Innovation with the hyper library Innovation with the hyper library Innovation is one of three top priorities of Aalborg University 1, as it appears in the development contract between Aalborg University and the Danish ministry of Science,

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Text Processing (Business Professional)

Text Processing (Business Professional) Unit Title: Audio-Transcription OCR unit number: 06976 Level: 2 Credit value: 4 Guided learning hours: 40 Unit reference number: F/505/7088 Unit aim Text Processing (Business Professional) This unit aims

More information

CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

More information

RRSS - Rating Reviews Support System purpose built for movies recommendation

RRSS - Rating Reviews Support System purpose built for movies recommendation RRSS - Rating Reviews Support System purpose built for movies recommendation Grzegorz Dziczkowski 1,2 and Katarzyna Wegrzyn-Wolska 1 1 Ecole Superieur d Ingenieurs en Informatique et Genie des Telecommunicatiom

More information

Research on Sentiment Classification of Chinese Micro Blog Based on

Research on Sentiment Classification of Chinese Micro Blog Based on Research on Sentiment Classification of Chinese Micro Blog Based on Machine Learning School of Economics and Management, Shenyang Ligong University, Shenyang, 110159, China E-mail: 8e8@163.com Abstract

More information

Taxonomies in Practice Welcome to the second decade of online taxonomy construction

Taxonomies in Practice Welcome to the second decade of online taxonomy construction Building a Taxonomy for Auto-classification by Wendi Pohs EDITOR S SUMMARY Taxonomies have expanded from browsing aids to the foundation for automatic classification. Early auto-classification methods

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

Introduction to Python

Introduction to Python WEEK ONE Introduction to Python Python is such a simple language to learn that we can throw away the manual and start with an example. Traditionally, the first program to write in any programming language

More information

The Design of a Proofreading Software Service

The Design of a Proofreading Software Service The Design of a Proofreading Software Service Raphael Mudge Automattic Washington, DC 20036 raffi@automattic.com Abstract Web applications have the opportunity to check spelling, style, and grammar using

More information

User Manual. Learning Management System COMSATS Virtual Campus

User Manual. Learning Management System COMSATS Virtual Campus User Manual Learning Management System COMSATS Virtual Campus Table of Contents Overview... 3 The LMS Home Screen... 4 The Main Menu bar... 4 1. LMS Home:... 4 2. About LMS:... 4 3. Contacts:... 4 4. Login

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

A New Web Site Startup Checklist aka Scott Fox s Twelve Step Program for Setting Up a New Web Site

A New Web Site Startup Checklist aka Scott Fox s Twelve Step Program for Setting Up a New Web Site INTERNET RICHES The Simple Money-making Secrets of Online Millionaires By Scott Fox American Management Association (AMACOM) - ISBN: 978-0814473563 A New Web Site Startup Checklist aka Scott Fox s Twelve

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

HPI in-memory-based database system in Task 2b of BioASQ

HPI in-memory-based database system in Task 2b of BioASQ CLEF 2014 Conference and Labs of the Evaluation Forum BioASQ workshop HPI in-memory-based database system in Task 2b of BioASQ Mariana Neves September 16th, 2014 Outline 2 Overview of participation Architecture

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Python Loops and String Manipulation

Python Loops and String Manipulation WEEK TWO Python Loops and String Manipulation Last week, we showed you some basic Python programming and gave you some intriguing problems to solve. But it is hard to do anything really exciting until

More information

Writing Style Guide Updated January 2015

Writing Style Guide Updated January 2015 Writing Style Guide Updated January 2015 1 Introduction The Wentworth Institute of Technology Writing Style Guide includes information not only on style rules particular to Wentworth, but also commonly

More information

How To Find Out What Political Sentiment Is On Twitter

How To Find Out What Political Sentiment Is On Twitter Predicting Elections with Twitter What 140 Characters Reveal about Political Sentiment Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, Isabell M. Welpe Workshop Election Forecasting 15 July 2013

More information

NetOwl(TM) Extractor Technical Overview March 1997

NetOwl(TM) Extractor Technical Overview March 1997 NetOwl(TM) Extractor Technical Overview March 1997 1 Overview NetOwl Extractor is an automatic indexing system that finds and classifies key phrases in text, such as personal names, corporate names, place

More information

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture A prototype infrastructure for D Spin Services based on a flexible multilayer architecture Volker Boehlke 1,, 1 NLP Group, Department of Computer Science, University of Leipzig, Johanisgasse 26, 04103

More information

Python Lists and Loops

Python Lists and Loops WEEK THREE Python Lists and Loops You ve made it to Week 3, well done! Most programs need to keep track of a list (or collection) of things (e.g. names) at one time or another, and this week we ll show

More information

Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text

Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Matthew Cooper FX Palo Alto Laboratory Palo Alto, CA 94034 USA cooper@fxpal.com ABSTRACT Video is becoming a prevalent medium

More information

DYNAMIC QUERY FORMS WITH NoSQL

DYNAMIC QUERY FORMS WITH NoSQL IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH

More information

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process

More information

RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI

RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI Prepared for Prof. Martin Zwick December 9, 2014 by Teresa D. Schmidt (tds@pdx.edu) 1. DOWNLOADING AND INSTALLING USER DEFINED SPLIT FUNCTION

More information

How To Use Gps Navigator On A Mobile Phone

How To Use Gps Navigator On A Mobile Phone Software Requirements Specification Amazing Lunch Indicator Sarah Geagea 881024-4940 Sheng Zhang 850820-4735 Niclas Sahlin 880314-5658 Faegheh Hasibi 870625-5166 Farhan Hameed 851007-9695 Elmira Rafiyan

More information

1 Which of the following questions can be answered using the goal flow report?

1 Which of the following questions can be answered using the goal flow report? 1 Which of the following questions can be answered using the goal flow report? [A] Are there a lot of unexpected exits from a step in the middle of my conversion funnel? [B] Do visitors usually start my

More information

8 Simple Things You Might Be Overlooking In Your AdWords Account. A WordStream Guide

8 Simple Things You Might Be Overlooking In Your AdWords Account. A WordStream Guide 8 Simple Things You Might Be Overlooking In Your AdWords Account A WordStream Guide 8 Simple Things You Might Be Overlooking In Your AdWords Account AdWords makes it incredibly easy to set up and run a

More information

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A Introduction to IR Systems: Supporting Boolean Text Search Chapter 27, Part A Database Management Systems, R. Ramakrishnan 1 Information Retrieval A research field traditionally separate from Databases

More information

Cross-lingual Synonymy Overlap

Cross-lingual Synonymy Overlap Cross-lingual Synonymy Overlap Anca Dinu 1, Liviu P. Dinu 2, Ana Sabina Uban 2 1 Faculty of Foreign Languages and Literatures, University of Bucharest 2 Faculty of Mathematics and Computer Science, University

More information

Anotaciones semánticas: unidades de busqueda del futuro?

Anotaciones semánticas: unidades de busqueda del futuro? Anotaciones semánticas: unidades de busqueda del futuro? Hugo Zaragoza, Yahoo! Research, Barcelona Jornadas MAVIR Madrid, Nov.07 Document Understanding Cartoon our work! Complexity of Document Understanding

More information

C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE SECONDARY EDUCATION CERTIFICATE EXAMINATION MAY/JUNE 2011

C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE SECONDARY EDUCATION CERTIFICATE EXAMINATION MAY/JUNE 2011 C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE SECONDARY EDUCATION CERTIFICATE EXAMINATION MAY/JUNE 2011 ECONOMICS GENERAL PROFICIENCY EXAMINATION Copyright 2011

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

Glossary of translation tool types

Glossary of translation tool types Glossary of translation tool types Tool type Description French equivalent Active terminology recognition tools Bilingual concordancers Active terminology recognition (ATR) tools automatically analyze

More information

Kaspersky Whitelisting Database Test

Kaspersky Whitelisting Database Test Kaspersky Whitelisting Database Test A test commissioned by Kaspersky Lab and performed by AV-Test GmbH Date of the report: February 14 th, 2013, last update: April 4 th, 2013 Summary During November 2012

More information