Database search improvement with Nota's Book Catalogue as a case study

Esther Sebastián Liso
IT & Cognition, University of Copenhagen
zxv330@alumni.ku.dk
February 11, 2015

Abstract

Nota is Denmark's oldest library for people with reading difficulties, especially dyslexics and blind people. However, the search engine on Nota's website is not adapted to the needs of its users: Nota's website statistics show that many queries do not lead users to the books they are looking for. The current project analyses the spelling errors of Nota's users and tries to find patterns in these errors, many of which arise because a large share of the users are dyslexic. Furthermore, it gives some suggestions for adapting Nota's search engine to the challenges dyslexics face.

1 Introduction

Nota is a governmental digital library financed by the Ministry of Culture in Denmark. Nota produces audio books and e-books, which can be fetched on-line. In Denmark there is no registry of disabled people, so it is not possible to know exactly how many blind, severely visually impaired and dyslexic people there are. It is estimated by Nota and the Ministry that there are about severely visually impaired people in Denmark, most of them older than 70 years [1]. In addition, it is assumed that Danes are dyslexic [2], mostly young people.

Nota has a website with the full catalogue of books, which can be downloaded or listened to on-line. This catalogue is called e17 [3], and there we can find a normal search engine and an advanced search engine to look for specific books. Although people with reading difficulties have problems spelling words, Nota's on-line search engine is not good at recognising misspelled words, so it is very difficult for users to find the books they are looking for [4]. The quality of users' searches varies a lot, from very specific queries (e.g. "Psykiske sygdomme og problemer hos børn og unge", "Gyldendal og Politikens Danmarkshistorie: Da Danmark blev Danmark (3)" or "Kultur og etnicitet på arbejde: professionelt arbejde i det flerkulturelle samfund") to nonsense strings (e.g. "zfb43qns", "rtf" or "Isleifsun"). What is very prominent, however, is the quantity of spelling mistakes (e.g. for "vild med dansk" we find "vil med dansk", "vild med dans", "vild med danske" and "vild men dansk").

In order to enable users to find the books they desire, it is necessary to process their searches and normalise the words that make the search noisy (ill-formed words) so that they correspond to words that can be found (canonical words). Since the ill-formed words vary a lot in their nature, it is not easy to normalise them. A great proportion of Nota's users are dyslexic; we will therefore focus on normalising words that are unintentionally misspelled, as a spell-checker would do [5]. Damerau (1964) suggested that most of the errors (80%) that dyslexics make are simple errors, in the sense that they are just one edit away from their canonical form. However, a more recent study suggests that the share of spelling mistakes that differ by just one letter may be lower (53%) (Pedler, 2007). My objective is to find the ill-formed words in users' searches and convert them into canonical forms so that the users can find the books they wanted. Since we are dealing with an on-line book catalogue, we will also find many out-of-vocabulary (OOV) words, such as proper names (e.g. "Harry Potter" or "Einar Már Gudmundsson") or made-up words (e.g. "Silmarillion").

[1] Nota's blind members are about
[2] Nota has more than dyslexic members
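Damerau's notion of a "simple error" corresponds to an edit distance of one, where the allowed operations are substitution, insertion, omission and transposition of adjacent letters. As a small illustrative sketch (not code from this project), the optimal-string-alignment variant of the Damerau-Levenshtein distance can be computed as follows:

```python
def dl_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, omissions and
    adjacent transpositions needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # omission
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(dl_distance("vil med dansk", "vild med dansk"))  # 1: one omitted letter
print(dl_distance("poilti", "politi"))                 # 1: one transposition
```

Under this measure, Damerau's "simple errors" are exactly the queries at distance 1 from a catalogue word.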
2 Related Work

Different approaches to text normalisation have been used to convert ill-formed words into their canonical forms. In 1948, Shannon developed the noisy channel model; more recently, Brill and Moore (2000), among others, generated a list of corrections for every misspelled word by slicing the strings into different parts, ranking the words in those lists by their posterior probabilities of the form P(T|S). Spell-checking has not only been used to recover misspelled words in general, as Brill and Moore (2000) did; it has also been used to recover specifically dyslexic misspellings (Pedler, 2007).

An aspect we need to take into consideration when finding misspelled words in our data set is that they will often be hapaxes [6], so it is not very probable that the same mistakes will be repeated in the future. Thus, having a human annotate a certain amount of words (even a large amount), or keeping a list of corrections with probabilities, might not help us find the correct word. According to Damerau (1964) and Pedler (2007), people who suffer from dyslexia often make the same types of spelling mistakes. These errors can be classified as follows: substitution (e.g. "puramide", "japannk"); insertion (e.g. "sidste timme" or "matrematik"); omission (e.g. "fisk ri" and "muslimske rel gion"); and transposition (e.g. "salg og servcie" or "poilti").

In recent years, several authors, including Cook and Stevenson (2009) and Choudhury et al. (2007), have applied hidden Markov model state transitions and emissions to the task of normalisation. However, since in most cases we are looking at single words rather than sentences, we will not apply a hidden Markov model for this purpose.

We will follow the experimental set-up of Han et al. (2011). We will conduct a normalisation task on single words and inspect the distribution of errors in the data set we have selected from Nota's search logs. We will detect all ill-formed words and generate a confusion set with different numbers of candidates for every OOV word. Finally, we will select one of these candidates as the word that the person most probably intended to spell. Han et al. (2011) found that word similarity achieved higher precision and recall than context support, which supports the idea of Damerau (1964) and Pedler (2007) that most of the errors stem from morphophonemic variations, namely substitution, insertion, omission and transposition, among others. Han et al. (2011) also found that the best combination for normalising words was dictionary lookup in the first place, followed by word-similarity transformations and context support. However, due to the nature of our data set, we will not take context support into account, since many words are proper names or have no context at all. Furthermore, Han et al. (2011) found many limitations when handling context features, since the context tended to be very noisy and it was not possible to extract information from it to normalise ill-formed words.

We will also apply the method of Han et al. (2012): using a lexical normalisation dictionary with canonical forms. This dictionary will be constructed from the titles and authors of Nota's book catalogue, which contains all the possible words users can find through Nota's search engine. The dictionary will be used as a reference to detect ill-formed words, by checking whether the words generated in the confusion set for every ill-formed word are in the normalisation dictionary. For this task, we will only select as possible candidates the words that are in the normalisation dictionary, not all generated words from the confusion set.

[4] Almost 75% of the searches do not lead to a book, according to Nota's statistics.
[5] Many authors have developed spell-checkers, such as Peterson (1980) or Jurafsky and Martin (2009).
[6] In other words, we will find these words just once in our corpus.

3 Method

The method we applied to normalise ill-formed words and find their canonical forms consists of the following steps. Firstly, we construct a lexical normalisation dictionary with canonical forms, using Nota's book catalogue (titles and authors) as a reference. Secondly, we extract all the keywords that users typed into Nota's search engine in order to construct a data set of words to be spell-checked. Then, we look for the OOV words by comparing the instances in our data set with those in the normalisation dictionary; the words that are not in the normalisation dictionary form the ill-formed group. For these words, we generate a confusion set by adding all possible combinations following the most common spelling mistakes made by people who suffer from dyslexia (transposition, insertion, omission and substitution of one or two letters in the same word). Finally, we select the best candidate by 1) taking into account only those candidates that are in the normalisation dictionary and 2) selecting the candidate that has the maximum probability of being the canonical form of the ill-formed word.
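The steps above can be sketched in a few lines of Python. The code below is an illustrative reconstruction, not the project's actual implementation: it generates one-edit candidates in the style of Norvig's spell-checker (cited under the online sources) over an assumed Danish alphabet, keeps only candidates found in the normalisation dictionary, and picks one with max():

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"  # assumed Danish alphabet

def edits1(word):
    """All strings one omission, transposition, substitution or
    insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    """Return the keyword itself if known, otherwise the best
    in-dictionary candidate from its confusion set."""
    if word in dictionary:
        return word
    candidates = edits1(word) & dictionary
    return max(candidates) if candidates else word

dictionary = {"vild", "med", "dansk", "politi"}
print(correct("poilti", dictionary))  # "politi", one transposition away
```

Note that max() over strings simply returns the lexicographically largest candidate, mirroring the paper's selection step; a frequency- or probability-based key would be a natural refinement.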
3.1 Construction of the lexical normalisation dictionary

Even though it has been argued that dictionary-lookup approaches to normalisation show high precision but low recall (Han and Baldwin, 2011; Han et al., 2012; among others), we used this method, since our dictionary already contains the full set of canonical words. The reason for this choice is that all the possible searches that will lead to a book should appear in the lexical normalisation dictionary (i.e. Nota's book catalogue), and words that do not appear in Nota's database, even correctly spelled ones (false negatives), will also lead to an error in users' searches. If we consider as ill-formed all words that do not appear in Nota's book catalogue (even those that are not actually ill-formed), precision will be 100%. In addition, since every token instance of a given type is always normalised into the same word, dictionary lookup is said to be a type-based approach to normalisation (Cook and Stevenson, 2009; Han et al., 2012). This means that an ill-formed word will always be corrected into the same word. If we take into account that most of the errors in the data set are dyslexic and, therefore, unintentional errors [7], it should not be problematic that a string like "sgole" is in all contexts normalised into the word "skole" and never into "sole".

The input for the lexical normalisation dictionary comprises the titles and authors of the books in Nota's book catalogue. These titles and authors have been tokenised and lower-cased, and stop words have been excluded. The output is the word list used to check whether users' keywords in Nota's search engine are correctly spelled: if a word is correctly spelled, it will appear in the normalisation dictionary; if not, it will not be found there.

3.2 Construction of the data set

We extracted a corpus of keywords from Google Analytics [8]. The corpus includes searches from January 2014 to September. All keywords have been lower-cased and tokenised. Searches that contained the same words, for example "En flænge i himlen", were grouped [9], so that it was not necessary to normalise the same words several times. Figure 1 shows the frequency distribution of the 200 most common searches in our data set [10]. Most searches were, therefore, unique. This is why training a probabilistic system on our data set would not be very useful in this context, since most of the words would not be seen during training. Given the limited time available to process all the information, we randomly selected of the keywords to work with in the following steps.

3.2.1 Ill-formed word detection

Once the data set of user keywords is created, we need to identify which words are ill-formed and need to be normalised.
All words that are not in the dictionary are considered ill-formed. Furthermore, following Han and Baldwin (2011), we normalised only single-token words. In practice, this means that keywords like "vampyr kongen" (split words) or "småkillinger" (run-ons) cannot be normalised, since the error lies outside the word boundary [11].

[7] In contrast with abbreviations, words expressing sentiment, etc., which tend to be intentional and more difficult to normalise.
[9] There were actually no misspelled words in the most common searches.
[10] These searches took place at least twice.

Figure 1: Frequency distribution of the 200 most common searches among the keywords typed into Nota's search engine. Searches not shown in the graph had just one or two instances.

Figures from Google Analytics show that, in the same period in which the keywords were extracted, only 28.94% of users found a book in Nota's book catalogue [12] after their first search, and a further 19.63% found a book after their second search. This means that most users had not found the book they desired after two searches in Nota's search engine [13].

We compared the keywords from our data set with the words in the normalisation dictionary. Words that matched an entry in the dictionary were considered correctly spelled. Words without a corresponding entry could belong to two different categories: either 1) the word is spelled incorrectly, or 2) the search words are correctly spelled, but the user looked for a book that does not exist in Nota's book catalogue. In both cases, the words were analysed and a confusion set was generated for them.

[11] Following Pedler (2007), infractions of this kind make up only 8% of dyslexic errors.
[12] We cannot be sure whether they found the book they desired or just another book.
[13] We must also take into account that Nota offers the possibility to search for books through Google's search engine, which already gives suggestions to users when they misspell a word. This fact does increase the success rate of finding a book in the statistics of Google Analytics.
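The detection step reduces to set membership against the normalisation dictionary. A minimal sketch, assuming a toy stop-word list and hypothetical function names (the project's real stop-word list and tokeniser are not given in the text):

```python
STOP_WORDS = {"og", "i", "en", "et", "på", "af"}  # stand-in Danish stop words

def build_normalisation_dictionary(catalogue_entries):
    """Tokenise and lower-case titles/authors, dropping stop words."""
    dictionary = set()
    for entry in catalogue_entries:
        for token in entry.lower().split():
            if token not in STOP_WORDS:
                dictionary.add(token)
    return dictionary

def split_keywords(keywords, dictionary):
    """Separate user keywords into in-dictionary and ill-formed groups."""
    known = [w for w in keywords if w in dictionary]
    ill_formed = [w for w in keywords if w not in dictionary]
    return known, ill_formed

catalogue = ["En flænge i himlen", "Vild med dansk"]
dictionary = build_normalisation_dictionary(catalogue)
known, ill_formed = split_keywords(["flænge", "vildd", "dansk"], dictionary)
print(ill_formed)  # ['vildd']
```

As the text notes, a flagged keyword may be a genuine misspelling or a correctly spelled word for a book Nota does not hold; the lookup alone cannot tell these apart.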

3.2.2 Confusion set generation

The confusion set is a list of strings generated from an ill-formed keyword (by transposition, insertion, omission and substitution of one or two letters in the same word), whose aim is to find the correct form of the keyword so that it can be corrected. The generated strings lie at an edit distance of one or two characters from the given word. All possible combinations [14] are taken into consideration; nevertheless, only those candidates that are found in the dictionary are added to the confusion set. After one edit, many words had no candidate at all, meaning that none of the possible combinations belonged to the dictionary, so the word remained unchanged. However, almost half of the words had exactly one candidate, and some words had several, up to 15 [15]. For words with no candidates after one edit, we looked at an edit distance of two. This did not help much: most of them still had no candidates, and many words, especially short ones, had too many candidates (up to 142 in one case). The probability of picking the correct canonical form for such words was therefore very low.

3.2.3 Candidate selection

From the confusion set we need to select the best candidate, i.e. the word that will replace the ill-formed keyword and should be suggested to the user in the first place. This candidate should be the word the user intended to spell. When there is only one candidate, that word is suggested as the replacement for the original ill-formed word. When there is more than one candidate, we use the max() function, which "returns the largest item in an iterable or the largest of two or more arguments" [16]; the candidate it returns is selected and presented to the user.

[14] These combinations are created taking into account all letters of the alphabet; no numbers or special characters are included.
[15] For example, "bood" had the following candidates, all of which appeared in Nota's book catalogue: wood, good, bold, boyd, food, blod, blood, bodo, boo, bord, bod, book, hood, bond.
[16] Retrieved from on February 1st.

4 Analysis and Results

From the keywords that were taken into consideration for the analysis, were found in Nota's book catalogue and had no equivalents. These words were considered incorrect keywords, since their searches did not lead the users to a book. Of these incorrect words, words (44.39%) consisting of three or more letters could be corrected by the spell-checker, while words (55.61%) were still not found after two edits. However, many three-letter words were not as reliable as longer words. There were many examples of three-letter keywords with no apparent meaning (e.g. "vbk", "flr" or "sdf"). These were corrected and thus changed into other words (e.g. "vik", "for" and "adf"). Such corrections were counted as correct, since the resulting words were found in Nota's book catalogue; however, there is a high probability that these were not the words the users intended to spell. If we look at words of four letters or more, the results are slightly worse: only 39.56% of the words were corrected after one edit and 40.17% after two edits.

We can see in Table 1 that the F-score is not very good. However, if we compare it with Han and Baldwin's (2011) word-similarity and context-support results, our results are somewhat better. In contrast, compared with their dictionary-lookup results (or dictionary lookup combined with word similarity and context support), our results are worse.

              3+ letters                       4+ letters
              after 1 edit    after 2 edits    after 1 edit    after 2 edits
precision
recall
F-score

Table 1: Precision, recall and F-score depending on the length of the words taken into account and the number of edits.

We can see in Table 1 that the results do not improve much after two edits compared with words that were edited only once. In addition, the two-edit correction produced such a large number of possible candidates that many of the chosen words might be false positives. As we can see in Table 2, most of the errors (79.2%) were corrected by a replacement rule, where one letter of the word was replaced with another (e.g. "puramide", "harre potter", "filisofi" or "japannk").
There were also some examples of transpositions (16.6%), where two letters were interchanged (e.g. "salg og servcie", "poilti" or "natruvidenskab"), and not many examples of insertions or omissions (2.1% each), where users inserted an extra letter (e.g. "sidste timme", "civile retspleje", "anderkende" or "matrematik") or forgot a letter (e.g. "fisk ri", "muslimske rel gion", "samfun sfag" or "forskning metode").

It was not possible to find a correct word for more than half of the words (55.61%) that were considered ill-formed. However, looking more deeply into the kinds of mistakes users made, it was very common [17] for them to write a compound word as two separate words (e.g. "sundheds sociologi", "vampyr kongen" or "skygge porten"). Other keywords that could not be corrected were actually numbers (probably book IDs, which were not included in Nota's book catalogue).

type of error     number of corrected words     percentage of errors of this kind
substitution                                    79.2%
transposition                                   16.6%
insertion                                       2.1%
omission                                        2.1%

Table 2: Number and percentage of the most common dyslexic errors made by users of e17 when typing the books they were looking for into Nota's search engine.

A problem with this method was how to decide which candidate was the best when there were many. For example, we found the ill-formed word "ote", which does not appear in Nota's book catalogue. This word was corrected to "ode" by substitution of one letter. However, it can be discussed whether the user meant "ode" or rather "otte" [18], which was another of the possible candidates. In these cases the max() function preferred words derived by substitution over those derived by the other operations and, as we can see in Table 2, there are also more corrections made by substitution than by the other derivations.

5 Conclusions and discussion

I have proposed a method for Nota to improve the user experience when searching for books in Nota's book catalogue. The method consists of comparing users' keywords with the words in Nota's book catalogue and checking whether they are found there. If they are not found, suggestions can be offered based on the words that are just one edit away from a word in the database. An advantage of this method is that it does not require explicit annotations [19].

The most interesting conclusion of this project is that, since people who suffer from dyslexia often make the same kinds of mistakes, four processes suffice: 1) transposition of two letters, 2) omission, 3) insertion or 4) substitution of just one letter, since there were not many ill-formed words that were more than one edit away from their canonical form [20]. In addition, it is also worth checking whether there is an extra space between two words, in order to supplement the other four processes, since there were more mistakes of this kind than expected.

A problem with our method, similar to one experienced by Han et al. (2011), is that if correct out-of-vocabulary words are identified as ill-formed, the candidate selection step can never be carried out correctly. In other words, words like "aviser" that were not found in the dictionary, although they are feasible words, had different candidates, such as "avisen" or "viser", and were therefore corrected wrongly in all cases [21].

[17] There were errors of this type, which is 15.88% of the words that were not corrected, or 8.85% of the total errors.
[18] Derived by omission of one letter.
[19] The method needs to take into account that most of the words that were out-of-vocabulary, and therefore considered ill-formed, were existing titles and authors of books that Nota's catalogue did not hold in audio-book form. In fact, there were many very specialised books, both in English and Danish, in the users' searches; many of these books were not in Nota's book catalogue.
[20] The processing time for finding all possible candidates at an edit distance of two was around 12 hours, while looking for words only one edit away took just a couple of minutes.
[21] Although it could be the case that the user really meant, for example, "avisen".

References

[1] Cook, P. and Stevenson, S. (2009): An unsupervised model for text message normalization. In CALC '09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity.
[2] Han, B. and Baldwin, T. (2011): Lexical normalisation of short text messages: makn sens a #twitter. In HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
[3] Han, B., Cook, P. and Baldwin, T. (2012): Automatically constructing a normalisation dictionary for microblogs. In EMNLP-CoNLL 2012.
[4] Jurafsky, D. and Martin, J. H. (2009): Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second edition.
[5] Liu, F., Weng, F. and Jiang, X. (2012): A broad-coverage normalization system for social media language. In ACL '12: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1.
[6] MacNeilage, P. F. (1964): Typing errors as clues to serial ordering mechanisms in language behaviour. Language and Speech, 7.

[7] Pedler, J. (2007): Computer Correction of Real-word Spelling Errors in Dyslexic Text. Birkbeck, University of London.
[8] Peterson, J. L. (1980): Computer programs for detecting and correcting spelling errors. Communications of the ACM, 23.

Online sources

Peter Norvig's spell-checker:


More information

On generating large-scale ground truth datasets for the deduplication of bibliographic records

On generating large-scale ground truth datasets for the deduplication of bibliographic records On generating large-scale ground truth datasets for the deduplication of bibliographic records James A. Hammerton j_hammerton@yahoo.co.uk Michael Granitzer mgrani@know-center.at Maya Hristakeva maya.hristakeva@mendeley.com

More information

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination.

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination. GCE Computing COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination 2510 Summer 2014 Version: 1.0 Further copies of this Report are available from aqa.org.uk

More information

Towards Unsupervised Word Error Correction in Textual Big Data

Towards Unsupervised Word Error Correction in Textual Big Data Towards Unsupervised Word Error Correction in Textual Big Data Joao Paulo Carvalho 1 and Sérgio Curto 1 1 INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, Portugal

More information

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

Charles Darwin University Library Client Survey Report

Charles Darwin University Library Client Survey Report Charles Darwin University Library Client Survey Report May 2010 Insync Surveys Pty Ltd Melbourne Phone: +61 3 9909 9209 Fax: +61 3 9614 4460 Sydney Phone: +61 2 8081 2000 Fax: +61 2 9955 8929 Perth Phone:

More information

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

2014/02/13 Sphinx Lunch

2014/02/13 Sphinx Lunch 2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue

More information

Programming Exercises

Programming Exercises s CMPS 5P (Professor Theresa Migler-VonDollen ): Assignment #8 Problem 6 Problem 1 Programming Exercises Modify the recursive Fibonacci program given in the chapter so that it prints tracing information.

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier

More information

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services 21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

A Survey on Product Aspect Ranking

A Survey on Product Aspect Ranking A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,

More information

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015 Starting a Python and an NLTK Session Open a Python 2.7 IDLE (Python GUI) window or a Python interpreter

More information

ANALYZING DATA USING TRANSANA SOFTWARE FOR INTERACTION IN COMPUTER SUPPORT FACE-TO-FACE COLLABORATIVE LEARNING (COSOFL) AMONG ESL PRE-SERVIVE TEACHER

ANALYZING DATA USING TRANSANA SOFTWARE FOR INTERACTION IN COMPUTER SUPPORT FACE-TO-FACE COLLABORATIVE LEARNING (COSOFL) AMONG ESL PRE-SERVIVE TEACHER 11 ANALYZING DATA USING TRANSANA SOFTWARE FOR INTERACTION IN COMPUTER SUPPORT FACE-TO-FACE COLLABORATIVE LEARNING (COSOFL) AMONG ESL PRE-SERVIVE TEACHER Abdul Rahim Hj Salam 1 Assoc. Prof Dr Zaidatun Tasir

More information

Design guide. Design denmark. Design guide Version 1.0 November 2014

Design guide. Design denmark. Design guide Version 1.0 November 2014 Design guide Design guide Version 1.0 November 2014 is an alliance of designers, design thinkers and design businesses, working Design guide About the new Design denmark visual identity The new visual

More information

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting Inf1-DA 2010 2011 III: 1 / 89 Informatics 1 School of Informatics, University of Edinburgh Data and Analysis Part III Unstructured Data Ian Stark February 2011 Inf1-DA 2010 2011 III: 2 / 89 Part III Unstructured

More information

Office of History. Using Code ZH Document Management System

Office of History. Using Code ZH Document Management System Office of History Document Management System Using Code ZH Document The ZH Document (ZH DMS) uses a set of integrated tools to satisfy the requirements for managing its archive of electronic documents.

More information

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi me@shahroozfarahmand.com

More information

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share.

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share. LING115 Lecture Note Session #4 Python (1) 1. Introduction As we have seen in previous sessions, we can use Linux shell commands to do simple text processing. We now know, for example, how to count words.

More information

Reputation Management System

Reputation Management System Reputation Management System Mihai Damaschin Matthijs Dorst Maria Gerontini Cihat Imamoglu Caroline Queva May, 2012 A brief introduction to TEX and L A TEX Abstract Chapter 1 Introduction Word-of-mouth

More information

Abstract. Description

Abstract. Description Project title: Bloodhound: Dynamic client-side autocompletion features for the Apache Bloodhound ticket system Name: Sifa Sensay Student e-mail: sifasensay@gmail.com Student Major: Software Engineering

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Unit 5.1 The Database Concept

Unit 5.1 The Database Concept Unit 5.1 The Database Concept Candidates should be able to: What is a Database? A database is a persistent, organised store of related data. Persistent Data and structures are maintained when data handling

More information

Electronic Document Management Using Inverted Files System

Electronic Document Management Using Inverted Files System EPJ Web of Conferences 68, 0 00 04 (2014) DOI: 10.1051/ epjconf/ 20146800004 C Owned by the authors, published by EDP Sciences, 2014 Electronic Document Management Using Inverted Files System Derwin Suhartono,

More information

Resolving Common Analytical Tasks in Text Databases

Resolving Common Analytical Tasks in Text Databases Resolving Common Analytical Tasks in Text Databases The work is funded by the Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD15010B. Database Systems and Text-based Information

More information

Text Processing (Business Professional)

Text Processing (Business Professional) Text Processing (Business Professional) Unit Title: Medical Audio-Transcription OCR unit number: 06995 Level: 2 Credit value: 5 Guided learning hours: 50 Unit reference number: A/505/7087 Unit aim This

More information

USERV Auto Insurance Rule Model in Corticon

USERV Auto Insurance Rule Model in Corticon USERV Auto Insurance Rule Model in Corticon Mike Parish Progress Software Contents Introduction... 3 Vocabulary... 4 Database Connectivity... 4 Overall Structure of the Decision... 6 Preferred Clients...

More information

GDP11 Student User s Guide. V. 1.7 December 2011

GDP11 Student User s Guide. V. 1.7 December 2011 GDP11 Student User s Guide V. 1.7 December 2011 Contents Getting Started with GDP11... 4 Program Structure... 4 Lessons... 4 Lessons Menu... 4 Navigation Bar... 5 Student Portfolio... 5 GDP Technical Requirements...

More information

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Twitter Stock Bot John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Hassaan Markhiani The University of Texas at Austin hassaan@cs.utexas.edu Abstract The stock market is influenced

More information

Learning Disabilities. Strategies for the classroom

Learning Disabilities. Strategies for the classroom Learning Disabilities Strategies for the classroom A learning disability is a neurological condition that interferes with a person s ability to store, process or produce information. Common Disabilities

More information

Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes

Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes Presented By: Andrew McMurry & Britt Fitch (Apache ctakes committers) Co-authors: Guergana Savova, Ben Reis,

More information

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and

More information

Innovation with the hyper library

Innovation with the hyper library Innovation with the hyper library Innovation is one of three top priorities of Aalborg University 1, as it appears in the development contract between Aalborg University and the Danish ministry of Science,

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Text Processing (Business Professional)

Text Processing (Business Professional) Unit Title: Audio-Transcription OCR unit number: 06976 Level: 2 Credit value: 4 Guided learning hours: 40 Unit reference number: F/505/7088 Unit aim Text Processing (Business Professional) This unit aims

More information

CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

More information

RRSS - Rating Reviews Support System purpose built for movies recommendation

RRSS - Rating Reviews Support System purpose built for movies recommendation RRSS - Rating Reviews Support System purpose built for movies recommendation Grzegorz Dziczkowski 1,2 and Katarzyna Wegrzyn-Wolska 1 1 Ecole Superieur d Ingenieurs en Informatique et Genie des Telecommunicatiom

More information

Research on Sentiment Classification of Chinese Micro Blog Based on

Research on Sentiment Classification of Chinese Micro Blog Based on Research on Sentiment Classification of Chinese Micro Blog Based on Machine Learning School of Economics and Management, Shenyang Ligong University, Shenyang, 110159, China E-mail: 8e8@163.com Abstract

More information

Taxonomies in Practice Welcome to the second decade of online taxonomy construction

Taxonomies in Practice Welcome to the second decade of online taxonomy construction Building a Taxonomy for Auto-classification by Wendi Pohs EDITOR S SUMMARY Taxonomies have expanded from browsing aids to the foundation for automatic classification. Early auto-classification methods

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

Introduction to Python

Introduction to Python WEEK ONE Introduction to Python Python is such a simple language to learn that we can throw away the manual and start with an example. Traditionally, the first program to write in any programming language

More information

The Design of a Proofreading Software Service

The Design of a Proofreading Software Service The Design of a Proofreading Software Service Raphael Mudge Automattic Washington, DC 20036 raffi@automattic.com Abstract Web applications have the opportunity to check spelling, style, and grammar using

More information

User Manual. Learning Management System COMSATS Virtual Campus

User Manual. Learning Management System COMSATS Virtual Campus User Manual Learning Management System COMSATS Virtual Campus Table of Contents Overview... 3 The LMS Home Screen... 4 The Main Menu bar... 4 1. LMS Home:... 4 2. About LMS:... 4 3. Contacts:... 4 4. Login

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

A New Web Site Startup Checklist aka Scott Fox s Twelve Step Program for Setting Up a New Web Site

A New Web Site Startup Checklist aka Scott Fox s Twelve Step Program for Setting Up a New Web Site INTERNET RICHES The Simple Money-making Secrets of Online Millionaires By Scott Fox American Management Association (AMACOM) - ISBN: 978-0814473563 A New Web Site Startup Checklist aka Scott Fox s Twelve

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

HPI in-memory-based database system in Task 2b of BioASQ

HPI in-memory-based database system in Task 2b of BioASQ CLEF 2014 Conference and Labs of the Evaluation Forum BioASQ workshop HPI in-memory-based database system in Task 2b of BioASQ Mariana Neves September 16th, 2014 Outline 2 Overview of participation Architecture

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Python Loops and String Manipulation

Python Loops and String Manipulation WEEK TWO Python Loops and String Manipulation Last week, we showed you some basic Python programming and gave you some intriguing problems to solve. But it is hard to do anything really exciting until

More information

Writing Style Guide Updated January 2015

Writing Style Guide Updated January 2015 Writing Style Guide Updated January 2015 1 Introduction The Wentworth Institute of Technology Writing Style Guide includes information not only on style rules particular to Wentworth, but also commonly

More information

How To Find Out What Political Sentiment Is On Twitter

How To Find Out What Political Sentiment Is On Twitter Predicting Elections with Twitter What 140 Characters Reveal about Political Sentiment Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, Isabell M. Welpe Workshop Election Forecasting 15 July 2013

More information

NetOwl(TM) Extractor Technical Overview March 1997

NetOwl(TM) Extractor Technical Overview March 1997 NetOwl(TM) Extractor Technical Overview March 1997 1 Overview NetOwl Extractor is an automatic indexing system that finds and classifies key phrases in text, such as personal names, corporate names, place

More information

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture A prototype infrastructure for D Spin Services based on a flexible multilayer architecture Volker Boehlke 1,, 1 NLP Group, Department of Computer Science, University of Leipzig, Johanisgasse 26, 04103

More information

Python Lists and Loops

Python Lists and Loops WEEK THREE Python Lists and Loops You ve made it to Week 3, well done! Most programs need to keep track of a list (or collection) of things (e.g. names) at one time or another, and this week we ll show

More information

Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text

Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text Matthew Cooper FX Palo Alto Laboratory Palo Alto, CA 94034 USA cooper@fxpal.com ABSTRACT Video is becoming a prevalent medium

More information

DYNAMIC QUERY FORMS WITH NoSQL

DYNAMIC QUERY FORMS WITH NoSQL IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH

More information

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process

More information

RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI

RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI RA MODEL VISUALIZATION WITH MICROSOFT EXCEL 2013 AND GEPHI Prepared for Prof. Martin Zwick December 9, 2014 by Teresa D. Schmidt (tds@pdx.edu) 1. DOWNLOADING AND INSTALLING USER DEFINED SPLIT FUNCTION

More information

How To Use Gps Navigator On A Mobile Phone

How To Use Gps Navigator On A Mobile Phone Software Requirements Specification Amazing Lunch Indicator Sarah Geagea 881024-4940 Sheng Zhang 850820-4735 Niclas Sahlin 880314-5658 Faegheh Hasibi 870625-5166 Farhan Hameed 851007-9695 Elmira Rafiyan

More information

1 Which of the following questions can be answered using the goal flow report?

1 Which of the following questions can be answered using the goal flow report? 1 Which of the following questions can be answered using the goal flow report? [A] Are there a lot of unexpected exits from a step in the middle of my conversion funnel? [B] Do visitors usually start my

More information

8 Simple Things You Might Be Overlooking In Your AdWords Account. A WordStream Guide

8 Simple Things You Might Be Overlooking In Your AdWords Account. A WordStream Guide 8 Simple Things You Might Be Overlooking In Your AdWords Account A WordStream Guide 8 Simple Things You Might Be Overlooking In Your AdWords Account AdWords makes it incredibly easy to set up and run a

More information

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A Introduction to IR Systems: Supporting Boolean Text Search Chapter 27, Part A Database Management Systems, R. Ramakrishnan 1 Information Retrieval A research field traditionally separate from Databases

More information

Cross-lingual Synonymy Overlap

Cross-lingual Synonymy Overlap Cross-lingual Synonymy Overlap Anca Dinu 1, Liviu P. Dinu 2, Ana Sabina Uban 2 1 Faculty of Foreign Languages and Literatures, University of Bucharest 2 Faculty of Mathematics and Computer Science, University

More information

Anotaciones semánticas: unidades de busqueda del futuro?

Anotaciones semánticas: unidades de busqueda del futuro? Anotaciones semánticas: unidades de busqueda del futuro? Hugo Zaragoza, Yahoo! Research, Barcelona Jornadas MAVIR Madrid, Nov.07 Document Understanding Cartoon our work! Complexity of Document Understanding

More information

C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE SECONDARY EDUCATION CERTIFICATE EXAMINATION MAY/JUNE 2011

C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE SECONDARY EDUCATION CERTIFICATE EXAMINATION MAY/JUNE 2011 C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE SECONDARY EDUCATION CERTIFICATE EXAMINATION MAY/JUNE 2011 ECONOMICS GENERAL PROFICIENCY EXAMINATION Copyright 2011

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

Glossary of translation tool types

Glossary of translation tool types Glossary of translation tool types Tool type Description French equivalent Active terminology recognition tools Bilingual concordancers Active terminology recognition (ATR) tools automatically analyze

More information

Kaspersky Whitelisting Database Test

Kaspersky Whitelisting Database Test Kaspersky Whitelisting Database Test A test commissioned by Kaspersky Lab and performed by AV-Test GmbH Date of the report: February 14 th, 2013, last update: April 4 th, 2013 Summary During November 2012

More information