Unsupervised Anomaly Detection
|
|
|
- Brett Lloyd
- 9 years ago
- Views:
Transcription
1 Unsupervised Anomaly Detection David Guthrie, Louise Guthrie, Ben Allison, Yorick Wilks University of Sheffield Natural Language Processing Group Regent Court, 211 Portobello St., Sheffield, England S1 4DP {dguthrie, louise, ben, Abstract This paper describes work on the detection of anomalous material in text. We show several variants of an automatic technique for identifying an 'unusual' segment within a document, and consider texts which are unusual because of author, genre [Biber, 1998], topic or emotional tone. We evaluate the technique using many experiments over large document collections, created to contain randomly inserted anomalous. In order to successfully identify anomalies in text, we define more than 200 stylistic features to characterize writing, some of which are well-established stylistic determiners, but many of which are novel. Using these features with each of our methods, we examine the effect of segment size on our ability to detect anomaly, allowing of size 100 words, 500 words and 1000 words. We show substantial improvements over a baseline in all cases for all methods, and identify the method variant which performs consistently better than others. 1 Introduction Anomaly detection refers to the task of identifying documents, or of text, that are unusual or different from normal text. Previous work on anomaly detection (sometimes called novelty detection) has assumed the existence of a collection of data that defines "normal" [e.g. Allison and Guthrie, 2006; Markou and Singh, 2003], which was used to model the normal population, and methods were developed to identify data that differs significantly from this model. As an extension to some of that work, this paper describes a more challenging anomaly detection scenario, where we assume that we have no data with which to characterize "normal" language. The techniques used for this task do not make use of any training data for either the normal or the anomalous populations and so are referred to as unsupervised. In this unsupervised scenario of anomaly detection, the task is to find which parts of a collection or document are most anomalous with respect to the rest of the collection. For instance, if we had a collection of news stories with one fictional story inserted, we would want to identity this fictional story as anomalous, because its language is anomalous with respect to the rest of the documents in the collection. In this example we have no prior knowledge or training data of what it means to be "normal", nor what it means to be news or fiction. As such, if the collection were switched to be mostly fiction stories and one news story then we would hope to identify the news story as anomalous with respect to the rest of the collection because it is a minority occurrence. There is a very strong unproven assumption that what is artificially inserted into a document or collection will be the most anomalous thing within that collection. While this might not be true in the general case, every attempt was made to ensure the cohesiveness of the collections before insertion to minimize the of finding genuine, unplanned anomalies. In preliminary experiments where a genuine anomaly did exist (for example, a large table or list), it was comforting to note that these sections were identified as anomalous. The focus of this paper is the identification of in a document that are anomalous. Identifying anomalies in is difficult because it requires sufficient repetition of phenomena found in small amounts of text. This segment level concentration steered us to make choices and develop techniques that are appropriate for characterizing and comparing smaller. There are several possibilities for the types of anomaly that might occur at the segment level. One simple situation is an off-topic discussion, where an advertisement or spam is inserted into a topic specific bulletin board. Another possibility is that one segment has been written by a different author from the rest of the document, as in the case of plagiarism. Plagiarism is notoriously difficult to detect when the source of the plagiarism can be found (using a search engine like Google or by comparison to the work of other students or writers.) In addition, the plagiarized are likely to be on the same topic as the rest of the document, so lexical choice often does not help to differentiate them. It is also possible for a segment to be anomalous because of a change in tone or attitude of the writing. The goal of this work is to develop a technique that will detect an anomalous segment in text without knowing in advance the kind of anomaly that is present. 1624
2 Unsupervised detection of small anomalous can not depend on the strategies for modeling language that are employed when training data is available. With a large amount of training data, we can build up an accurate characterization of the words in a document. These languagemodeling techniques make use of the distribution of the vocabulary in a document and, because language use and vocabulary are so diverse, it is necessary to train on a considerable amount of data to see the majority of cases (of any specific phenomenon) that might occur in a new document. If we have a more limited amount of data available, as in the of a document, it is necessary to characterize the language using techniques that are less dependent on the actual distribution of words in a document and thus less affected by the sparseness of language. In this paper we make use of techniques that employ some level of abstraction from words and focus on characterizing style, tone, and classes of lexical items. We approach the unsupervised anomaly detection task slightly differently than we would if we were carrying out unsupervised classification of text [Oakes, 1998; Clough, 2000]. In unsupervised classification (or clustering) the goal is to group similar objects into subsets; but in unsupervised anomaly detection we are interested in determining which are most different from the majority of the document. The techniques used here do not assume anomalous will be similar to each other: therefore we have not directly used clustering techniques, but rather developed methods that allow many different types of anomalous within one document or collection to be detected. 2 Characterizing Segments of Text The use of statistical methods with simple stylistic measures [Milic, 1967, 1991; Kenny, 1982] has proved effective in many areas of natural language processing such as genre detection [Kessler et al., 1997; Argamon et al., 1998; Maynard et al., 2001], authorship attribution [McEnery and Oakes, 2000; Clough et al., 2002; Wilks 2004], and detecting stylistic inconsistency [Glover and Hirst 1996; Morton, 1978, McColly, 1978; Smith, 1998]. Determining which stylistic measures are most effective can be difficult, but this paper uses features that have proved successful in the literature, as well as several novel features thought to capture the style of a segment of text. For the purposes of this work, we represent each segment of text as a vector, where the components of the vector are based on a variety of stylistic features. The goal of the work is to rank each segment based on how much it differs from the rest of the document. Given a document of n, we rank each of these from one to n based on how dissimilar they are to all other in the document and thus by their degree of anomaly. Components of our vector representation for a segment consist of simple surface features such as Average word and average sentence length, the average number of syllables per word, together with a range of Readability Measures [Stephens, 2000] such as Flesch-Kincaid Reading Ease, Gunning-Fog Index, and SMOG Index, some vocabulary richness measures such as: the percentage of words that occur only once, percentage of words which are among the most frequent words in the Gigaword newswire corpus (10 years of newswire), as well as the features described below. All are passed through the RASP (Robust and Accurate Statistical Parser) part-of-speech tagger developed at the Universities of Sussex and Cambridge. Words, symbols and punctuation are tagged with one of 155 part-ofspeech tags from the CLAWS 2 tagset. We use this markup to compute features that capture the distribution of part of speech tags. The representation of a segment includes the i) Percentages of words that are articles, prepositions, pronouns, conjunction, punctuation, adjectives, and adverbs ii) The ratio of adjectives to nouns iii) sentences that begin with a subordinating or coordinating conjunctions (but,so,then,yet,if,because,unless, or ) iv) Diversity of POS tri-grams - this measures the diversity in the structure of a text (number of unique POS trigrams divided by the total number of POS trigrams) Texts are also run through the RASP morphological analyzer, which produces words, lemmas and inflectional affixes. These are used to compute the i) passive sentences ii) nominalizations. 2.1 Rank Features Authors can often be distinguished by their preference for certain prepositions over others or their reliance on specific grammatical constructions. We capture these preferences by keeping a ranked list sorted by frequency of the: i) Most frequent POS tri-grams list ii) Most frequent POS bi-gram list iii) Most frequent POS list iv) MostfrequentArticleslist v) Most frequent Prepositions list vi) Most frequent Conjunctions list vii) Most frequent Pronouns list For each segment, the above lists are created both for the segment and for the complement of that segment. We use 1 r (where r is the Spearman Rank Correlation coefficient) to indicate the dissimilarity of each segment to its complement. 2.2 Characterizing Tone The General Inquirer Dictionary ( developed by the social science department at Harvard, contains mappings from words to social science content-analysis categories. These content-analysis categories attempt to capture the tone, attitude, outlook, or perspective of text and can be an important signal of anomaly. The Inquirer dictionary consists of 7,800 words mapped into 182 categories with most words assigned to more than one category. The two largest categories are positive and negative which account for 1,915 and 2,291 words respectively. Also included are all Har- 1625
3 vard IV-4 and Lasswell categories. We make use of these categories by keeping track of the percentage of words in a segment that fall into each category. 3 Method All experiments presented here are performed by characterizing each segment as well as characterizing the complement of that segment (the union of the remaining ). This involves constructing a vector of features for each segment and a vector of features for each segment's complement as well as a vector of lists (see Rank Features section) for each segment and its complement. So, for every segment in a document we have a total of 4 vectors: V1 - feature vector characterizing the segment V2 - feature vector characterizing the complement of the segment V3 - vector of lists for all rank features for the segment V4 - vector of lists for all rank features for the complement of the segment We next create a vector of list scores called V5 by computing the Spearman rank correlation coefficient for each pair of lists in vectors V3 and V4. (All numbers in V5 are actually 1-Spearman rank coefficient so that higher numbers mean more different). We next sum all values in V5 to produce a Rank Feature Difference Score. Finally, we compute the difference between two by taking the average difference in their feature vectors plus the Rank Feature Difference Score. 3.1 Standardizing Variables While most of the features in the feature vector are percentages (% of adjectives, % of negative words, % of words that occur frequently in the Gigaword, etc.) some are on a different scale, such as the readability formulae. To test the impact of different scales on the performance of the methods we also perform all tests after standardizing the variables. We do this by scaling all variables to values between zero and one. 4 Experiments and Results In each of the experiments below all test documents contain exactly one anomalous segment and exactly 50 "normal". Whilst in reality it may be true that multiple are anomalous within a document, there is nothing implicit in the method which looks for a single anomalous piece of text; for the sake of simplicity of evaluation, we insert only one anomalous segment per document. The method returns a list of all ranked by how anomalous they are with respect to the whole document. If the program has performed well, then the truly anomalous segment should be at the top of the list (or very close to the top). Our assumption is that a human wishing to detect anomaly would be pleased if they could find the truly anomalous segment in the top 5 or 10 marked most likely to be anomalous, rather than having to scan the whole document or collection. The work presented here looks only at fixed-length with pre-determined boundaries, while a real application of such a technique might be required to function with vast differences between the sizes of. Once again, there is nothing implicit in the method assuming fixedlength sizes, and the choice to fix certain parameters of the experiments is to better illustrate the effect of segment length on the performance of the method. One could then use either paragraph breaks as natural segment boundaries, or employ more sophisticated segmentation techniques. We introduce a baseline for the following experiments that is the probably of selecting the truly anomalous segment by. For instance, the probability of choosing the single anomalous segment in a document that is 51 long by when picking 3 is 1/51 + 1/50 + 1/49 or 6%. 4.1 Authorship Tests For these sets of experiments we examine whether it is possible to distinguish anomaly of authorship at the segment level. We test this by taking a document written by one author and inserting in it a segment written by a different author. We then see if this segment can be detected using our unsupervised anomaly techniques. We create our experimental data from a collection consisting of 50 thousand words of text written by each of 8 Victorian authors: Charlotte Bronte, Lewis Carroll, Arthur Conan Doyle, George Eliot, Henry James, Rudyard Kipling, Alfred Lord Tennyson, H.G. Wells. Test sets are created for each pair of authors by inserting a segment from one author into a document written by the other author. This creates 56 sets of experiments (one for each author inserted into every author.) For example we insert a segment, one at a time from Bronte into Carroll and anomaly detection is performed. Likewise we insert one at a time from Carroll into Bronte and perform anomaly detection. Our experiment is always to test if we can detect this inserted paragraph. For each of the 56 combinations of authors we insert 30 random from one into the other, one at a time. We performed 56 pairs * 30 insertions each = 1,680 sets of insertion experiments. For each of these 1,680 insertion experiments we also varied the segment size to test its effect on anomaly detection. We then count what percentage of the time this paragraph falls within the top 3, top 5, top 10, and top 20 labeled by the program as anomalous. The results shown here report the average accuracy for each segment size (over all authors and insertions). The average percent of time we can detect anomalous in the top n varies according to the segment size, and as expected, the average accuracy increases as the segment size increases. For 1000 word, anomalous segment was found in the top 20 ranked about 95% of the time (81% in the top ten, 74% of the time in the top 5 and 66% of the time in the top three ). For 500 word, average accuracy ranged from 76% 1626
4 down to 47% and for 100 word it ranged from 65% down to 27% Table 1: Average Results for Authorship Tests 4.2 Fact versus Opinion In another set of experiments we tested whether opinion can be detected in a factual story. Our opinion text is made up of editorials from 4 newspapers making up a total of 28,200 words. Our factual text is newswire randomly chosen from the English Gigaword corpus [Graff, 2003] and consists of 4 different 50,000 word one each from one of the 4 news wire services. Each opinion text segment is inserted into each newswire service one at a time for at least 25 insertions on each newswire. Tests are performed like the authorship tests using 3 different segment sizes. Results in this set of experiments were generally better and the average accuracy for 1000 word, top 20 ranking was 91% (70% in the top 3 ranked.) Small segment sizes of 100 words also yielded good results and found the anomaly in the top 20, 92% of the time (although only 40% of the time in the top Table 2: Average results for fact versus opinion 4.3 Newswire versus Chinese Translations In this set of experiments we test whether English translations of Chinese newspaper can be detected in a collection of English newswire. We used 35 thousand words of Chinese newspaper that have been translated into English using Google's Chinese to English translation engine. English newswire text is the same randomly chosen 50,000 word from the Gigaword corpus. Again, tests are run using the 3 different segment sizes. For 1000 word, the average accuracy for detecting the anomalous document among the top 3 ranked is 93% and in the top 10 ranked 100% of the time Table 3: Average Results for Newswire versus Chinese Translations 1627
5 5 Conclusions There experimental results are very positive and show that, with a large segment size, we can detect the anomalous segment within the top 5 very accurately. In the author experiments where we inserted a segment from one author into 50 by another, we can detect the anomalous author's paragraph in the top 5 anomalous returned 74% of the time (for a segment size of 1000 words), which is considerably better than. If you randomly choose 5 out of 51 then you only have a 10.2% of correctly guessing the segment. Other experiments yielded similarly encouraging figures. The task with the best overall results was detecting when a Chinese translation was inserted into a newswire document, followed surprisingly by the task of detecting opinion articles amongst facts. On the whole, our experiments show that standardizing the scores on a scale from 0 to 1 does indeed produce better results for some types of anomaly detection but not for all the tasks we performed. We believe that in all cases where it performed worse than the standard raw scores, were cases where the genre distinction was very great. We performed an additional genre test not reported in this paper where Anarchist's Cookbook were inserted into newswire. Many of the readability formulas, for instance, distinguish these genre differences quite well but their effects on anomaly detection are greatly reduced when we standardize these scores. References [Allison and Guthrie, 2006] Ben Allison and Louise Guthrie. Detecting Anomalous Documents, Forthcoming. [Argamon et al., 1998] Shlomo Argamon, Moshe Koppel, Galit Avneri. Routing documents according to style. In Proceedings of the First International Workshop on Innovative Internet Information Systems (IIIS-98), Pisa, Italy, [Biber, 1988] Douglas Biber. Variation across speech and writing. Cambridge University Press, Cambridge, [Clough, 2000] Paul Clough. Analyzing style readability. University of Sheffield, [Clough et al., 2002] Paul Clough, Rob Gaizauskas, Scott Piao, Yorick Wilks. Measuring Text Reuse. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, 2002 [Glover and Hirst 1996] Angela Glover and Graeme Hirst. Detecting stylistic inconsistencies in collaborative writing. In Sharples, Mike and van der Geest, Thea (eds.), The new writing environment: Writers at work in a world of technology, Springer-Verlag, London, [Graff, 2003] David Graff. English Gigaword. Linguistic Data Consortium, catalog number LDC2003T05, 2003 [Kenny, 1982] Anthony Kenny. The computation of style: An introduction to statistics for students of literature and humanities, Pergamon Press, Oxford, [Kessler et al., 1997] Brett Kessler, Geoffrey Nunberg, Hinrich Schütze. Automatic Detection of Text Genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 32-38, [Maynard et al., 2001] Diana Maynard, Valentin Tablan, Cristian Ursu, Hamish Cunningham, Yorick Wilks. Named Entity Recognition from Diverse Text Types. Recent Advances in Natural Language Processing Conference, Tzigov Chark, Bulgaria, [Markou and Singh, 2003] Markos Markou and Sameer Sing. Novelty Detection: a review- parts 1 and 2. Signal Processing, 83(12): , [McColly, 1987] William B. McColly. Style and structure in the Middle English poem Cleanness. Computers and the Humanities, 21: [McEnery and Oakes, 2000] Tony McEnery and Michael Oakes. Authorship Identification and Computational Stylometry. In Robert Dale, Hermann Moisl and Harlod Somers (eds.) Hanbook of Natural Language Processing, , Marcel Dekker, New York, [Milic, 1967] Louis Tonko Milic. A quantitative approach to the style of Johnathan Swift. Studies in English literature, 23:317, The Hague: Mouton, [Milic, 1991] Louis Tonko Milic. Progress in stylistics: Theory, statistics, computers. Computers and the Humanities, 25: , [Morton, 1978] Andrew Queen Morton. Literary detection: How to prove authorship and fraud in literature and documents. Bowker Publishing Company, Bath, England, [Oakes, 1998] Mickael Oakes. Statistics for Corpus Linguistics, Edinburgh Textbooks in Empirical Linguistics, Edinburgh, [Stephens, 2000] Cheryl Stephens. All about Readability. Plain Language Network, ty.html, [Smith, 1998] MWA Smith. The Authorship of Acts I and II of Pericles: A new approach using first words of speeches. Computers and the Humanities, 22:23-41, [Wilks, 2004] Yorick Wilks. On the Ownership of Text. Computers and the Humanities. 38(2): ,
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
Virginia English Standards of Learning Grade 8
A Correlation of Prentice Hall Writing Coach 2012 To the Virginia English Standards of Learning A Correlation of, 2012, Introduction This document demonstrates how, 2012, meets the objectives of the. Correlation
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Year 1 reading expectations (New Curriculum) Year 1 writing expectations (New Curriculum)
Year 1 reading expectations Year 1 writing expectations Responds speedily with the correct sound to graphemes (letters or groups of letters) for all 40+ phonemes, including, where applicable, alternative
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6 4 I. READING AND LITERATURE A. Word Recognition, Analysis, and Fluency The student
Author Identification for Turkish Texts
Çankaya Üniversitesi Fen-Edebiyat Fakültesi, Journal of Arts and Sciences Say : 7, May s 2007 Author Identification for Turkish Texts Tufan TAŞ 1, Abdul Kadir GÖRÜR 2 The main concern of author identification
Academic Standards for Reading, Writing, Speaking, and Listening June 1, 2009 FINAL Elementary Standards Grades 3-8
Academic Standards for Reading, Writing, Speaking, and Listening June 1, 2009 FINAL Elementary Standards Grades 3-8 Pennsylvania Department of Education These standards are offered as a voluntary resource
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
stress, intonation and pauses and pronounce English sounds correctly. (b) To speak accurately to the listener(s) about one s thoughts and feelings,
Section 9 Foreign Languages I. OVERALL OBJECTIVE To develop students basic communication abilities such as listening, speaking, reading and writing, deepening their understanding of language and culture
Simple maths for keywords
Simple maths for keywords Adam Kilgarriff Lexical Computing Ltd [email protected] Abstract We present a simple method for identifying keywords of one corpus vs. another. There is no one-sizefits-all
LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5
Page 1 of 57 Grade 3 Reading Literary Text Principles of Reading (P) Standard 1: Demonstrate understanding of the organization and basic features of print. Standard 2: Demonstrate understanding of spoken
Tagging with Hidden Markov Models
Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland
10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words
Standard 3: Writing Process 3.1: Prewrite 58-69% 10.LA.3.1.2 Generate a main idea or thesis appropriate to a type of writing. (753.02.b) Items may include a specified purpose, audience, and writing outline.
Grade Genre Skills Lessons Mentor Texts and Resources 6 Grammar To Be Covered
Grade Genre Skills Lessons Mentor Texts and Resources 6 Grammar To Be Covered 6 Personal Narrative Parts of speech (noun, adj, verb, adv) Complete sentence (subj. and verb) Capitalization Tense (identify)
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy
Grade: 9 (1) Students will build a framework for high school level academic writing by understanding the what of language, including:
Introduction: The following document is a draft of standards-designed, comprehensive Pacing Guide for high school English Grade 9. This document will evolve as feedback is accumulated. The Pacing Guide
COURSE OBJECTIVES SPAN 100/101 ELEMENTARY SPANISH LISTENING. SPEAKING/FUNCTIONAl KNOWLEDGE
SPAN 100/101 ELEMENTARY SPANISH COURSE OBJECTIVES This Spanish course pays equal attention to developing all four language skills (listening, speaking, reading, and writing), with a special emphasis on
Understanding Clauses and How to Connect Them to Avoid Fragments, Comma Splices, and Fused Sentences A Grammar Help Handout by Abbie Potter Henry
Independent Clauses An independent clause (IC) contains at least one subject and one verb and can stand by itself as a simple sentence. Here are examples of independent clauses. Because these sentences
Association Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
CENTRAL TEXAS COLLEGE SYLLABUS FOR DIRW 0305 PRINCIPLES OF ACADEMIC LITERACY. Semester Hours Credit: 3 INSTRUCTOR: OFFICE HOURS:
CENTRAL TEXAS COLLEGE SYLLABUS FOR DIRW 0305 PRINCIPLES OF ACADEMIC LITERACY Semester Hours Credit: 3 INSTRUCTOR: OFFICE HOURS: I. INTRODUCTION Principles of Academic Literacy (DIRW 0305) is a Non-Course-Based-Option
1. Define and Know (D) 2. Recognize (R) 3. Apply automatically (A) Objectives What Students Need to Know. Standards (ACT Scoring Range) Resources
T 1. Define and Know (D) 2. Recognize (R) 3. Apply automatically (A) ACT English Grade 10 Rhetorical Skills Organization (15%) Make decisions about order, coherence, and unity Logical connections between
Cambridge Primary English as a Second Language Curriculum Framework
Cambridge Primary English as a Second Language Curriculum Framework Contents Introduction Stage 1...2 Stage 2...5 Stage 3...8 Stage 4... 11 Stage 5...14 Stage 6... 17 Welcome to the Cambridge Primary English
A Guide to Cambridge English: Preliminary
Cambridge English: Preliminary, also known as the Preliminary English Test (PET), is part of a comprehensive range of exams developed by Cambridge English Language Assessment. Cambridge English exams have
Level 4 Certificate in English for Business
Level 4 Certificate in English for Business LCCI International Qualifications Syllabus Effective from January 2006 For further information contact us: Tel. +44 (0) 8707 202909 Email. [email protected]
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS Gürkan Şahin 1, Banu Diri 1 and Tuğba Yıldız 2 1 Faculty of Electrical-Electronic, Department of Computer Engineering
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details
Strand: Reading Literature Key Ideas and Details Craft and Structure RL.3.1 Ask and answer questions to demonstrate understanding of a text, referring explicitly to the text as the basis for the answers.
No Evidence. 8.9 f X
Section I. Correlation with the 2010 English Standards of Learning and Curriculum Framework- Grade 8 Writing Summary Adequate Rating Limited No Evidence Section I. Correlation with the 2010 English Standards
C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER
INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process
National Quali cations SPECIMEN ONLY
N5 SQ40/N5/02 FOR OFFICIAL USE National Quali cations SPECIMEN ONLY Mark Urdu Writing Date Not applicable Duration 1 hour and 30 minutes *SQ40N502* Fill in these boxes and read what is printed below. Full
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
English Appendix 2: Vocabulary, grammar and punctuation
English Appendix 2: Vocabulary, grammar and punctuation The grammar of our first language is learnt naturally and implicitly through interactions with other speakers and from reading. Explicit knowledge
The Oxford Learner s Dictionary of Academic English
ISEJ Advertorial The Oxford Learner s Dictionary of Academic English Oxford University Press The Oxford Learner s Dictionary of Academic English (OLDAE) is a brand new learner s dictionary aimed at students
PTE Academic Preparation Course Outline
PTE Academic Preparation Course Outline August 2011 V2 Pearson Education Ltd 2011. No part of this publication may be reproduced without the prior permission of Pearson Education Ltd. Introduction The
2016-2017 Curriculum Catalog
2016-2017 Curriculum Catalog 2016 Glynlyon, Inc. Table of Contents LANGUAGE ARTS 600 COURSE OVERVIEW... 1 UNIT 1: ELEMENTS OF GRAMMAR... 3 UNIT 2: GRAMMAR USAGE... 3 UNIT 3: READING SKILLS... 4 UNIT 4:
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
ESL 005 Advanced Grammar and Paragraph Writing
ESL 005 Advanced Grammar and Paragraph Writing Professor, Julie Craven M/Th: 7:30-11:15 Phone: (760) 355-5750 Units 5 Email: [email protected] Code: 30023 Office: 2786 Room: 201 Course Description:
Academic Standards for Reading, Writing, Speaking, and Listening
Academic Standards for Reading, Writing, Speaking, and Listening Pre-K - 3 REVISED May 18, 2010 Pennsylvania Department of Education These standards are offered as a voluntary resource for Pennsylvania
How to Paraphrase Reading Materials for Successful EFL Reading Comprehension
Kwansei Gakuin University Rep Title Author(s) How to Paraphrase Reading Materials Comprehension Hase, Naoya, 長 谷, 尚 弥 Citation 言 語 と 文 化, 12: 99-110 Issue Date 2009-02-20 URL http://hdl.handle.net/10236/1658
PROGRAMME ANNEE SCOLAIRE 2015/2016. Matière/Subject : English Langue d enseignement/teaching Language : English Nom de l enseignant/teacher s Name:
Classe/Class : CM2 Section : Matière/Subject : Langue d enseignement/teaching Language : Nom de l enseignant/teacher s Name: Nombre d heures hebdomadaires/number of hours per week : Approx. 4½ hours. Présentation
An Information Retrieval using weighted Index Terms in Natural Language document collections
Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia
Counting Money and Making Change Grade Two
Ohio Standards Connection Number, Number Sense and Operations Benchmark D Determine the value of a collection of coins and dollar bills. Indicator 4 Represent and write the value of money using the sign
AK + ASD Writing Grade Level Expectations For Grades 3-6
Revised ASD June 2004 AK + ASD Writing For Grades 3-6 The first row of each table includes a heading that summarizes the performance standards, and the second row includes the complete performance standards.
Language Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
Discourse Markers in English Writing
Discourse Markers in English Writing Li FENG Abstract Many devices, such as reference, substitution, ellipsis, and discourse marker, contribute to a discourse s cohesion and coherence. This paper focuses
Data Deduplication in Slovak Corpora
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain
EAS Basic Outline. Overview
EAS Basic Outline Overview This is the course outline for your English Language Basic Course. This course is delivered at pre intermediate level of English, and the course book that you will be using is
EAP 1161 1660 Grammar Competencies Levels 1 6
EAP 1161 1660 Grammar Competencies Levels 1 6 Grammar Committee Representatives: Marcia Captan, Maria Fallon, Ira Fernandez, Myra Redman, Geraldine Walker Developmental Editor: Cynthia M. Schuemann Approved:
12 FIRST QUARTER. Class Assignments
August 7- Go over senior dates. Go over school rules. 12 FIRST QUARTER Class Assignments August 8- Overview of the course. Go over class syllabus. Handout textbooks. August 11- Part 2 Chapter 1 Parts of
The University of Texas at Austin
The University of Texas at Austin Performing Arts Center Curriculum Guide Series Music Reviews A Genre Study Includes introduction, resources, standards, and student handouts. Educational Programs Coordinator
Cambridge International Examinations Cambridge International General Certificate of Secondary Education
Cambridge International Examinations Cambridge International General Certificate of Secondary Education FIRST LANGUAGE ENGLISH 0522/03 Paper 3 Directed Writing and Composition For Examination from 2015
CS 533: Natural Language. Word Prediction
CS 533: Natural Language Processing Lecture 03 N-Gram Models and Algorithms CS 533: Natural Language Processing Lecture 01 1 Word Prediction Suppose you read the following sequence of words: Sue swallowed
Common Core Progress English Language Arts
[ SADLIER Common Core Progress English Language Arts Aligned to the [ Florida Next Generation GRADE 6 Sunshine State (Common Core) Standards for English Language Arts Contents 2 Strand: Reading Standards
Context Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 [email protected] [email protected]
Reliable and Cost-Effective PoS-Tagging
Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,[email protected] Abstract In order to achieve
Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details
Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details Craft and Structure RL.5.1 Quote accurately from a text when explaining what the text says explicitly and when
Units of Study 9th Grade
Units of Study 9th Grade First Semester Theme: The Journey Second Semester Theme: Choices The Big Ideas in English Language Arts that drive instruction: Independent thinkers construct meaning through language.
Multiple Imputation for Missing Data: A Cautionary Tale
Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust
Automatic Text Analysis Using Drupal
Automatic Text Analysis Using Drupal By Herman Chai Computer Engineering California Polytechnic State University, San Luis Obispo Advised by Dr. Foaad Khosmood June 14, 2013 Abstract Natural language processing
KINDGERGARTEN. Listen to a story for a particular reason
KINDGERGARTEN READING FOUNDATIONAL SKILLS Print Concepts Follow words from left to right in a text Follow words from top to bottom in a text Know when to turn the page in a book Show spaces between words
Using Machine Learning Techniques for Stylometry
Using Machine Learning Techniques for Stylometry Ramyaa, Congzhou He, Khaled Rasheed Artificial Intelligence Center The University of Georgia, Athens, GA 30602 USA Abstract In this paper we describe our
How to write a technique essay? A student version. Presenter: Wei-Lun Chao Date: May 17, 2012
How to write a technique essay? A student version Presenter: Wei-Lun Chao Date: May 17, 2012 1 Why this topic? Don t expect others to revise / modify your paper! Everyone has his own writing / thinkingstyle.
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Q FACTOR ANALYSIS (Q-METHODOLOGY) AS DATA ANALYSIS TECHNIQUE
Q FACTOR ANALYSIS (Q-METHODOLOGY) AS DATA ANALYSIS TECHNIQUE Gabor Manuela Rozalia Petru Maior Univerity of Tg. Mure, Faculty of Economic, Legal and Administrative Sciences, [email protected], 0742
Rubrics for Assessing Student Writing, Listening, and Speaking High School
Rubrics for Assessing Student Writing, Listening, and Speaking High School Copyright by the McGraw-Hill Companies, Inc. All rights reserved. Permission is granted to reproduce the material contained herein
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter
Narrative Literature Response Letters Grade Three
Ohio Standards Connection Writing Applications Benchmark A Write narrative accounts that develop character, setting and plot. Indicator: 1 Write stories that sequence events and include descriptive details
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate
Grade 4 Writing Assessment. Eligible Texas Essential Knowledge and Skills
Grade 4 Writing Assessment Eligible Texas Essential Knowledge and Skills STAAR Grade 4 Writing Assessment Reporting Category 1: Composition The student will demonstrate an ability to compose a variety
English Language Proficiency Standards: At A Glance February 19, 2014
English Language Proficiency Standards: At A Glance February 19, 2014 These English Language Proficiency (ELP) Standards were collaboratively developed with CCSSO, West Ed, Stanford University Understanding
Teaching Vocabulary to Young Learners (Linse, 2005, pp. 120-134)
Teaching Vocabulary to Young Learners (Linse, 2005, pp. 120-134) Very young children learn vocabulary items related to the different concepts they are learning. When children learn numbers or colors in
Albert Pye and Ravensmere Schools Grammar Curriculum
Albert Pye and Ravensmere Schools Grammar Curriculum Introduction The aim of our schools own grammar curriculum is to ensure that all relevant grammar content is introduced within the primary years in
the treasure of lemon brown by walter dean myers
the treasure of lemon brown by walter dean myers item analysis for all grade 7 standards: vocabulary, reading, writing, conventions item analysis for all grade 8 standards: vocabulary, reading, writing,
Meeting the Standard in North Carolina
KINDERGARTEN 1.02 Demonstrate understanding of the sounds of letters and understanding that words begin and end alike (onsets and rhymes). 1.03 Recognize and name upper and lower case letters of the alphabet.
A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks
A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,
OUTLIER ANALYSIS. Data Mining 1
OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,
CST and CAHSEE Academic Vocabulary
CST and CAHSEE Academic Vocabulary Grades K 12 Math and ELA This document references Academic Language used in the Released Test Questions from the 2008 posted CAHSEE Released Test Questions (RTQs) and
ESL I English as a Second Language I Curriculum
ESL I English as a Second Language I Curriculum ESL Curriculum alignment with NJ English Language Proficiency Standards (Incorporating NJCCCS and WIDA Standards) Revised November, 2011 The ESL program
Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction
Sentiment Analysis of Movie Reviews and Twitter Statuses Introduction Sentiment analysis is the task of identifying whether the opinion expressed in a text is positive or negative in general, or about
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
3rd Grade - ELA Writing
3rd Grade - ELA Text Types and Purposes College & Career Readiness 1. Opinion Write arguments to support claims in an analysis of substantive topics or texts, using valid reasoning and relevant and sufficient
Log-Linear Models. Michael Collins
Log-Linear Models Michael Collins 1 Introduction This note describes log-linear models, which are very widely used in natural language processing. A key advantage of log-linear models is their flexibility:
This image cannot currently be displayed. Course Catalog. Language Arts 600. 2016 Glynlyon, Inc.
This image cannot currently be displayed. Course Catalog Language Arts 600 2016 Glynlyon, Inc. Table of Contents COURSE OVERVIEW... 1 UNIT 1: ELEMENTS OF GRAMMAR... 3 UNIT 2: GRAMMAR USAGE... 3 UNIT 3:
English Grammar Passive Voice and Other Items
English Grammar Passive Voice and Other Items In this unit we will finish our look at English grammar. Please be aware that you will have only covered the essential basic grammar that is commonly taught
English-Japanese machine translation
[Information processing: proceedings of the International Conference on Information Processing, Unesco, Paris 1959] 163 English-Japanese machine translation By S. Takahashi, H. Wada, R. Tadenuma and S.
Common Core Progress English Language Arts. Grade 3
[ SADLIER Common Core Progress English Language Arts Aligned to the Florida [ GRADE 6 Next Generation Sunshine State (Common Core) Standards for English Language Arts Contents 2 Strand: Reading Standards
Progression in recount
Progression in recount Purpose Recounts (or accounts as they are sometimes called) are the most common kind of texts we encounter and create. Their primary purpose is to retell events. They are the basic
How To Write The English Language Learner Can Do Booklet
WORLD-CLASS INSTRUCTIONAL DESIGN AND ASSESSMENT The English Language Learner CAN DO Booklet Grades 9-12 Includes: Performance Definitions CAN DO Descriptors For use in conjunction with the WIDA English
The. Languages Ladder. Steps to Success. The
The Languages Ladder Steps to Success The What is it? The development of a national recognition scheme for languages the Languages Ladder is one of three overarching aims of the National Languages Strategy.
Name: Note that the TEAS 2009 score report for reading has the following subscales:
Name: Writing, Reading, and Language Center Software Activities Relevant for TEAS Reading Preparation The WRLC is located in room MT-020, below the library. These activities correspond roughly to TEAS
