Learning To Deal With Little Or No Annotated Data
1 Learning To Deal With Little Or No Annotated Data
Information Sciences Institute and Department of Computer Science, University of Southern California
4676 Admiralty Way, Suite 1001, Marina del Rey, CA
2 Overview
There is no better data than more data. Annotating data is more cost-effective than writing rules manually. Still, annotating data is expensive. How can we annotate as little data as possible?
- Active Learning
- Bootstrapping
- Co-training
- Unsupervised Learning
  - Pattern Discovery
  - Hidden Variables (the EM algorithm)
- Corpus Exploitation for Summarization
3 Choosing between confusables [Banko and Brill, ACL-2001] (two, too, to) (principal, principle) (then, than)
4 Base Noun Phrase Chunking [Ngai and Yarowsky, ACL-2000]
- Asked human judges to write rules that can be used to identify base noun phrases, and automatically integrated those rules into a rule-based chunker.
- Asked human judges to annotate base noun phrases in naturally occurring text, and trained an ML-based system to recognize these phrases.
- Compared the performance of the rule-based and ML-based systems.
5 It pays off to annotate data
6 It matters who annotates the data
7 How can we do well while annotating less data?
- Active learning
  - with one classifier
  - with a committee of classifiers
- Bootstrapping
  - with one classifier
  - with a committee of classifiers
- Co-training
8 Active learning with one classifier
1. Input: small annotated corpus + large un-annotated corpus.
2. Train classifier on annotated data.
3. Apply classifier to unlabeled examples.
4. Elicit human judgments for the examples on which the classifier had the lowest confidence.
5. Add the new labeled data to the annotated corpus.
6. Retrain classifier and test on held-out data.
7. If improvement, go to 2.
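The loop above can be sketched in a few lines of Python. The 1-D threshold classifier, the `confidence` function, and the `oracle` (standing in for the human judge) are toy stand-ins, not part of the original slides:

```python
import math

def train(labeled):
    """Toy 1-D classifier: decision threshold halfway between the class means."""
    m0 = sum(x for x, y in labeled if y == 0) / sum(1 for _, y in labeled if y == 0)
    m1 = sum(x for x, y in labeled if y == 1) / sum(1 for _, y in labeled if y == 1)
    return (m0 + m1) / 2.0

def confidence(threshold, x):
    """Probability of the most likely label; lowest near the decision boundary."""
    p1 = 1.0 / (1.0 + math.exp(-(x - threshold)))
    return max(p1, 1.0 - p1)

def oracle(x):
    """Stands in for the human judge."""
    return 1 if x >= 5 else 0

labeled = [(0, 0), (9, 1)]            # small annotated corpus
unlabeled = [1, 2, 3, 4, 5, 6, 7, 8]  # large un-annotated corpus

for _ in range(3):
    threshold = train(labeled)                          # train on annotated data
    unlabeled.sort(key=lambda x: confidence(threshold, x))
    queries, unlabeled = unlabeled[:2], unlabeled[2:]   # least-confident examples
    labeled += [(x, oracle(x)) for x in queries]        # elicit human judgments
```

Each round, the examples nearest the decision boundary are sent to the annotator first, which is the point of uncertainty-based selection.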
9 Active learning with multiple classifiers
1. Input: small annotated corpus + large un-annotated corpus.
2. Train multiple classifiers on annotated data.
3. Apply classifiers to unlabeled examples.
4. Elicit human judgments for the examples on which the classifiers agree the least.
5. Add the new labeled data to the annotated corpus.
6. Retrain classifiers and test on held-out data.
7. If improvement, go to 2.
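A minimal sketch of the committee selection step: rank unlabeled examples by how much the committee members disagree, and send the most contested ones to the annotator first. The three toy rule classifiers and integer "examples" are invented for illustration:

```python
from collections import Counter

def disagreement(votes):
    """Fraction of committee members that disagree with the majority vote."""
    top = Counter(votes).most_common(1)[0][1]
    return 1.0 - top / len(votes)

# three cheap stand-in classifiers voting on integer "examples"
committee = [lambda x: x > 4, lambda x: x > 5, lambda x: x % 2 == 0]
pool = list(range(10))

# query the human first on the examples the committee agrees on least
ranked = sorted(pool, key=lambda x: disagreement([c(x) for c in committee]),
                reverse=True)
```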
10 Active learning helps [Banko and Brill, ACL-2001]
11 Active learning helps [Ngai and Yarowsky, NAACL-2000]
12 Active learning worked in all cases that I know of.
13 Bootstrapping with one classifier
1. Input: small annotated corpus + large un-annotated corpus.
2. Train classifier on annotated data.
3. Apply classifier to unlabeled examples.
4. Add to the training corpus the examples that are labeled with high confidence.
5. Retrain classifier (and test on held-out data).
6. If improvement, go to 2.
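The same kind of toy setup illustrates bootstrapping (self-training): instead of querying a human, the model's own high-confidence labels are added to the training set. Here "high confidence" means far from the decision boundary, and the 1-D threshold classifier is an illustrative stand-in:

```python
def train(labeled):
    """Toy 1-D classifier: decision threshold halfway between the class means."""
    m0 = sum(x for x, y in labeled if y == 0) / sum(1 for _, y in labeled if y == 0)
    m1 = sum(x for x, y in labeled if y == 1) / sum(1 for _, y in labeled if y == 1)
    return (m0 + m1) / 2.0

labeled = [(0, 0), (9, 1)]        # small annotated corpus
pool = [1, 2, 3, 4, 5, 6, 7, 8]   # large un-annotated corpus

for _ in range(3):
    t = train(labeled)
    # examples far from the boundary are labeled with high confidence
    confident = [x for x in pool if abs(x - t) >= 2.0]
    labeled += [(x, 1 if x > t else 0) for x in confident]  # model's own labels
    pool = [x for x in pool if abs(x - t) < 2.0]
```

Note that the examples near the boundary, the ones active learning would query, are exactly the ones bootstrapping never touches; that is one reason the two methods behave differently.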
14 Bootstrapping with multiple classifiers
1. Input: small annotated corpus + large un-annotated corpus.
2. Train classifiers on annotated data.
3. Apply classifiers to unlabeled examples.
4. Add to the training corpus the examples that are given the same label by all (or most of) the classifiers.
5. Retrain classifiers (and test on held-out data).
6. If improvement, go to 2.
15 Bootstrapping example [Yarowsky, ACL-95]
Extract from a corpus all instances of a polysemous word (7538 instances of plant).
Sense | Training Examples
?     | company said the plant is still operating
?     | Although thousands of plant and animal species
?     | zonal distribution of plant life
?     | to strain microscopic plant life
?     | from the Nissan car and truck plant in Japan
?     | discovered at a St. Louis plant manufacturing
?     | automated manufacturing plant in Fremont
16 Start with a simple classifier and create a seed corpus
Start with a simple classifier: plant life → A; manufacturing plant → B.
82 examples of living plants (1%), 106 examples of manufacturing plants (1%), 7360 residual examples.
Sense | Training Examples
A     | zonal distribution of plant life
A     | to strain microscopic plant life
?     | from the Nissan car and truck plant in Japan
?     | company said the plant is still operating
?     | Although thousands of plant and animal species
B     | discovered at a St. Louis plant manufacturing
B     | automated manufacturing plant in Fremont
17 Seed corpus
18 1. Train supervised classifier on seed corpus
Collocation                       | Sense
plant life                        | A
manufacturing plant               | B
life (within 2-10 words)          | A
manufacturing (within 2-10 words) | B
animal (within 2-10 words)        | A
equipment (within 2-10 words)     | B
19 2. Apply the classifier to the entire data set
20 Rest of the algorithm
3. Optionally, use the one-sense-per-discourse filter and augment the labeled data.
4. Repeat steps 1-3 iteratively.
Evaluation:
- Baseline: 63.9%
- Supervised: 96.1%
- Bootstrapping: 96.5%
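A toy re-implementation of the bootstrapping loop on the slide's seven examples. It is only a sketch: single-word collocations stand in for Yarowsky's decision lists, the seeds are the words "life" and "manufacturing", and there is no log-likelihood ranking or one-sense-per-discourse filter:

```python
from collections import defaultdict

# the training examples from the slides (sense A = living, B = manufacturing)
instances = [
    "company said the plant is still operating",
    "Although thousands of plant and animal species",
    "zonal distribution of plant life",
    "to strain microscopic plant life",
    "from the Nissan car and truck plant in Japan",
    "discovered at a St. Louis plant manufacturing",
    "automated manufacturing plant in Fremont",
]
rules = {"life": "A", "manufacturing": "B"}  # seed collocations

def classify(words, rules):
    for w in words:
        if w in rules:
            return rules[w]
    return None  # residual: no rule fires

for _ in range(2):  # a couple of bootstrapping iterations
    # steps 1-2: label what we can with the current decision list
    labeled = [(s, classify(s.split(), rules)) for s in instances]
    # grow the list: a word that only ever co-occurs with one sense becomes a rule
    seen = defaultdict(set)
    for s, sense in labeled:
        if sense:
            for w in s.split():
                seen[w].add(sense)
    rules = {w: senses.pop() for w, senses in seen.items() if len(senses) == 1}

labels = {s: classify(s.split(), rules) for s in instances}
```

Even this crude version labels the residual examples correctly after two rounds, because words from confidently labeled sentences pull in the sentences that share them.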
21 Bootstrapping does not work in all cases (than vs. then) [Banko and Brill, ACL-2001]
[Figure: test accuracy vs. total training data, comparing the labeled seed corpus alone, the seed plus up to 5 x 10^8 unsupervised examples, and fully supervised training.]
22 Co-training [Blum and Mitchell, COLT-1998]
Hyperlink view: "My advisor Professor Smith"
Page view: "Professor John Smith. I teach computer courses and advise students, including Mary Kae and Bill Blue. I work on the following projects: machine learning for web classification, active learning for NLP, software engineering."
23 Co-training
Input: L, a set of labeled training examples; U, a set of unlabeled examples.
Loop:
- Learn hyperlink-based classifier H from L.
- Learn full-text classifier F from L.
- Allow H to label p positive and n negative examples from U (same class distribution as in L).
- Allow F to label p positive and n negative examples from U.
- Add these self-labeled examples to L.
Why does this work? Examples that are easy to label for classifier X may be hard cases for classifier Y, so Y may learn something new from the examples labeled by X.
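A compact sketch of the loop, assuming a toy "words seen with only one label" classifier per view and a tiny invented course-page data set. Unlike Blum and Mitchell's p/n quotas, it simply adds every example that at least one view is confident about:

```python
def learn(examples):
    """Toy one-view classifier: words seen with only one label vote for it."""
    pos, neg = set(), set()
    for words, y in examples:
        (pos if y else neg).update(words)
    only_pos, only_neg = pos - neg, neg - pos
    return lambda words: len(words & only_pos) - len(words & only_neg)

# each example is (page-text view, hyperlink view); label: is it a course page?
L = [(({"syllabus", "homework"}, {"course"}), True),
     (({"publications", "research"}, {"homepage"}), False)]
U = [({"syllabus", "grading"}, {"students"}),
     ({"teaching"}, {"course", "page"}),
     ({"publications"}, {"lab", "page"})]

for _ in range(2):
    h = learn([(x[0], y) for x, y in L])   # full-text classifier
    f = learn([(x[1], y) for x, y in L])   # hyperlink-based classifier
    newly = [(x, h(x[0]) + f(x[1]) > 0) for x in U
             if h(x[0]) != 0 or f(x[1]) != 0]   # at least one view is confident
    L += newly
    U = [x for x in U if not any(x == z for z, _ in newly)]

def combined(x):
    """Equal-vote combination of the two view classifiers."""
    return h(x[0]) + f(x[1]) > 0
```

Each view labels examples the other view cannot decide, so after retraining both views have learned from words they never saw in the original labeled set.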
24 Example: Error rates for a web classifier
Problem: classify web pages as academic course pages (yes or no).
Data: 16 labeled examples and 800 unlabeled examples taken from one department.
[Table: error rates for the page-based classifier, the hyperlink-based classifier, and the combined classifier (equal votes), under supervised training vs. co-training.]
25 Co-training does not work in all cases [Pierce and Cardie, EMNLP-2001]
Task: identification of base noun phrases based on left and right context words.
26 Unsupervised learning
- Pattern discovery
  - Language modeling (text as a sequence of words)
  - Unsupervised induction of syntactic structure
  - Unsupervised induction of POS taggers and base noun identifiers for non-English languages
- Hidden variables: the EM algorithm
27 Language modeling
as soon as I would like ...
P(w1, w2, w3, ..., wn)
Useful in: speech recognition, machine translation, summarization/generation, and any application in which we produce text.
28 N-gram models
P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w1, w2, w3) ... P(wn | w1, w2, ..., wn-1)
  ≈ P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w2, w3) ... P(wn | wn-2, wn-1)   (trigram approximation)
Estimation: P(c | a, b) = count(a, b, c) / count(a, b)
Smoothing is needed when count(a, b) or count(a, b, c) is 0.
Still the most popular language model: never underestimate the power of n-grams.
Syntax-based language models: [Chelba and Jelinek, ACL-1998], [Charniak, ACL-2001], [Roark, CL-2001].
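The estimation formula is a few lines of Python; the tiny corpus and the add-alpha smoothing choice below are illustrative, not from the slides:

```python
from collections import Counter

tokens = "as soon as i would like as soon as possible".split()
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
vocab = len(set(tokens))

def p_mle(c, a, b):
    """Maximum-likelihood trigram estimate P(c | a, b) = count(a,b,c)/count(a,b)."""
    return trigrams[(a, b, c)] / bigrams[(a, b)]

def p_smoothed(c, a, b, alpha=1.0):
    """Add-alpha smoothing so unseen trigrams keep non-zero probability."""
    return (trigrams[(a, b, c)] + alpha) / (bigrams[(a, b)] + alpha * vocab)
```

The unsmoothed estimate assigns probability 0 to any trigram absent from the corpus, which is why smoothing matters in practice.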
29 Unsupervised induction of syntactic structure [van Zaanen, ICML-2000]
[Harris, 1951, Methods in Structural Linguistics, University of Chicago Press]: two constituents of the same type can be replaced.
IDEA: find in a corpus parts of sentences that can be replaced, and assume that these parts are syntactic constituents.
Example:
Show me (flights from Atlanta to Boston).
Show me (the rates for flight 1943).
(Book Delta 128) from Dallas to Boston.
(Give me all flights) from Dallas to Boston.
30 Algorithm
1. Find overlapping segments in all sentence pairs (string edit distance). Dissimilar parts are considered possible constituents and are assigned unique types (labels: X1, X2, ...).
2. When multiple overlaps occur, use various selection criteria: the first learned constituent is good; the constituent that occurs most often is good.
3. Apply steps 1 and 2 recursively.
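Step 1 can be sketched with a standard edit-distance-style alignment (difflib's SequenceMatcher here): the unequal parts of two sentences that share surrounding context become candidate constituents:

```python
from difflib import SequenceMatcher

def hypothesize(s1, s2):
    """Parts where two sentences differ, within shared context, are
    hypothesized to be constituents of the same type."""
    a, b = s1.split(), s2.split()
    pairs = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":  # the replaceable (dissimilar) part
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs

cands = hypothesize("Show me flights from Atlanta to Boston",
                    "Show me the rates for flight 1943")
```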
31 Evaluation
ATIS corpus: 716 sentences with 11,777 constituents.
Examples:
Corpus:  What is (the name of (the airport in Boston)NP )NP
Learned: What is the (name of the (airport in Boston)C )C
Corpus:  Explain classes ((QW)NP and (QX)NP and (Y)NP )NP
Learned: Explain classes QW and (QX and (Y)C )C
Non-crossing bracket precision:
Non-crossing bracket recall:
Lots of room for improvement: weakening the exact match; large-scale experiments.
32 Induction of POS taggers and base noun identifiers for non-English languages [Yarowsky and Ngai, NAACL-2001]
For many languages, no NLP analyzers exist. Bottleneck: lack of labeled data.
IDEA: use parallel corpora and existing statistical machine translation software/techniques to automatically label non-English texts.
33 Projecting POS tag and base noun-phrase structure across languages
34 Difficulties Statistical MT alignment programs yield relatively low accuracy word alignments. Very often translations are not literal. Mismatch between the annotation needs of two languages (gender in French and English).
35 POS tagger induction
- Run GIZA on a parallel corpus of 2M words.
- Run a POS tagger on the English text.
- Automatically induce tags for the French text.
- Train a probabilistic noisy-channel tagger on the automatically induced French tags.
  - Downweight or exclude from the training data the segments that are likely to be aligned poorly.
  - Train lexical priors P(t | w) and tag sequence models P(t2 | t1) using aggressive generalization techniques (most words have one possible core tag).
- Test performance on held-out data and on out-of-domain manually annotated data (100k words: U. Montreal).
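The projection step itself, transferring tags through a 1-best word alignment, is simple; the alignment, tags, and sentence below are invented for illustration:

```python
def project_tags(alignment, en_tags, fr_len):
    """Transfer POS tags to the French side through a 1-best word alignment.
    alignment: list of (english_index, french_index) pairs.
    Unaligned French words get no tag (None); in the noisy real setting these
    are the positions a noise-robust tagger must downweight or fill in."""
    fr_tags = [None] * fr_len
    for en_i, fr_j in alignment:
        fr_tags[fr_j] = en_tags[en_i]
    return fr_tags

# "the red car" -> "la voiture rouge": the-la, red-rouge, car-voiture
tags = project_tags([(0, 0), (1, 2), (2, 1)], ["DT", "JJ", "NN"], 3)
```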
36 Evaluation
On E-F aligned French data:
- Direct transfer: 76%
- Standard noisy-channel: 86%
- Noise-robust noisy-channel: 96%
- Upper bound (trained on held-out gold standard): 97%
On out-of-domain data:
- Standard noisy-channel: 82%
- Noise-robust noisy-channel: 94%
- Upper bound (trained on held-out gold standard): 98%
37 NP bracketer induction
- Tag and bracket English text [Brill, CL-1999; Ramshaw and Marcus, VLC-1999].
- Induce maximal brackets in French/Chinese.
- Train a transformation-based learning (TBL) bracketer on the French/Chinese data.
- Test performance on a small corpus of held-out sentences (no French or Chinese NP bracketer exists).
38 Evaluation on 50 French sentences
Direct projection, F-measure: Exact %, Acceptable %
TBL bracketer, F-measure: Exact %, Acceptable %
39 Hidden variables: the EM algorithm [Knight, AI Magazine, 1997]
1a. ok-voon ororok sprok. 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. 12b. wat nnat forat arrat vat gat.
40 EM (Expectation-Maximization) [Dempster, Laird, and Rubin; JRSS-B, 1977]
EM is good for solving chicken-and-egg problems. Translation:
- If we knew the word-level alignments in a corpus, we would know how to estimate t(f | e).
- If we knew t(f | e), we would be able to find the word-level alignments in a corpus.
Problem to solve: find the word-level alignments and the translation probabilities given this corpus:
1e: b c   1f: x y
2e: b     2f: y
P(a, f | e) = Π j=1..m t(fj | e_aj)
41 The EM algorithm [Knight, 1999 SMT Tutorial Workbook]
Step 1: Set parameters uniformly.
t(x | b) = 1/2, t(y | b) = 1/2, t(x | c) = 1/2, t(y | c) = 1/2
Step 2: Compute P(a, f | e) for all alignments.
b c / x y, alignment (b-x, c-y): P(a, f | e) = 1/2 * 1/2 = 1/4
b c / x y, alignment (b-y, c-x): P(a, f | e) = 1/2 * 1/2 = 1/4
b / y, alignment (b-y): P(a, f | e) = 1/2
42 The EM algorithm
Step 3: Normalize the P(a, f | e) values to yield P(a | e, f).
b c / x y, alignment (b-x, c-y): P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
b c / x y, alignment (b-y, c-x): P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
b / y, alignment (b-y): P(a | e, f) = (1/2) / (1/2) = 1
Step 4: Collect fractional counts.
tc(x | b) = 1/2, tc(y | b) = 1/2 + 1 = 3/2, tc(x | c) = 1/2, tc(y | c) = 1/2
43 The EM algorithm
Step 5: Normalize the fractional counts to get revised parameter values.
t(x | b) = (1/2) / (3/2 + 1/2) = 1/4, t(y | b) = (3/2) / (3/2 + 1/2) = 3/4, t(x | c) = (1/2) / 1 = 1/2, t(y | c) = (1/2) / 1 = 1/2
Repeat step 2: Compute P(a, f | e) for all alignments.
b c / x y, alignment (b-x, c-y): P(a, f | e) = 1/4 * 1/2 = 1/8
b c / x y, alignment (b-y, c-x): P(a, f | e) = 3/4 * 1/2 = 3/8
b / y, alignment (b-y): P(a, f | e) = 3/4
44 The EM algorithm
Repeat step 3: Normalize the P(a, f | e) values to yield P(a | e, f).
b c / x y, alignment (b-x, c-y): P(a | e, f) = (1/8) / (1/8 + 3/8) = 1/4
b c / x y, alignment (b-y, c-x): P(a | e, f) = (3/8) / (1/8 + 3/8) = 3/4
b / y, alignment (b-y): P(a | e, f) = 1
Repeat step 4: Collect fractional counts.
tc(x | b) = 1/4, tc(y | b) = 3/4 + 1 = 7/4, tc(x | c) = 3/4, tc(y | c) = 1/4
45 The EM algorithm
Repeat step 5: Normalize the fractional counts to get revised parameter values.
t(x | b) = 1/8, t(y | b) = 7/8, t(x | c) = 3/4, t(y | c) = 1/4
Repeat steps 2-5 many times; the parameters converge:
t(x | b) → 0, t(y | b) → 1, t(x | c) → 1, t(y | c) → 0
At each step, the EM algorithm improves P(f | e) for the whole corpus.
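The whole worked example can be checked in code. This sketch enumerates only the one-to-one alignments shown on the slides (not full IBM Model 1 with NULL words); after two iterations it reproduces the slide's parameter values:

```python
from itertools import permutations
from collections import defaultdict

# the two sentence pairs from the slides: (b c, x y) and (b, y)
corpus = [(("b", "c"), ("x", "y")), (("b",), ("y",))]
t = defaultdict(lambda: 0.5)   # step 1: uniform parameters t(f | e)

def em_iteration(t):
    counts = defaultdict(float)
    for e, f in corpus:
        # enumerate the one-to-one alignments, as on the slides
        aligns = list(permutations(range(len(e))))
        # step 2: P(a, f | e) = product over j of t(fj | e_aj)
        probs = []
        for a in aligns:
            p = 1.0
            for j, i in enumerate(a):
                p *= t[(f[j], e[i])]
            probs.append(p)
        total = sum(probs)
        # steps 3-4: normalize to P(a | e, f) and collect fractional counts
        for a, p in zip(aligns, probs):
            for j, i in enumerate(a):
                counts[(f[j], e[i])] += p / total
    # step 5: normalize fractional counts into new parameters
    totals = defaultdict(float)
    for (fw, ew), c in counts.items():
        totals[ew] += c
    return defaultdict(float,
                       {(fw, ew): c / totals[ew] for (fw, ew), c in counts.items()})

for _ in range(2):   # two full EM iterations, as on the slides
    t = em_iteration(t)
```

Running more iterations drives t(y | b) and t(x | c) toward 1, the converged values.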
46 EM allows one to make MLEs under adverse circumstances [Pedersen, EMNLP-2001 EM panel]
MLE (Maximum Likelihood Estimation):
- Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.
- An MLE is a parameter estimate that is most consistent with the sampled data: it maximizes the likelihood of the data, P(X | Θ).
Θ_ML = argmax_Θ L(X | Θ)
47 Trivial example: coin tossing
10 trials: h, t, t, t, h, t, t, h, t, t
One parameter: Θ = P(h). The MLE is 3/10.
Explanation: given 10 tosses, how likely is it to get 3 heads?
L(Θ) = C(10,3) Θ^3 (1 - Θ)^7
Take the derivative of log L(Θ), set it to 0, and solve: Θ = 3/10.
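The same MLE can be checked numerically instead of by differentiation; the grid search below is an illustration, not part of the slide:

```python
from math import comb

def likelihood(theta, heads=3, n=10):
    """L(theta) = C(n, heads) * theta^heads * (1 - theta)^(n - heads)."""
    return comb(n, heads) * theta ** heads * (1 - theta) ** (n - heads)

# maximize over a fine grid of candidate parameter values
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=likelihood)
```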
48 EM: a more complex example
Most often, for multinomial distributions it is not possible to find the MLE using closed-form formulas.
1e: b c   1f: x y
2e: b     2f: y
Parameters: Θ = {t(x | b), t(x | c), t(y | b), t(y | c)}
L(X | Θ) = P(f | e) = Σ_a P(a, f | e)
Maximizing L(X | Θ) has no closed-form solution in this case.
E-step: find the expected values of the complete data, given the incomplete data and the current parameter estimates (steps 2 and 3).
M-step: compute the MLE as usual (steps 4 and 5).
More informationThe Expectation Maximization Algorithm A short tutorial
The Expectation Maximiation Algorithm A short tutorial Sean Borman Comments and corrections to: em-tut at seanborman dot com July 8 2004 Last updated January 09, 2009 Revision history 2009-0-09 Corrected
More information31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
More informationSystematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation
Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation Rabih Zbib, Gretchen Markiewicz, Spyros Matsoukas, Richard Schwartz, John Makhoul Raytheon BBN Technologies
More informationA Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA tafvizi@csail.mit.edu Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA mpacula@csail.mit.edu Abstract
More informationTagging with Hidden Markov Models
Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,
More informationAn Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them
An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them Vangelis Karkaletsis and Constantine D. Spyropoulos NCSR Demokritos, Institute of Informatics & Telecommunications,
More informationMultilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia Sungchul Kim POSTECH Pohang, South Korea subright@postech.ac.kr Abstract In this paper we propose a method to automatically
More informationLanguage Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
More informationCS4025: Pragmatics. Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature
CS4025: Pragmatics Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature For more info: J&M, chap 18,19 in 1 st ed; 21,24 in 2 nd Computing Science, University of
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationUNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics
More information3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work
Unsupervised Paraphrase Acquisition via Relation Discovery Takaaki Hasegawa Cyberspace Laboratories Nippon Telegraph and Telephone Corporation 1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan hasegawa.takaaki@lab.ntt.co.jp
More informationApplications of Deep Learning to the GEOINT mission. June 2015
Applications of Deep Learning to the GEOINT mission June 2015 Overview Motivation Deep Learning Recap GEOINT applications: Imagery exploitation OSINT exploitation Geospatial and activity based analytics
More informationFactored Translation Models
Factored Translation s Philipp Koehn and Hieu Hoang pkoehn@inf.ed.ac.uk, H.Hoang@sms.ed.ac.uk School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom
More informationACCURAT Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation www.accurat-project.eu Project no.
ACCURAT Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation www.accurat-project.eu Project no. 248347 Deliverable D5.4 Report on requirements, implementation
More informationApplications of Named Entity Recognition in Customer Relationship Management Systems
Applications of Named Entity Recognition in Customer Relationship Management Systems Farbod Saraf Jadidian September 2014 Dissertation submitted in partial fulfilment for the degree of Master of Science
More information11-792 Software Engineering EMR Project Report
11-792 Software Engineering EMR Project Report Team Members Phani Gadde Anika Gupta Ting-Hao (Kenneth) Huang Chetan Thayur Suyoun Kim Vision Our aim is to build an intelligent system which is capable of
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More informationSentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015
More informationDiscovering process models from empirical data
Discovering process models from empirical data Laura Măruşter (l.maruster@tm.tue.nl), Ton Weijters (a.j.m.m.weijters@tm.tue.nl) and Wil van der Aalst (w.m.p.aalst@tm.tue.nl) Eindhoven University of Technology,
More informationSYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
More informationBinarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy Wei Wang and Kevin Knight and Daniel Marcu Language Weaver, Inc. 4640 Admiralty Way, Suite 1210 Marina del Rey, CA, 90292 {wwang,kknight,dmarcu}@languageweaver.com
More informationAutomatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines
, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing
More informationHybrid Strategies. for better products and shorter time-to-market
Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,
More informationParsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh mroth@inf.ed.ac.uk Ewan Klein University of Edinburgh ewan@inf.ed.ac.uk Abstract Software
More informationTraining and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationUsing Trace Clustering for Configurable Process Discovery Explained by Event Log Data
Master of Business Information Systems, Department of Mathematics and Computer Science Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data Master Thesis Author: ing. Y.P.J.M.
More informationA Mixed Trigrams Approach for Context Sensitive Spell Checking
A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA dfossa1@uic.edu, bdieugen@cs.uic.edu
More informationWikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
More informationSYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
More informationQuestion Answering and Multilingual CLEF 2008
Dublin City University at QA@CLEF 2008 Sisay Fissaha Adafre Josef van Genabith National Center for Language Technology School of Computing, DCU IBM CAS Dublin sadafre,josef@computing.dcu.ie Abstract We
More information