Machine Learning Natural Language Processing

Similar documents
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

Review Jeopardy. Blue vs. Orange. Review Jeopardy

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

DEVELOPMENT OF NATURAL LANGUAGE INTERFACE TO RELATIONAL DATABASES

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Introduction to Principal Components and FactorAnalysis

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Word Completion and Prediction in Hebrew

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

31 Case Studies: Java Natural Language Tools Available on the Web

Linear Algebra Methods for Data Mining

Latent Semantic Indexing with Selective Query Expansion Abstract Introduction

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

Points of Interference in Learning English as a Second Language

Linear Algebra Review. Vectors

Modeling coherence in ESOL learner texts

Introduction to Pattern Recognition

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

1 o Semestre 2007/2008

Projektgruppe. Categorization of text documents via classification

Why are Organizations Interested?

communication tool: Silvia Biffignandi

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia

Decision Support Systems

2014/02/13 Sphinx Lunch

The Data Mining Process

Exam in course TDT4215 Web Intelligence - Solutions and guidelines -

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

Leveraging ASEAN Economic Community through Language Translation Services

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED

SIMOnt: A Security Information Management Ontology Framework

Movie Classification Using k-means and Hierarchical Clustering

Collaborative Filtering. Radek Pelánek

Supervised Learning Evaluation (via Sentiment Analysis)!

Machine Learning Logistic Regression

Turkish Radiology Dictation System

PoS-tagging Italian texts with CORISTagger

Semantic analysis of text and speech

Specialty Answering Service. All rights reserved.

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 5, Sep-Oct 2015

Text Analytics. A business guide

Search Engines. Stephen Shaw 18th of February, Netsoc

Mining Text Data: An Introduction

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

1 Example of Time Series Analysis by SSA 1

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval

An ontology-based approach for semantic ranking of the web search engines results

Information extraction from texts. Technical and business challenges

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Lecture 5: Singular Value Decomposition SVD (1)

Customizing an English-Korean Machine Translation System for Patent Translation *

CHAPTER 15: IS ARTIFICIAL INTELLIGENCE REAL?

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

Automated Extraction of Security Policies from Natural-Language Software Documents

Technical Report. The KNIME Text Processing Feature:

: Introduction to Machine Learning Dr. Rita Osadchy

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?

A Mixed Trigrams Approach for Context Sensitive Spell Checking

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Author Gender Identification of English Novels

Clustering Connectionist and Statistical Language Processing

Multi language e Discovery Three Critical Steps for Litigating in a Global Economy

SERG. Reconstructing Requirements Traceability in Design and Test Using Latent Semantic Indexing

Understanding Video Lectures in a Flipped Classroom Setting. A Major Qualifying Project Report. Submitted to the Faculty

Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

An Arabic Text-To-Speech System Based on Artificial Neural Networks

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Text Analytics with Ambiverse. Text to Knowledge.

Search and Information Retrieval

A comparative analysis of the language used on labels of Champagne and Sparkling Water bottles.

Natural Language Database Interface for the Community Based Monitoring System *

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Query Recommendation employing Query Logs in Search Optimization

Finding Advertising Keywords on Web Pages. Contextual Ads 101

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

Machine Learning and Statistics: What s the Connection?

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Integrating Public and Private Medical Texts for Patient De-Identification with Apache ctakes

Data Mining Yelp Data - Predicting rating stars from review text

TF-IDF. David Kauchak cs160 Fall 2009 adapted from:

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

Transcription:

Machine Learning Natural Language Processing Jeff Howbert Introduction to Machine Learning Winter 2014 1

IBM s Watson computer Watson is an open-domain question-answering system. In 2011 Watson beat the two highest ranked human players in a nationally televised two-game Jeopardy! match. http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6177717 Jeff Howbert Introduction to Machine Learning Winter 2014 2

Modeling document similarity Vector space models Latent semantic indexing Jeff Howbert Introduction to Machine Learning Winter 2014 3

Vector space models Vector of features for each document Word frequencies (usually weighted) Meta attributes, e.g. title, URL, PageRank Can use vector as document descriptor for classification tasks Can measure document relatedness by cosine similarity of vectors Useful for clustering, ranking tasks Jeff Howbert Introduction to Machine Learning Winter 2014 4

Vector space models Term frequency-inverse document frequency (tfidf) Very widely used model Each feature represents a single word (term) Feature value is product of: tf = proportion of counts of that term relative to all terms in document idf = log( total number of documents / number of documents that contain the term) Jeff Howbert Introduction to Machine Learning Winter 2014 5

Vector space models Example of tf-idf vectors for a collection of documents Jeff Howbert Introduction to Machine Learning Winter 2014 6

Vector space models Vector space models cannot detect: Synonymy multiple words with same meaning Polysemy single word with multiple meanings (e.g. play, table) Jeff Howbert Introduction to Machine Learning Winter 2014 7

Latent semantic indexing Aggregate document term vectors into matrix X Rows represent terms Columns represent documents Factorize X using singular value decomposition (SVD) X = T S D T Jeff Howbert Introduction to Machine Learning Winter 2014 8

Latent semantic indexing Factorize X using singular value decomposition (SVD) X = T S D T Columns of T are orthogonal and contain eigenvectors of XX T Columns of D are orthogonal and contain eigenvectors of X T X S is diagonal and contains singular values (analogous to eigenvalues) Rows of T represent terms in a new orthogonal space Rows of D represent documents in a new orthogonal space In the new orthogonal spaces of T and D, the correlations originally present in X are captured in separate dimensions Better exposes relationships among data items Identifies dimensions of greatest variation within data Jeff Howbert Introduction to Machine Learning Winter 2014 9

Latent semantic indexing Dimensional reduction with SVD Select k largest singular values and corresponding columns from T and D X k = T k S k D kt is a reduced rank reconstruction of full X Jeff Howbert Introduction to Machine Learning Winter 2014 10

Latent semantic indexing Dimensional reduction with SVD Reduced rank representations of terms and documents referred to as latent concepts Compare documents in lower dimensional space classification, clustering, matching, ranking Compare terms in lower dimensional space synonymy, polysemy, other cross-term semantic relationships Jeff Howbert Introduction to Machine Learning Winter 2014 11

Unsupervised Latent semantic indexing May not learn a matching score that works well for a task of interest Jeff Howbert Introduction to Machine Learning Winter 2014 12

Major tasks in NLP (Wikipedia) For most tasks on following slides, there are: Well-defined problem setting Large volume of research Standard metric for evaluating the task Standard corpora on which to evaluate Competitions devoted to the specific task http://en.wikipedia.org/wiki/natural_language_processing Jeff Howbert Introduction to Machine Learning Winter 2014 13

Machine translation Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly. Jeff Howbert Introduction to Machine Learning Winter 2014 14

Parse tree (grammatical analysis) The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. For a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). Jeff Howbert Introduction to Machine Learning Winter 2014 15

Part-of-speech tagging Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun or verb, and "out" can be any of at least five different parts of speech. Jeff Howbert Introduction to Machine Learning Winter 2014 16

Speech recognition Given a sound clip of a person or people speaking, determine the textual representation of the speech. Another extremely difficult problem, also regarded as "AIcomplete". In natural speech there are hardly any pauses between successive words, and thus speech segmentation (separation into words) is a necessary subtask of speech recognition. In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. Jeff Howbert Introduction to Machine Learning Winter 2014 17

Speech recognition Hidden Markov model (HMM) for phoneme extraction https://www.assembla.com/code/sonido/subversion/node/blob/7/sphinx4/index.html Jeff Howbert Introduction to Machine Learning Winter 2014 18

Sentiment analysis Extract subjective information, usually from a set of documents like online reviews, to determine "polarity" about specific objects. Especially useful for identifying trends of public opinion in the social media, for the purpose of marketing. Jeff Howbert Introduction to Machine Learning Winter 2014 19

Information extraction (IE) Concerned with extraction of semantic information from text. Named entity recognition Coreference resolution Relationship extraction Word sense disambiguation Automatic summarization etc. Jeff Howbert Introduction to Machine Learning Winter 2014 20

Named entity recognition Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). In English, capitalization can aid in recognizing named entities, but cannot aid in determining the type of named entity, and in any case is often insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words. German capitalizes all nouns. French and Spanish do not capitalize names that serve as adjectives. Many languages (e.g. Chinese or Arabic) do not have capitalization at all. Jeff Howbert Introduction to Machine Learning Winter 2014 21

Coreference resolution Given a chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, concerned with matching up pronouns with the nouns or names that they refer to. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to). Jeff Howbert Introduction to Machine Learning Winter 2014 22

Relationship extraction Given a chunk of text, identify the relationships among named entities (e.g. who is the wife of whom). Jeff Howbert Introduction to Machine Learning Winter 2014 23

Word sense disambiguation Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet. Jeff Howbert Introduction to Machine Learning Winter 2014 24

Automatic summarization Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper. Jeff Howbert Introduction to Machine Learning Winter 2014 25

Natural language generation Convert information from computer databases into readable human language. Jeff Howbert Introduction to Machine Learning Winter 2014 26

Natural language understanding Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural language expression which usually takes the form of organized notations of natural languages concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural languages semantics without confusions with implicit assumptions such as closed world assumption (CWA) vs. open world assumption, or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization. Jeff Howbert Introduction to Machine Learning Winter 2014 27

Optical character recognition (OCR) Given an image representing printed text, determine the corresponding text. Jeff Howbert Introduction to Machine Learning Winter 2014 28

Word segmentation Separate a chunk of continuous text into separate words. For English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese, and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Jeff Howbert Introduction to Machine Learning Winter 2014 29

Speech processing Speech recognition Text-to-speech Jeff Howbert Introduction to Machine Learning Winter 2014 30

Automated essay scoring Jeff Howbert Introduction to Machine Learning Winter 2014 31