A Primer on Text Analytics




From the Perspective of a Statistics Graduate Student
Bradley Turnbull, Statistical Learning Group, North Carolina State University
March 27, 2015

What is Text Analytics?

A set of linguistic, analytic, and predictive techniques to extract structure and meaning from unstructured documents.

"The objective of Text Mining is to exploit information contained in textual documents in various ways, including... discovery of patterns and trends in data, associations among entities, predictive rules, etc." (Grobelnik et al., 2001)

If we get to the root of it... data analytics where some or all of the data is in an unstructured format.

Why Text Analytics?

Most of the world's data is in an unstructured format (80%):
- Email and messages
- Webpages
- Social media (Twitter, Facebook, blogs)
- Surveys, feedback forms
- Scientific literature, books, and legal documents

Data is information, NOT just numbers in structured fields!

Where is it Applied?
- Intelligence and law enforcement
- Life sciences and clinical medicine
- Social media analysis and contextual advertising
- Competitive intelligence
- Product management and marketing
- Public administration and policy
- Recruiting

Major Challenges

Natural human language can be ambiguous, subtle, and full of nuance:
- different words with the same meaning (synonymy)
- same word with different meanings (homonymy)

Context is needed to clarify.

Natural Language Processing is an entire lecture (or even course) unto itself.

I Need to Read that Again...
- Juvenile Court to Try Shooting Defendant
- Kids Make Nutritious Snacks
- Local High School Dropouts Cut in Half
- Obesity Study Looks for Larger Test Group
- Red Tape Holds up New Bridges

The Corporate Flow Chart

Data Collection & Data Cleaning → Data Pre-Processing → Advanced Analytics/Data Analysis → Insights

"I'm getting a PhD so I hopefully don't have to do this step"

My Thoughts...

Data Collection & Data Cleaning → Data Pre-Processing → Advanced Analytics/Data Analysis → Insights

"I'm getting a PhD so I hopefully don't have to do this step"
"Bringing some structure to my life"
"This is why I'm 27 and still in school"

Some Free Open Source Pre-Processing Software Tools

R packages: tm, openNLP, RWeka, SnowballC, koRpus, RTextTools, Rstem, wordnet, qdap
Java software: WEKA, Carrot2
Python software: NLTK, Gensim, CLiPS Pattern
Even more options: GATE, KNIME, RapidMiner

We will of course focus on the R implementation.

Let's Make Some Data

    library(tm)
    mysents <- c("Why didn't the Seahawks run the ball from the one yard line?!",
                 "The Steelers ran the ball from the 2 yard line, but Jerome Bettis fumbled.")
    mycorp <- Corpus(VectorSource(mysents))
    mycorp[[1]]
    <<PlainTextDocument>>
    Why didn't the Seahawks run the ball from the one yard line?!
    mycorp[[2]]
    <<PlainTextDocument>>
    The Steelers ran the ball from the 2 yard line, but Jerome Bettis fumbled.

If the data comes from a csv file:

    mytable <- read.table("somefile.csv", sep = ",")
    mycorp <- Corpus(DataframeSource(mytable))

Major Preprocessing Steps
1. Tokenization and Cleaning
2. Dimension Reduction
3. Frequencies

Data Cleaning Steps
1. Tokenization - convert streams of characters into words
   - Main clue in English is white space
   - Special characters can make things difficult: Dr. O'Malley
2. Text Normalization - make all lowercase
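To see why white space alone only gets you so far, here is a minimal base-R sketch (not from the slides) of a whitespace tokenizer; the example sentence is made up. Punctuation stays glued to its neighboring token:

    # Naive whitespace tokenizer: fine for simple text,
    # but punctuation stays attached ("Dr.", "ball?!")
    sent <- "Why didn't Dr. O'Malley run the ball?!"
    tokens <- unlist(strsplit(sent, "\\s+"))
    tokens
    # [1] "Why"      "didn't"   "Dr."      "O'Malley" "run"      "the"      "ball?!"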

Text Normalization in R

The tm_map function applies the tm package functions across all documents in the corpus, similar to lapply:

    mycorp <- tm_map(mycorp, tolower)
    mycorp[[1]]; mycorp[[2]]
    [1] "why didn't the seahawks run the ball from the one yard line?!"
    [1] "the steelers ran the ball from the 2 yard line, but jerome bettis fumbled."

Data Cleaning Steps
1. Tokenization - convert streams of characters into words
   - Main clue in English is white space
   - Special characters can make things difficult: Dr. O'Malley
2. Text Normalization - make all lowercase
3. Spell Checking
   - Dictionary look-up tables
   - Bayesian spell checker: arg max_{c ∈ C} Pr(w | c) Pr(c)
   - Edit distance - deletion, transposition (swap), alteration, insertion
4. Part of Speech Tagging
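A tiny sketch of those two ingredients (not from the slides), using base R's adist() as the edit distance and made-up word frequencies standing in for the prior Pr(c):

    # Toy spell checker: rank dictionary words by edit distance to the typo,
    # breaking ties with a (made-up) frequency prior, as in arg max Pr(w|c)Pr(c)
    dictionary <- c("the", "ball", "yard", "line", "run", "ran")
    freq <- c(1000, 50, 30, 40, 60, 20)          # hypothetical corpus counts
    correct <- function(w) {
      d <- adist(w, dictionary)                  # Levenshtein edit distances
      cand <- which(d == min(d))                 # closest candidates
      dictionary[cand][which.max(freq[cand])]    # most frequent among them
    }
    correct("balll")   # "ball"
    correct("lnie")    # "line"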

Part of Speech Tagging

Remember diagramming sentences in grade/middle school? Subject, predicate (verb), direct object, adjective, adverb, ...

How hard is it?
- Approx. 89% of English words have only one part of speech
- However, many common words in English are ambiguous

Taggers can be rule-based, use probability models, or a combination of the two - e.g. the Brill tagger, from Eric Brill's 1995 PhD thesis. The process is similar to writing a compiler for a programming language.

Part of Speech Tagging in R

Use the openNLP package:

    library("NLP")       # provides as.String(), annotate(), Annotation()
    library("openNLP")
    tagPOS <- function(x){
      s <- as.String(x)
      word_token_annotator <- Maxent_Word_Token_Annotator()
      a2 <- Annotation(1L, "sentence", 1L, nchar(s))
      a2 <- annotate(s, word_token_annotator, a2)
      a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
      a3w <- a3[a3$type == "word"]
      POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
      paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    }
    mycorp2 <- tm_map(mycorp, tagPOS)
    mycorp2[[1]]; mycorp2[[2]]
    [1] "why/WRB did/VBD n't/RB the/DT seahawks/NNS run/VBP the/DT ball/NN from/IN the/DT one/CD yard/NN line/NN ?/. !/."
    [1] "the/DT steelers/NNS ran/VBD the/DT ball/NN from/IN the/DT 2/CD yard/NN line/NN ,/, but/CC jerome/NN bettis/NN fumbled/VBD ./."

Part of Speech Labels - Penn English Treebank

CC = Coordinating conjunction
CD = Cardinal number
DT = Determiner
EX = Existential there
FW = Foreign word
IN = Preposition or subordinating conjunction
JJ = Adjective
JJR = Adjective, comparative
JJS = Adjective, superlative
LS = List item marker
MD = Modal
NN = Noun, singular or mass
NNS = Noun, plural
NNP = Proper noun, singular
NNPS = Proper noun, plural
PDT = Predeterminer
POS = Possessive ending
PRP = Personal pronoun
PRP$ = Possessive pronoun
RB = Adverb
RBR = Adverb, comparative
RBS = Adverb, superlative
RP = Particle
SYM = Symbol
TO = to
UH = Interjection
VB = Verb, base form
VBD = Verb, past tense
VBG = Verb, gerund or present participle
VBN = Verb, past participle
VBP = Verb, non-3rd person singular present
VBZ = Verb, 3rd person singular present
WDT = Wh-determiner
WP = Wh-pronoun
WP$ = Possessive wh-pronoun
WRB = Wh-adverb

Dimension Reduction
1. Remove Stop Words - words that rarely provide useful information
   - examples: a, and, of, the, to, etc.
2. Remove Punctuation

Remove Stop Words and Punctuation in R

    mycorp[[1]]; mycorp[[2]]
    [1] "Why didn't the Seahawks run the ball from the one yard line?!"
    [1] "The Steelers ran the ball from the 2 yard line, but Jerome Bettis fumbled."

    mycorp <- tm_map(mycorp, removeWords, stopwords("english"))
    mycorp <- tm_map(mycorp, removePunctuation)
    mycorp[[1]]; mycorp[[2]]
    [1] "seahawks run ball one yard line"
    [1] "steelers ran ball 2 yard line jerome bettis fumbled"

Dimension Reduction (cont.)
3. Stemming - unifies variations of the same idea
   - Remove plurals
   - Normalize verb tenses
   - Reduce word to most basic element
   - walking, walks, walked → walk
   - apply, applications, reapplied → apply

Stemming

General approaches:
- Lookup tables
- Suffix-stripping
- Lemmatisation - uses the part of speech of the word
- Probability models (same idea as the spell-checker)

Two popular stemming algorithms:
- Lovins Stemmer (Julie Beth Lovins, 1968)
- Porter Stemmer (Martin Porter, 1980)

Both Lovins and Porter are rule-based suffix-stripping algorithms: Lovins - 3 steps, Porter - 6 steps.

Stemming in R

    stemme <- Corpus(VectorSource(c("walking walks walked",
                                    "apply applied applications reapplied",
                                    "fishing fished fisher")))
    # Porter
    stemme1 <- tm_map(stemme, stemDocument, "english")
    stemme1[[1]]; stemme1[[2]]; stemme1[[3]]
    [1] "walk walk walk"
    [1] "appli appli applic reappli"
    [1] "fish fish fisher"

    # Lovins
    stemme2 <- tm_map(stemme, RWeka::IteratedLovinsStemmer)
    stemme2[[1]]; stemme2[[2]]; stemme2[[3]]
    [1] "walking walks walked"
    [1] "apply applied applications reappl"
    [1] "fishing fished f"

Dimension Reduction (cont.)
4. Apply Thesauri - synonyms
   - Can be a very involved process
   - Subject-specific thesauri are best (may have to build your own)
   - WordNet - large, robust database of the English language compiled by the Princeton University linguistics department (FREE)

Synonyms in R

    replaceWords <- function(object, words, by) {
      pattern <- paste(words, collapse = "|")   # regex alternation over the words
      gsub(pattern, by, object)
    }
    mycorp2 <- tm_map(mycorp, replaceWords,
                      words = c("seahawks", "steelers"), by = "football team")
    mycorp2[[1]]; mycorp2[[2]]
    [1] "football team run ball one yard line"
    [1] "football team ran ball 2 yard line jerome bettis fumbled"

    mycorp <- tm_map(mycorp, replaceWords, words = c("ran"), by = "run")
    mycorp[[1]]; mycorp[[2]]
    [1] "seahawks run ball one yard line"
    [1] "steelers run ball 2 yard line jerome bettis fumbled"

    library("wordnet")
    synonyms("company", pos = "NOUN")
    [1] "caller"         "companionship"  "company"        "fellowship"
    [5] "party"          "ship's company" "society"        "troupe"
    synonyms("data", pos = "NOUN")
    [1] "data"        "information"

Synonyms in R

    # Very robust
    synonyms("run", pos = "VERB")
     [1] "be given"             "black market"         "bleed"
     [4] "break away"           "bunk"                 "campaign"
     [7] "carry"                "consort"              "course"
    [10] "die hard"             "draw"                 "endure"
    [13] "escape"               "execute"              "extend"
    [16] "feed"                 "flow"                 "fly the coop"
    [19] "function"             "go"                   "guide"
    [22] "head for the hills"   "hightail it"          "hunt"
    [25] "hunt down"            "incline"              "ladder"
    [28] "lam"                  "lead"                 "lean"
    [31] "melt"                 "melt down"            "move"
    [34] "operate"              "pass"                 "persist"
    [37] "play"                 "ply"                  "prevail"
    [40] "race"                 "range"                "run"
    [43] "run away"             "run for"              "scarper"
    [46] "scat"                 "take to the woods"    "tend"
    [49] "track down"           "turn tail"            "unravel"
    [52] "work"

Frequencies

Characterizing the subject matter of documents by counting terms/words in and across documents. Going back to the streets...

Frequencies

Term Frequency (TF) - number of times a term occurs in a document
- Assumes that if a term occurs more often, it measures something important
- 2 times as many occurrences → 2 times as important? Common fix: use a log transform, log(1 + TF)

Document Frequency (DF) - number of documents a term occurs in
- Assumes terms that occur in fewer documents are more specific to a document and more descriptive of its content - i.e. rarity → more important and descriptive
- Terms that occur in a lot of documents are common words, not as descriptive
- Is this true? May just reflect synonym variations, regional differences, or personal style

Frequencies

Inverse Document Frequency (IDF)
- DF - smaller is better, so invert it so that bigger is better
- Simply inverting (1/DF) is often too severe, so: IDF = log(1 + 1/DF)

Term Frequency-Inverse Document Frequency (TF-IDF)
- TF-IDF = TF × IDF
- A high frequency of terms that are rare indicates a very important concept
- Often standardize TF for each document: TF / (total number of terms in the document)
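To make the arithmetic concrete, here is a small hand-rolled sketch of these formulas in base R; the document-term counts are made up, and tm's weightTfIdf (shown on a later slide) computes a normalized variant of the same quantity:

    # Toy term-count matrix: 3 documents x 4 terms (counts are made up)
    tf <- matrix(c(2, 0, 1,
                   0, 3, 0,
                   1, 1, 1,
                   0, 0, 4),
                 nrow = 3,
                 dimnames = list(paste0("doc", 1:3),
                                 c("ball", "fumble", "yard", "referee")))
    df  <- colSums(tf > 0)            # in how many documents each term occurs
    idf <- log(1 + 1/df)              # IDF as defined above
    tfn <- tf / rowSums(tf)           # standardize TF by document length
    tfidf <- sweep(tfn, 2, idf, `*`)  # TF-IDF = TF * IDF, term by term
    round(tfidf, 3)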

Multi-Word Frequencies: N-Grams

Count combinations of adjacent words:
- 2-grams (bigrams, digrams) - vice president, not happy, statistics department
- 3-grams (trigrams) - central intelligence agency
- 4-grams - united states of america, laboratory for analytic sciences

Can simply enumerate all n-grams and then keep those which meet a frequency threshold (see the sketch below). More sophisticated methods use probability models and/or allow gaps between words.
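A minimal base-R sketch (not from the slides) of the enumerate-then-threshold idea for bigrams; the sentences and the threshold of 2 are made up. The RWeka tokenizer used with tm appears on a later slide:

    # Enumerate bigrams across toy sentences, then keep the frequent ones
    sents <- c("run the ball", "run the ball again", "drop the ball")
    bigrams <- unlist(lapply(strsplit(sents, "\\s+"), function(w) {
      if (length(w) < 2) return(character(0))
      paste(head(w, -1), tail(w, -1))   # adjacent word pairs
    }))
    counts <- table(bigrams)
    counts[counts >= 2]                 # frequency threshold of 2
    # bigrams
    # run the  the ball
    #       2         3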

Document Term Matrix

The structured design matrix we've been yearning for:
- rows - documents
- columns - words/terms (and n-grams)
- cells populated with TF, TF-IDF, or some custom variation
- often large and sparse
- can concatenate with other structured data

Term Document Matrix - simply the transpose

DTM in R

    mycorp <- tm_map(mycorp, stripWhitespace)
    mycorp <- tm_map(mycorp, PlainTextDocument)
    dtm <- DocumentTermMatrix(mycorp)
    inspect(dtm)
    <<DocumentTermMatrix (documents: 2, terms: 11)>>
    Non-/sparse entries : 15/7
    Sparsity            : 32%
    Maximal term length : 8
    Weighting           : term frequency (tf)

        Terms
    Docs ball bettis fumbled jerome line one run seahawks steelers two yard
       1    1      0       0      0    1   1   1        1        0   0    1
       2    1      1       1      1    1   0   1        0        1   1    1

DTM in R - Minimum Word Frequency

    dtm2 <- DocumentTermMatrix(mycorp,
              control = list(bounds = list(global = c(2, Inf))))
    inspect(dtm2)
    <<DocumentTermMatrix (documents: 2, terms: 4)>>
    Non-/sparse entries : 8/0
    Sparsity            : 0%
    Maximal term length : 4
    Weighting           : term frequency (tf)

        Terms
    Docs ball line run yard
       1    1    1   1    1
       2    1    1   1    1

    # Another option: allow at most 50% sparsity
    dtm3 <- removeSparseTerms(dtm, 0.5)
    dim(dtm3)
    [1] 2 4

Weighted DTM in R

    wdtm <- weightTfIdf(dtm)
    inspect(wdtm)
    <<DocumentTermMatrix (documents: 2, terms: 11)>>
    Non-/sparse entries : 7/15
    Sparsity            : 68%
    Maximal term length : 8
    Weighting           : term frequency - inverse document frequency (normalized) (tf-idf)

        Terms
    Docs ball    bettis   fumbled    jerome line       one run
       1    0 0.0000000 0.0000000 0.0000000    0 0.1666667   0
       2    0 0.1111111 0.1111111 0.1111111    0 0.0000000   0
    Docs  seahawks  steelers       two yard
       1 0.1666667 0.0000000 0.0000000    0
       2 0.0000000 0.1111111 0.1111111    0

N-Grams in R

    BigramTokenizer <- function(x){
      RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
    }
    dtm.bigram <- DocumentTermMatrix(mycorp,
                    control = list(tokenize = BigramTokenizer))
    inspect(dtm.bigram)
    <<DocumentTermMatrix (documents: 2, terms: 22)>>
    Non-/sparse entries : 28/16
    Sparsity            : 36%
    Maximal term length : 14
    Weighting           : term frequency (tf)

        Terms
    Docs ball ball one ball two bettis bettis fumbled fumbled jerome
       1    1        1        0      0              0       0      0
       2    1        0        1      1              1       1      1
    Docs jerome bettis line line jerome one one yard run run ball
       1             0    1           0   1        1   1        1
       2             1    1           1   0        0   1        1
    Docs seahawks seahawks run steelers steelers run two two yard
       1        1            1        0            0   0        0
       2        0            0        1            1   1        1
    Docs yard yard line
       1    1         1
       2    1         1

Training and Test Data

    # Training data
    spam.train <- Corpus(VectorSource(c("i am spam", "buy this item",
      "party this weekend", "want to get coffee", "best product deal ever",
      "get outside this weekend", "beach trip details", "sale of the century",
      "buy buy buy", "bovine steroid pills")))

    # Test data
    spam.test <- Corpus(VectorSource(c("buy more of this item on sale",
      "beach trip room reservation", "buy it all", "deal of the century")))

Training and Test Data

    dtm.train <- DocumentTermMatrix(spam.train, control = list(stopwords = T))
    Terms(dtm.train)
     [1] "beach"   "best"    "bovine"  "buy"     "century" "coffee"
     [7] "deal"    "details" "ever"    "get"     "item"    "outside"
    [13] "party"   "pills"   "product" "sale"    "spam"    "steroid"
    [19] "trip"    "want"    "weekend"

Training and Test Data

    dtm.test <- DocumentTermMatrix(spam.test,
                  control = list(dictionary = Terms(dtm.train)))
    inspect(dtm.test)
    <<DocumentTermMatrix (documents: 4, terms: 21)>>
    Non-/sparse entries : 8/76
    Sparsity            : 90%
    Maximal term length : 7
    Weighting           : term frequency (tf)

        Terms
    Docs beach best bovine buy century coffee deal details ever get
       1     0    0      0   1       0      0    0       0    0   0
       2     1    0      0   0       0      0    0       0    0   0
       3     0    0      0   1       0      0    0       0    0   0
       4     0    0      0   0       1      0    1       0    0   0
    Docs item outside party pills product sale spam steroid trip want weekend
       1    1       0     0     0       0    1    0       0    0    0       0
       2    0       0     0     0       0    0    0       0    1    0       0
       3    0       0     0     0       0    0    0       0    0    0       0
       4    0       0     0     0       0    0    0       0    0    0       0
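With matched train and test matrices in hand, a classifier can be fit. Here is a minimal sketch (not from the slides) using the e1071 package's naiveBayes(); the spam/ham labels for the ten training sentences are made up, and the counts are recoded as No/Yes presence factors, a common preparation for naive Bayes on text:

    library(e1071)   # assumed available; provides naiveBayes()

    # Hypothetical spam/ham labels for the 10 training sentences above
    labels <- factor(c("spam", "spam", "ham", "ham", "spam",
                       "ham", "ham", "spam", "spam", "spam"))

    # Recode term counts as presence/absence factors
    to_yes_no <- function(x) factor(ifelse(x > 0, "Yes", "No"),
                                    levels = c("No", "Yes"))
    train <- as.data.frame(lapply(as.data.frame(as.matrix(dtm.train)), to_yes_no))
    test  <- as.data.frame(lapply(as.data.frame(as.matrix(dtm.test)),  to_yes_no))

    fit <- naiveBayes(train, labels, laplace = 1)
    predict(fit, test)   # predicted class for each test document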

Next Steps

Topic Identification - clustering documents
Document Classification - training a classification model based on some response
Sentiment Analysis - classification with a positive and negative response variable
- Often performed by simply summing the weights of words from a sentiment dictionary (a toy sketch follows)

Related Topics
- Latent Dirichlet Allocation Models (LDA)
- Temporal Theme Analysis
- Latent Semantic Analysis
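As a closing illustration of the dictionary-summing approach to sentiment, a minimal base-R sketch with a made-up weighted lexicon:

    # Toy sentiment dictionary: the word weights are made up for illustration
    lexicon <- c(great = 2, good = 1, fine = 0.5, bad = -1, awful = -2)

    score_sentiment <- function(text) {
      tokens <- unlist(strsplit(tolower(text), "\\s+"))
      sum(lexicon[tokens], na.rm = TRUE)   # sum weights of matched words
    }
    score_sentiment("a great game but an awful ending")  # 2 + (-2) = 0
    score_sentiment("good good fine")                    # 1 + 1 + 0.5 = 2.5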