Text Mining with R
Rob Zinkov
October 19th, 2010
Outline
1 Introduction
2 Readability
3 Summarization
4 Topic Modeling
5 Sentiment Analysis
6 Entity Extraction
7 Demo
Introduction: What is Text Mining?
Text mining is any process or program that turns raw human-written text into structured information.
Introduction: R is very good for this
Introduction: Themes
- R is a great glue language.
- CRAN already has lots of packages that work well together.
Introduction: Caveats
- I use outside libraries more than necessary.
- Many of these algorithms could be written completely in R.
- There are nicer ways to integrate these libraries.
- Text mining is a vast field that can't be covered in 40 minutes.
Readability
- Readability gives us an idea of how difficult a document is to read.
- It also gives a rough measure of the document's quality.
Readability: Flesch-Kincaid readability test
Readability can be roughly measured with the Flesch Reading Ease score:
\[ 206.835 - 1.015 \left( \frac{w}{s} \right) - 84.6 \left( \frac{y}{w} \right) \]
where w = total words, y = total syllables, s = total sentences.
Readability
Score        Notes
90.0-100.0   easily understandable by an average 11-year-old student
60.0-70.0    easily understandable by 13- to 15-year-old students
0.0-30.0     best understood by university graduates
Readability: Flesch-Kincaid Reading Age
\[ (0.39 \times \mathrm{ASL}) + (11.8 \times \mathrm{ASW}) - 15.59 \]
where ASL = average sentence length (words per sentence) and ASW = average syllables per word.
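The two formulas can be turned into a pair of small R functions. This is a minimal sketch, not the code used in the talk: the function names and the sample counts are made up for illustration, and the reading-ease constant is the standard 206.835.

```r
# Flesch Reading Ease: higher scores mean easier text.
flesch_reading_ease <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Flesch-Kincaid formula, with ASL = words / sentences and
# ASW = syllables / words.
flesch_kincaid_grade <- function(words, sentences, syllables) {
  asl <- words / sentences
  asw <- syllables / words
  0.39 * asl + 11.8 * asw - 15.59
}

# Hypothetical counts for a short review:
# 100 words, 5 sentences, 150 syllables.
ease  <- flesch_reading_ease(100, 5, 150)   # 59.635
grade <- flesch_kincaid_grade(100, 5, 150)  # 9.91
```

With these counts the text lands just below the 60.0-70.0 band of the score table.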
Readability
# Run the external Java readability tool on the review text
system(paste("java -jar CmdFlesh.jar", "test_review.txt"))
Readability: Demo!
Readability: Notes
- This algorithm isn't hard to implement.
- Quick trick: count vowel clusters in words to estimate syllables.
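The vowel-cluster trick might look like the following in base R. This is a rough sketch under stated assumptions: the function names are hypothetical, "y" is treated as a vowel, and every word is counted as at least one syllable. It overcounts words with a silent "e" (for example "like"), so treat the result as an estimate only.

```r
# Estimate syllables by counting runs of consecutive vowels in a word.
estimate_syllables <- function(word) {
  clusters <- gregexpr("[aeiouy]+", tolower(word))[[1]]
  # gregexpr returns -1 when nothing matches; count every word as >= 1.
  max(1L, sum(clusters > 0))
}

# Total estimated syllables in a piece of text.
count_syllables <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  sum(vapply(words, estimate_syllables, integer(1)))
}

estimate_syllables("readability")      # 5
count_syllables("Text mining with R")  # 5
```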
Summarization
Summarization is about producing a succinct, distinct distillation of the relevant content in a document.
Summarization
- Most approaches involve selecting the most relevant sentences.
- The simplest technique is to look for sentences that contain popular terms.
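The "popular terms" idea can be sketched in base R as follows. This is an illustration only, not how libots works: the function name, the tiny stopword list, and the scoring rule (sum of document-wide term frequencies per sentence) are all assumptions.

```r
summarize_text <- function(text, n = 1,
                           stopwords = c("the", "a", "an", "of", "is",
                                         "and", "to", "in", "it")) {
  # Split into sentences on terminal punctuation followed by whitespace.
  sentences <- trimws(unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)))
  sentences <- sentences[nzchar(sentences)]

  # Lowercase words of one sentence, minus stopwords.
  tokenize <- function(s) {
    w <- unlist(strsplit(tolower(s), "[^a-z]+"))
    w[nzchar(w) & !(w %in% stopwords)]
  }
  tokens <- lapply(sentences, tokenize)

  # Term frequencies over the whole document.
  freq <- table(unlist(tokens))

  # Score of a sentence = sum of the frequencies of its terms.
  scores <- vapply(tokens, function(w) as.numeric(sum(freq[w])), numeric(1))

  # Keep the n best sentences, in their original order.
  top <- order(scores, decreasing = TRUE)[seq_len(min(n, length(sentences)))]
  sentences[sort(top)]
}

text <- paste("R is a great glue language.",
              "CRAN has many packages.",
              "R packages make text mining in R easy.")
summarize_text(text, n = 1)
```

Because scores are plain sums, longer sentences are favored; dividing by sentence length is one possible adjustment.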
Summarization
I use libots as an example, but there is much better work available.
Summarization: Demo!
Topic Modeling
- Topic modeling is a way to group and categorize documents.
- It is usually an unsupervised approach.
Topic Modeling (continued)
- CRAN includes a package for topic modeling, topicmodels.
- This package implements LDA and CTM.
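Topic Modeling: LDA - Latent Dirichlet Allocation
\[
\theta_{k|j} \sim \mathcal{D}[\alpha], \qquad
\phi_{w|k} \sim \mathcal{D}[\beta], \qquad
z_{ij} \sim \theta_{k|j}, \qquad
x_{ij} \sim \phi_{w|z_{ij}}
\]
Here \mathcal{D}[\cdot] is a Dirichlet distribution, \theta_{k|j} is the proportion of topic k in document j, \phi_{w|k} is the probability of word w under topic k, z_{ij} is the topic of token i in document j, and x_{ij} is the observed word.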
Topic Modeling
\[ n_{jkw} = \#\{\, i : x_{ij} = w,\; z_{ij} = k \,\} \]
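To make the notation concrete, the LDA generative process and the count array n_jkw can be simulated in base R. This is a sketch under stated assumptions, not the topicmodels implementation: the sizes, hyperparameters, and the helper `rdirichlet1` are all made up for illustration, and every document is given the same number of tokens.

```r
set.seed(1)

# One draw from a Dirichlet distribution via normalized gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

K <- 3   # number of topics
W <- 6   # vocabulary size
J <- 4   # number of documents
N <- 20  # tokens per document (fixed here for simplicity)
alpha <- rep(0.5, K)
beta  <- rep(0.5, W)

# phi[k, ] is the word distribution of topic k,
# theta[j, ] is the topic mixture of document j.
phi   <- t(replicate(K, rdirichlet1(beta)))
theta <- t(replicate(J, rdirichlet1(alpha)))

# n[j, k, w] = #{i : x_ij = w, z_ij = k}
n <- array(0L, dim = c(J, K, W))
for (j in seq_len(J)) {
  for (i in seq_len(N)) {
    z <- sample.int(K, 1, prob = theta[j, ])  # topic of token i in doc j
    x <- sample.int(W, 1, prob = phi[z, ])    # word drawn from that topic
    n[j, z, x] <- n[j, z, x] + 1L
  }
}
```

Inference runs in the other direction: only the words x_ij are observed, and the topic assignments behind the counts have to be estimated.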
Topic Modeling: CTM - Correlated Topic Models
\[
\theta_{k|j} \sim \operatorname{LogisticNormal}(\mu, \Sigma), \qquad
\phi_{w|k} \sim \mathcal{D}[\beta], \qquad
z_{ij} \sim \theta_{k|j}, \qquad
x_{ij} \sim \phi_{w|z_{ij}}
\]
The Dirichlet prior on \theta is replaced by a logistic normal, so the covariance \Sigma can capture correlations between topics.
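The only step that differs from LDA is how a document's topic proportions are drawn. A minimal base R sketch of a logistic-normal draw follows; the function name and the values of `mu` and `Sigma` are hypothetical, chosen so that topics 1 and 2 tend to appear together.

```r
set.seed(2)

K  <- 3
mu <- c(0, 0, 0)
# Hypothetical covariance: topics 1 and 2 are positively correlated.
Sigma <- matrix(c(1.0, 0.8, 0.0,
                  0.8, 1.0, 0.0,
                  0.0, 0.0, 1.0), nrow = K, byrow = TRUE)

draw_topic_proportions <- function(mu, Sigma) {
  # eta ~ N(mu, Sigma), using the Cholesky factor of Sigma.
  eta <- mu + drop(t(chol(Sigma)) %*% rnorm(length(mu)))
  # Softmax maps eta onto the probability simplex.
  e <- exp(eta - max(eta))
  e / sum(e)
}

theta_j <- draw_topic_proportions(mu, Sigma)
```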
Topic Modeling: Demo!
Sentiment Analysis
Sentiment analysis is about gauging mood from text.
Sentiment Analysis
Opinion corpora are available at:
- Wiebe's corpora: http://www.cs.pitt.edu/mpqa/
- SentiWordNet: http://sentiwordnet.isti.cnr.it/
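One simple way to use such resources is lexicon lookup: add up the polarity of every known word in a text. The sketch below is an assumption about how that could be done in base R, not the demo from the talk. The toy lexicon is invented for illustration; a real one would be loaded from a resource like those listed above.

```r
# Hypothetical toy lexicon: +1 for positive words, -1 for negative ones.
lexicon <- c(good = 1, great = 1, love = 1, excellent = 1,
             bad = -1, awful = -1, hate = -1, boring = -1)

sentiment_score <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  # Look up only the words that appear in the lexicon and sum their polarity.
  sum(lexicon[words[words %in% names(lexicon)]])
}

sentiment_score("I love this great phone", lexicon)                       # 2
sentiment_score("The plot was boring and the acting was awful", lexicon)  # -2
```

This ignores negation and context ("not good" still scores +1), which is part of why more sophisticated models are used.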
Sentiment Analysis: For more sophistication
- Best solved using a Conditional Random Field.
- This area is still new.
- There are no R libraries for this.
- Entity extraction is needed for more fine-grained sentiment.
Sentiment Analysis: Demo!
Entity Extraction: Named Entity Recognition
The purpose of NER is to extract and label phrases in a sentence, for example:
Bill Clinton arrived at the United Nations Building in Manhattan.
Entity Extraction: Challenges
- People and locations may be referred to in ambiguous ways.
- An entity may never have been seen before.
- An entity may be referred to with pronouns.
- Wikipedia and capitalization heuristics aren't good enough.
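To see why capitalization alone falls short, here is a minimal base R sketch of such a heuristic. The function name and the regular expression are assumptions made for illustration: it simply collects runs of capitalized words.

```r
capitalized_phrases <- function(sentence) {
  # One or more consecutive capitalized words, e.g. "United Nations Building".
  pattern <- "[A-Z][a-z]+(?: [A-Z][a-z]+)*"
  regmatches(sentence, gregexpr(pattern, sentence, perl = TRUE))[[1]]
}

capitalized_phrases(
  "Bill Clinton arrived at the United Nations Building in Manhattan.")
# "Bill Clinton" "United Nations Building" "Manhattan"

capitalized_phrases("The senator flew to Paris.")
# "The" "Paris"
```

The heuristic finds candidate phrases but cannot say whether each one is a person, an organization, or a location. It also picks up sentence-initial words such as "The" and misses pronoun references entirely.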
Entity Extraction
I use the Illinois Named Entity Extractor: http://cogcomp.cs.illinois.edu/page/software_view/4
Entity Extraction: Demo!
Conclusions
- There are lots of interesting things you can do with text mining.
- R is very good at integrating all of them.
Questions?