Text Mining with R Twitter Data Analysis 1



Similar documents
How To Write A Blog Post In R

Data Mining with R. Text Mining. Hugh Murrell

Package wordcloud. R topics documented: February 20, 2015

Classification of Documents using Text Mining Package tm

Introduction to basic Text Mining in R.

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

Using R for Social Media Analytics

Hands-On Data Science with R Text Mining. Graham.Williams@togaware.com. 5th November 2014 DRAFT

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Twitter Data Analysis: Hadoop2 Map Reduce Framework

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

2015 Workshops for Professors

Analytics on Big Data

Sunnie Chung. Cleveland State University

International Journal of Innovative Research in Computer and Communication Engineering

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Distributed Text Mining with tm

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

SURVEY REPORT DATA SCIENCE SOCIETY 2014

Spatio-Temporal Patterns of Passengers Interests at London Tube Stations

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

Why is Internal Audit so Hard?

Better planning and forecasting with IBM Predictive Analytics

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Big Data and Data Science: Behind the Buzz Words

Course Title: Advanced Topics in Quantitative Methods: Educational Data Science Practicum

A Statistical Text Mining Method for Patent Analysis

How To Understand Data Mining In R And Rattle

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

Our Raison d'être. Identify major choice decision points. Leverage Analytical Tools and Techniques to solve problems hindering these decision points

R-Academy I Knowledge, that matters

Data Science with R. Introducing Data Mining with Rattle and R.

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

ANALYTICS CENTER LEARNING PROGRAM

Tableau's data visualization software is provided through the Tableau for Teaching program.

Data Visualization with R Language

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Introduction. A. Bellaachia Page: 1

Patent Big Data Analysis by R Data Language for Technology Management

Web Document Clustering

What Are They Thinking? With Oracle Application Express and Oracle Data Miner

A Framework of User-Driven Data Analytics in the Cloud for Course Management

Exploring Big Data in Social Networks

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

R / TERR. Ana Costa e SIlva, PhD Senior Data Scientist TIBCO. Copyright TIBCO Software Inc.

Package tagcloud. R topics documented: July 3, 2015

Package tm. July 3, 2015

PDF PREVIEW EMERGING TECHNOLOGIES. Applying Technologies for Social Media Data Analysis

Statistical Feature Selection Techniques for Arabic Text Categorization

IN THE CITY OF NEW YORK Decision Risk and Operations. Advanced Business Analytics Fall 2015

IT services for analyses of various data samples

Distributed Computing and Big Data: Hadoop and MapReduce

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Why include analytics as part of the School of Information Technology curriculum?

Fast Analytics on Big Data with H20

L1: Introduction to Hadoop

DEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you.

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

NVivo 10 for Windows and NVivo for Mac

Investigating Clinical Care Pathways Correlated with Outcomes

How To Analyze Sentiment On A Microsoft Microsoft Twitter Account

Data Mining with Hadoop at TACC

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Advanced analytics at your hands

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Why are Organizations Interested?

CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS

a Data Science Univ. Piraeus [GR]

Big Data Spatial Analytics An Introduction

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

Data are everywhere. IBM projects that every day we generate 2.5

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Data tells stories, and business analytics depends on data. Hearing the stories in the data, however, only happens when you have people with the

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Predictive Analytics Certificate Program

CSC 177 Fall 2014 Team Project Final Report

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Case Study : 3 different hadoop cluster deployments

Statistical text mining using R. Tom Liptrot The Christie Hospital

Data Science and Business Analytics Certificate Data Science and Business Intelligence Certificate

Transcription:

Text Mining with R Twitter Data Analysis 1 Yanchang Zhao http://www.rdatamining.com R and Data Mining Workshop for the Master of Business Analytics course, Deakin University, Melbourne 28 May 2015 1 Presented at AusDM 2014 (QUT, Brisbane) in Nov 2014 and at UJAT (Mexico) in Sept 2014 1 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 2 / 34

Text Mining unstructured text data text categorization text clustering entity extraction sentiment analysis document summarization... 3 / 34

Text mining of Twitter data with R 2 1. extract data from Twitter 2. clean extracted data and build a document-term matrix 3. find frequent words and associations 4. create a word cloud to visualize important words 5. text clustering 6. topic modelling 2 Chapter 10: Text Mining, R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/rdatamining.pdf 4 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 5 / 34

Retrieve Tweets Retrieve recent tweets by @RDataMining ## Option 1: retrieve tweets from Twitter library(twitter) tweets <- usertimeline("rdatamining", n = 3200) ## Option 2: download @RDataMining tweets from RDataMining.com url <- "http://www.rdatamining.com/data/rdmtweets-201306.rdata" download.file(url, destfile = "./data/rdmtweets-201306.rdata") ## load tweets into R load(file = "./data/rdmtweets-201306.rdata") 6 / 34

(n.tweet <- length(tweets)) ## [1] 320 # convert tweets to a data frame tweets.df <- twlisttodf(tweets) dim(tweets.df) ## [1] 320 14 for (i in c(1:2, 320)) { cat(paste0("[", i, "] ")) writelines(strwrap(tweets.df$text[i], 60)) } ## [1] Examples on calling Java code from R http://t.co/yg1aiv... ## [2] Simulating Map-Reduce in R for Big Data Analysis Using ## Flights Data http://t.co/uiah6pgvqv via @rbloggers ## [320] An R Reference Card for Data Mining is now available on ## CRAN. It lists many useful R functions and packages for ## data mining applications. 7 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 8 / 34

library(tm) # build a corpus, and specify the source to be character vectors mycorpus <- Corpus(VectorSource(tweets.df$text)) # convert to lower case # tm v0.6 mycorpus <- tm_map(mycorpus, content_transformer(tolower)) # tm v0.5-10 # mycorpus <- tm_map(mycorpus, tolower) # remove URLs removeurl <- function(x) gsub("http[^[:space:]]*", "", x) # tm v0.6 mycorpus <- tm_map(mycorpus, content_transformer(removeurl)) # tm v0.5-10 # mycorpus <- tm_map(mycorpus, removeurl) 9 / 34

# remove anything other than English letters or space removenumpunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x) mycorpus <- tm_map(mycorpus, content_transformer(removenumpunct)) # remove punctuation # mycorpus <- tm_map(mycorpus, removepunctuation) # remove numbers # mycorpus <- tm_map(mycorpus, removenumbers) # add two extra stop words: "available" and "via" mystopwords <- c(stopwords('english'), "available", "via") # remove "r" and "big" from stopwords mystopwords <- setdiff(mystopwords, c("r", "big")) # remove stopwords from corpus mycorpus <- tm_map(mycorpus, removewords, mystopwords) # remove extra whitespace mycorpus <- tm_map(mycorpus, stripwhitespace) 10 / 34

# keep a copy of corpus to use later as a dictionary for stem completion mycorpuscopy <- mycorpus # stem words mycorpus <- tm_map(mycorpus, stemdocument) # inspect the first 5 documents (tweets) # inspect(mycorpus[1:5]) # The code below is used for to make text fit for paper width for (i in c(1:2, 320)) { cat(paste0("[", i, "] ")) writelines(strwrap(as.character(mycorpus[[i]]), 60)) } ## [1] exampl call java code r ## [2] simul mapreduc r big data analysi use flight data rblogger ## [320] r refer card data mine now cran list mani use r function ## packag data mine applic 11 / 34

# tm v0.5-10 # mycorpus <- tm_map(mycorpus, stemcompletion) # tm v0.6 stemcompletion2 <- function(x, dictionary) { x <- unlist(strsplit(as.character(x), " ")) # Unexpectedly, stemcompletion completes an empty string to # a word in dictionary. Remove empty string to avoid above issue. x <- x[x!= ""] x <- stemcompletion(x, dictionary=dictionary) x <- paste(x, sep="", collapse=" ") PlainTextDocument(stripWhitespace(x)) } mycorpus <- lapply(mycorpus, stemcompletion2, dictionary=mycorpuscopy) mycorpus <- Corpus(VectorSource(myCorpus)) ## [1] example call java code r ## [2] simulating mapreduce r big data analysis use flights data ## rbloggers ## [320] r reference card data miner now cran list use r function ## package data miner application 3 3 http://stackoverflow.com/questions/25206049/ 12 / 34 stemcompletion-is-not-working

# count frequency of "mining" miningcases <- lapply(mycorpuscopy, function(x) { grep(as.character(x), pattern = "\\<mining")} ) sum(unlist(miningcases)) ## [1] 82 # count frequency of "miner" minercases <- lapply(mycorpuscopy, function(x) {grep(as.character(x), pattern = "\\<miner")} ) sum(unlist(minercases)) ## [1] 5 # replace "miner" with "mining" mycorpus <- tm_map(mycorpus, content_transformer(gsub), pattern = "miner", replacement = "mining") 13 / 34

tdm <- TermDocumentMatrix(myCorpus, control = list(wordlengths = c(1, Inf))) tdm ## <<TermDocumentMatrix (terms: 822, documents: 320)>> ## Non-/sparse entries: 2460/260580 ## Sparsity : 99% ## Maximal term length: 27 ## Weighting : term frequency (tf) 14 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 15 / 34

idx <- which(dimnames(tdm)$terms == "r") inspect(tdm[idx + (0:5), 101:110]) ## <<TermDocumentMatrix (terms: 6, documents: 10)>> ## Non-/sparse entries: 4/56 ## Sparsity : 93% ## Maximal term length: 12 ## Weighting : term frequency (tf) ## ## Docs ## Terms 101 102 103 104 105 106 107 108 109 110 ## r 0 1 1 0 0 0 0 0 1 1 ## ramachandran 0 0 0 0 0 0 0 0 0 0 ## random 0 0 0 0 0 0 0 0 0 0 ## ranked 0 0 0 0 0 0 0 0 0 0 ## rann 0 0 0 0 0 0 0 0 0 0 ## rapidmining 0 0 0 0 0 0 0 0 0 0 16 / 34

# inspect frequent words (freq.terms <- findfreqterms(tdm, lowfreq = 15)) ## [1] "analysis" "application" "big" ## [4] "book" "code" "computational" ## [7] "data" "example" "group" ## [10] "introduction" "mining" "network" ## [13] "package" "position" "r" ## [16] "research" "see" "slides" ## [19] "social" "tutorial" "university" ## [22] "use" term.freq <- rowsums(as.matrix(tdm)) term.freq <- subset(term.freq, term.freq >= 15) df <- data.frame(term = names(term.freq), freq = term.freq) 17 / 34

library(ggplot2) ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") + xlab("terms") + ylab("count") + coord_flip() Terms use university tutorial social slides see research position r package network mining introduction group example data computational code book big application analysis 0 50 100 150 Count 18 / 34

# which words are associated with 'r'? findassocs(tdm, "r", 0.2) ## r ## example 0.33 ## code 0.29 # which words are associated with 'mining'? findassocs(tdm, "mining", 0.25) ## mining ## data 0.48 ## mahout 0.30 ## recommendation 0.30 ## sets 0.30 ## supports 0.30 ## frequent 0.27 ## itemset 0.26 19 / 34

library(graph) library(rgraphviz) plot(tdm, term = freq.terms, corthreshold = 0.1, weighting = T) use r group university computational package example tutorial social research mining see code network position data book analysis big application introduction slides 20 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 21 / 34

m <- as.matrix(tdm) # calculate the frequency of words and sort it by frequency word.freq <- sort(rowsums(m), decreasing = T) # colors pal <- brewer.pal(9, "BuGn") pal <- pal[-(1:4)] # plot word cloud library(wordcloud) wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = F, colors = pal) 22 / 34

r miningdata analysis example research package position slides use network university social big tutorial book computational application code introduction see group analytics australia modeling time associate free scientist job lecture online parallel postdoctoral text cluster course learn pdf postdoc rule series talk now program ausdm detection document large open outlier technique tool user visualisations amp chapter due graphical map present rdatamining science software start statistical studies vacancies business call can case center cfp database detailed google join kdnuggets page poll spatial submission technological twitter analyst california card classification fast find follow function get graph interacting list melbourne notes process provided recent reference thanks tried us video web access added canada canberra china cloud conference dataset distributed dmapps event experience fellow forecasting frequent guidance handling ibm industrial informatics knowledge linkedin machine management nd performance prediction published search sentiment short southern state sydney top topic week wwwrdataminingcom youtube advanced also answers april area build charts check comment contain cran create csiro dec draft dynamic easier edited elsevier engineer format hadoop high igraph initial itemset jan lab language may media microsoft mid new opportunity plot postdocresearch project quick rstudio san senior simple singapore snowfall th titled tj track tweet view visit watson website work 23 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 24 / 34

# remove sparse terms tdm2 <- removesparseterms(tdm, sparse = 0.95) m2 <- as.matrix(tdm2) # cluster terms distmatrix <- dist(scale(m2)) fit <- hclust(distmatrix, method = "ward") 25 / 34

plot(fit) rect.hclust(fit, k = 6) # cut tree into 6 clusters Cluster Dendrogram Height 10 20 30 40 50 60 research position university analysis network social slides use big application book computational tutorial example package r data mining 26 / 34

m3 <- t(m2) # transpose the matrix to cluster documents (tweets) set.seed(122) # set a fixed random seed k <- 6 # number of clusters kmeansresult <- kmeans(m3, k) round(kmeansresult$centers, digits = 3) # cluster centers ## analysis application big book computational data example ## 1 0.136 0.076 0.136 0.015 0.061 1.015 0.030 ## 2 0.026 0.154 0.154 0.256 0.026 1.487 0.231 ## 3 0.857 0.000 0.000 0.000 0.000 0.048 0.095 ## 4 0.078 0.026 0.013 0.052 0.052 0.000 0.065 ## 5 0.083 0.024 0.000 0.048 0.107 0.024 0.274 ## 6 0.091 0.061 0.152 0.000 0.000 0.515 0.000 ## mining network package position r research slides social ## 1 0.409 0.000 0.015 0.076 0.197 0.030 0.091 0.000 ## 2 1.128 0.000 0.205 0.000 0.974 0.026 0.051 0.000 ## 3 0.095 0.952 0.095 0.190 0.286 0.048 0.095 0.810 ## 4 0.104 0.013 0.117 0.039 0.000 0.013 0.104 0.013 ## 5 0.107 0.036 0.190 0.000 1.190 0.000 0.167 0.000 ## 6 0.091 0.000 0.000 0.667 0.000 0.970 0.000 0.121 ## tutorial university use ## 1 0.076 0.030 0.015 ## 2 0.000 0.000 0.231 ## 3 0.190 0.048 0.095 27 / 34

for (i in 1:k) { cat(paste("cluster ", i, ": ", sep = "")) s <- sort(kmeansresult$centers[i, ], decreasing = T) cat(names(s)[1:5], "\n") # print the tweets of every cluster # print(tweets[which(kmeansresult cluster==i)]) } ## cluster 1: data mining r analysis big ## cluster 2: data mining r book example ## cluster 3: network analysis social r position ## cluster 4: package mining slides university analysis ## cluster 5: r example package slides use ## cluster 6: research position data university big 28 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 29 / 34

Topic Modelling dtm <- as.documenttermmatrix(tdm) library(topicmodels) lda <- LDA(dtm, k = 8) # find 8 topics (term <- terms(lda, 6)) # first 6 terms of every topic ## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5... ## [1,] "r" "data" "mining" "r" "data"... ## [2,] "example" "introduction" "r" "package" "mining"... ## [3,] "code" "big" "data" "use" "applicat... ## [4,] "mining" "analysis" "text" "group" "r"... ## [5,] "data" "slides" "package" "example" "use"... ## [6,] "rule" "mining" "time" "cluster" "due"... ## Topic 6 Topic 7 Topic 8 ## [1,] "data" "research" "analysis" ## [2,] "r" "position" "r" ## [3,] "job" "data" "network" ## [4,] "mining" "university" "computational" ## [5,] "lecture" "analytics" "tutorial" ## [6,] "university" "scientist" "slides" term <- apply(term, MARGIN = 2, paste, collapse = ", ") 30 / 34

Topic Modelling # first topic identified for every document (tweet) topic <- topics(lda, 1) topics <- data.frame(date=as.idate(tweets.df$created), topic) qplot(date,..count.., data=topics, geom="density", fill=term[topic], position="stack") 0.4 count 0.2 term[topic] analysis, r, network, computational, tutorial, slides data, introduction, big, analysis, slides, mining data, mining, application, r, use, due data, r, job, mining, lecture, university mining, r, data, text, package, time r, example, code, mining, data, rule r, package, use, group, example, cluster research, position, data, university, analytics, scientist 0.0 2011 07 2012 01 2012 07 2013 01 2013 07 31 / 34

Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 32 / 34

Online Resources Chapter 10: Text Mining, in book R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/rdatamining.pdf R Reference Card for Data Mining http://www.rdatamining.com/docs/r-refcard-data-mining.pdf Free online courses and documents http://www.rdatamining.com/resources/ RDataMining Group on LinkedIn (12,000+ members) http://group.rdatamining.com RDataMining on Twitter (2,000+ followers) @RDataMining 33 / 34

The End Thanks! Email: yanchang(at)rdatamining.com 34 / 34