Data Mining with R
Text Mining

Hugh Murrell
reference books

These slides are based on a book by Yanchang Zhao:
R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com

For further background, try Andrew Moore's slides from:
http://www.autonlab.org/tutorials

and, as always, Wikipedia is a useful source of information.
text mining

This lecture presents examples of text mining with R. We extract text from the BBC's webpages on Alistair Cooke's Letters from America. The extracted text is then transformed to build a term-document matrix. Frequent words and associations are found from the matrix. A word cloud is used to present frequently occurring words in the documents. Words and transcripts are clustered to find groups of words and groups of transcripts. In this lecture, transcript and document will be used interchangeably, as will word and term.
text mining packages

Many new packages are introduced in this lecture:

tm [Feinerer, 2012]: provides functions for text mining.
wordcloud [Fellows, 2012]: visualizes results.
fpc [Christian Hennig, 2005]: flexible procedures for clustering.
igraph [Gabor Csardi, 2012]: a library and R package for network analysis.
retrieving text from the BBC website

This work is part of the Rozanne Els PhD project. She has written a script to download transcripts directly from the website

http://www.bbc.co.uk/programmes/b00f6hbp/broadcasts/1947/01

The results are stored in a local directory, ACdatedfiles, on this Apple Mac.
loading the corpus from disc

Now we are in a position to load the transcripts directly from our hard drive and perform corpus cleaning using the tm package.

> library(tm)
> corpus <- Corpus(
+   DirSource("./ACdatedfiles",
+             encoding = "UTF-8"),
+   readerControl = list(language = "en")
+ )
cleaning the corpus

Now we use regular expressions to remove at-tags and urls from the remaining documents.

> # get rid of html tags, write and re-read the cleaned corpus
> pattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
> rmhtml <- function(x)
+   gsub(pattern, "", x, perl = TRUE)
> corpus <- tm_map(corpus,
+   content_transformer(rmhtml))
> writeCorpus(corpus, path = "./ac",
+   filenames = paste("d", seq_along(corpus),
+     ".txt", sep = ""))
> corpus <- Corpus(
+   DirSource("./ac", encoding = "UTF-8"),
+   readerControl = list(language = "en")
+ )
further cleaning

Now we apply text cleaning transformations:

> # make each letter lowercase, remove white space,
> # remove punctuation and remove generic and custom stopwords
> corpus <- tm_map(corpus,
+   content_transformer(tolower))
> corpus <- tm_map(corpus, stripWhitespace)
> corpus <- tm_map(corpus, removePunctuation)
> my_stopwords <- c(stopwords('english'),
+   c('dont','didnt','arent','cant','one','two','three'))
> corpus <- tm_map(corpus, removeWords,
+   my_stopwords)
stemming words

In many applications, words need to be stemmed to retrieve their radicals, so that the various forms derived from a stem are counted as the same word when computing word frequencies. For instance, the words update, updated and updating would all be stemmed to updat. Stemming is sometimes counterproductive, so I chose not to do it here.

> # to carry out stemming:
> # corpus <- tm_map(corpus, stemDocument,
> #   language = "english")
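The stemmer behind tm's stemDocument comes from the SnowballC package, so the update example above can be checked directly. A minimal sketch (assuming SnowballC is installed):

```r
# stem the three variants with the Porter stemmer from SnowballC
library(SnowballC)
wordStem(c("update", "updated", "updating"), language = "english")
# all three reduce to the same radical, "updat"
```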
building a term-document matrix

A term-document matrix represents the relationship between terms and documents: each row stands for a term, each column for a document, and each entry is the number of occurrences of that term in that document.

> (tdm <- TermDocumentMatrix(corpus))
<<TermDocumentMatrix (terms: 54690, documents: 911)>>
Non-/sparse entries: 545261/49277329
Sparsity           : 99%
Maximal term length: 33
Weighting          : term frequency (tf)
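The idea is easiest to see on a toy corpus. This sketch (two made-up documents, not the lecture corpus) builds and prints a tiny term-document matrix:

```r
library(tm)
# two toy documents
docs <- Corpus(VectorSource(c("the cat sat", "the cat ran")))
toy_tdm <- TermDocumentMatrix(docs)
inspect(toy_tdm)
# rows are the terms, columns are the two documents, and each
# entry counts how often a term occurs in a document
```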
frequent terms

Now we can have a look at the popular words in the term-document matrix.

> (tt <- findFreqTerms(tdm, lowfreq=1500))
 [1] "ago"       "american"  "called"
 [4] "came"      "can"       "country"
 [7] "day"       "every"     "first"
[10] "going"     "house"     "just"
[13] "know"      "last"      "like"
[16] "long"      "man"       "many"
[19] "much"      "never"     "new"
[22] "now"       "old"       "people"
[25] "president" "said"      "say"
[28] "states"    "think"     "time"
[31] "united"    "war"       "way"
[34] "well"      "will"      "year"
[37] "years"
frequent terms

Note that the frequent terms are ordered alphabetically, rather than by frequency or popularity. To show the top frequent words visually, we make a barplot of them.

> termFrequency <-
+   rowSums(as.matrix(tdm[tt,]))
> library(ggplot2)
> # note: recent versions of ggplot2 drop qplot's stat argument;
> # there, use ggplot(...) + geom_col() + coord_flip() instead
> qplot(names(termFrequency), termFrequency,
+   geom="bar", stat="identity") +
+   coord_flip()
frequent term bar chart

[Figure: horizontal bar chart of the frequent terms, with term frequencies running from 0 to about 4000.]
wordclouds

We can show the importance of words pictorially with a wordcloud [Fellows, 2012]. In the code below, we first convert the term-document matrix to an ordinary matrix and then calculate word frequencies. After that we use wordcloud to make a pictorial.

> tdmat = as.matrix(tdm)
> # calculate the frequency of words
> v = sort(rowSums(tdmat), decreasing=TRUE)
> d = data.frame(word=names(v), freq=v)
> # generate the wordcloud
> library(wordcloud)
> wordcloud(d$word, d$freq, min.freq=900,
+   random.color=TRUE, colors=rainbow(7))
wordcloud pictorial

[Figure: wordcloud of the corpus, dominated by words such as people, time, years, new, american and war.]
clustering the words

We now try to find clusters of words with hierarchical clustering. Sparse terms are removed first, so that the plot of the clustering is not crowded with words. Then the distances between terms are calculated with dist() after scaling. After that, the terms are clustered with hclust() and the dendrogram is cut into 10 clusters. The agglomeration method is set to ward, which at each step merges the pair of clusters giving the smallest increase in total within-cluster variance. Other options include single linkage, complete linkage, average linkage, median and centroid.
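The effect of the agglomeration method can be explored on toy data. This sketch (random made-up points, not the corpus) clusters the same distance matrix with two methods and compares the resulting 3-cluster memberships:

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 8)   # 8 toy points in 5 dimensions
d <- dist(scale(m))                # scaled Euclidean distances
fit_ward     <- hclust(d, method = "ward.D2")   # Ward's method
fit_complete <- hclust(d, method = "complete")  # complete linkage
# cut each dendrogram into 3 clusters; the memberships can differ
cutree(fit_ward, k = 3)
cutree(fit_complete, k = 3)
```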
word clustering code

> # remove sparse terms
> tdmat <- as.matrix(
+   removeSparseTerms(tdm, sparse=0.3)
+ )
> # compute distances
> distMatrix <- dist(scale(tdmat))
> fit <- hclust(distMatrix, method="ward.D2")
> plot(fit)
word clustering dendrogram

[Figure: cluster dendrogram of the frequent terms (heights roughly 20 to 100), produced by hclust on distMatrix with method "ward.D2".]
clustering documents with k-medoids

We now try k-medoids clustering with the Partitioning Around Medoids (PAM) algorithm. In the following example, we use the function pamk() from package fpc [Hennig, 2010], which calls pam() with the number of clusters estimated by the optimum average silhouette width. Note that because we are now clustering documents rather than words, we must first transpose the term-document matrix to obtain a document-term matrix.
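The model-selection criterion used by pamk() can be illustrated directly. This sketch (toy random data, not the transcripts) fits pam() with two clusters and reads off the average silhouette width, the quantity pamk() maximises over its candidate values of k:

```r
library(cluster)
set.seed(1)
# two well-separated toy groups of 2-d points
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
fit <- pam(x, k = 2, metric = "manhattan")
# average silhouette width of the 2-cluster solution
fit$silinfo$avg.width
```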
k-medoid clustering

> # first select terms corresponding to presidents
> pn <- c("nixon","carter","reagan","clinton",
+         "roosevelt","kennedy",
+         "truman","bush","ford")
> # and transpose the reduced term-document matrix
> dtm <- t(as.matrix(tdm[pn,]))
> # find clusters
> library(fpc)
> library(cluster)
> pamResult <- pamk(dtm, krange=2:6,
+   metric = "manhattan")
> # number of clusters identified
> (k <- pamResult$nc)
[1] 3
generate cluster wordclouds

> layout(matrix(c(1,2,3),1,3))
> for(k in 1:3){
+   cl <- which(pamResult$pamobject$clustering == k)
+   tdmk <- t(dtm[cl,])
+   v = sort(rowSums(tdmk), decreasing=TRUE)
+   d = data.frame(word=names(v), freq=v)
+   # generate the wordcloud
+   wordcloud(d$word, d$freq, min.freq=5,
+     random.color=TRUE, colors="black")
+ }
> layout(matrix(1))
cluster wordclouds

[Figure: three wordclouds, one per cluster, showing which presidents' names dominate each cluster.]
exercises No exercises this week!