Introduction to basic Text Mining in R.




Introduction to basic Text Mining in R. As published in Benchmarks RSS Matters, January 2014: http://web3.unt.edu/benchmarks/issues/2014/01/rss-matters. Jon Starkweather, PhD

Jon Starkweather, PhD
jonathan.starkweather@unt.edu
Consultant, Research and Statistical Support
http://www.unt.edu
http://www.unt.edu/rss

RSS hosts a number of Short Courses. A list of them is available at: http://www.unt.edu/rss/instructional.htm

Those interested in learning more about R, or how to use it, can find information here: http://www.unt.edu/rss/class/jon/r_sc

Introduction to basic Text Mining in R. This month, we turn our attention to text mining. Text mining refers to the process of parsing a selection or corpus of text in order to identify certain aspects, such as the most frequently occurring word or phrase. In this simple example, we will (of course) be using R 1 to collect a sample of text and conduct some rudimentary analysis of it. Keep in mind, this article simply provides a cursory introduction to some text mining functions.

First, we need to retrieve or import some text. We will use the University of North Texas (UNT) policy which governs Research and Statistical Support (RSS) services, UNT policy 3-5, for this example. We can use the readLines function, available in the base package, to retrieve the policy from the UNT Policy web site. Notice this policy's HTML page is 305 lines long, which includes all the HTML formatting, not just the text of the policy.

Next, we need to isolate the actual text of the policy's HTML page. This can take some investigating: using the head and tail functions, or simply pasting the HTML page into a text editor, will allow us to identify the line numbers which contain the actual text of interest. Once identified, we can use the which function to isolate or extract the lines we are interested in parsing. The actual text of the policy exists on lines 192 through 197, prefaced by the "Total University" header on line 189. We use the which function to identify the line (189) containing the header statement, then add 3 to it to arrive at line 192 (id.1, which identifies the first line of the policy). Then, we add a further 5 to that (192 + 5 = 197) to identify the last line of the policy (id.2). Finally, we create a new object (text.data) which contains only those lines holding the text of the policy.

1 http://cran.r-project.org/
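The original article shows this step only as a console screenshot, so the following is a sketch of the equivalent code. The policy URL is a placeholder (the real address appears only in the screenshot), and matching on the "Total University" header with grepl is an assumption about how line 189 was located:

```r
# Read the raw HTML of UNT policy 3-5 (placeholder URL).
policy.url <- "http://policies.unt.edu/policy/3-5"
html.data  <- readLines(policy.url)
length(html.data)   # the article reports 305 lines

# Find the "Total University" header (line 189 in the article), then
# offset to the first (192) and last (197) lines of the policy text.
id.1 <- which(grepl("Total University", html.data, fixed = TRUE)) + 3
id.2 <- id.1 + 5
text.data <- html.data[id.1:id.2]
```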

Now we are left with a vector object (text.data) which contains only the 6 lines of text of the policy (i.e., each paragraph of the policy has become one character string element of the vector). Next, we need to remove the HTML tags (e.g., <p>) from each line of text. Generally, multiple characters can be given in the pattern argument within one call to the gsub function; but here we use two calls so that we can specifically remove each of the HTML paragraph tags while leaving in place all other instances of the letter p. Notice we use the replacement argument to eliminate any instance of the pattern by supplying nothing between the quotation marks of the replacement argument.

Now we can load the tm package (Feinerer & Hornik, 2013) and convert our vector of character strings into a recognizable corpus of text using the VectorSource function and the Corpus function.
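A sketch of the tag removal and corpus creation just described (the exact gsub calls are inferred from the text, since the screenshots are not reproduced here):

```r
# Two gsub calls: remove each paragraph tag specifically, leaving
# every other "p" in the text untouched.
text.data <- gsub("<p>",  "", text.data, fixed = TRUE)
text.data <- gsub("</p>", "", text.data, fixed = TRUE)

# Convert the character vector into a tm corpus.
library(tm)
corpus <- Corpus(VectorSource(text.data))
```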

Next, we make some adjustments to the text: making everything lower case, removing punctuation, removing numbers, and removing common English stop words. The tm_map function allows us to apply transformation functions to a corpus. Next, we perform stemming, which truncates words (e.g., compute, computes, and computing all become comput). To do so, we need to load the SnowballC package (Bouchet-Valat, 2013), which allows us to identify specific stem elements using the tm_map function of the tm package.
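The transformations described in this step might be applied as follows. Note one assumption: in tm versions newer than the one used when the article was written, base functions such as tolower must be wrapped in content_transformer before being passed to tm_map:

```r
library(tm)
library(SnowballC)  # supplies the stemmer used by stemDocument

corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                      # strip digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stemDocument)    # compute / computes / computing -> comput
```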

Next, we remove all the extra white space generated by isolating the word stems in the previous step. We use the stripWhitespace transformation with the tm_map function to accomplish this task.

Now we can actually begin to analyze the text. First, we create something called a Term Document Matrix (TDM), which is a matrix of frequency counts for each word used in the corpus. Below we only show the first 20 words and their frequencies in each document (i.e., for us, each document is a paragraph of the original policy). Next, we can begin to explore the TDM using the findFreqTerms function to find which words were used most. Below we specify that we want term / word stems which were used 8 or more times (across all documents / paragraphs). Then, we can use the findAssocs function to find words which associate together. Here, we specify the TDM to use, the term we want to find associates for, and the lowest acceptable correlation with that term. This returns a vector of terms which are associated with comput at r = 0.60 or higher (correlation) and reports each association in descending order.
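Taken together, the whitespace cleanup and the TDM exploration above might be sketched like this (the cutoffs 8 and 0.60 are those stated in the text; the output shown in the article's screenshots is not reproduced):

```r
corpus <- tm_map(corpus, stripWhitespace)  # collapse gaps left by stemming

tdm <- TermDocumentMatrix(corpus)
inspect(tdm[1:20, ])             # first 20 term stems, per paragraph

findFreqTerms(tdm, lowfreq = 8)  # stems used 8 or more times overall
findAssocs(tdm, "comput", 0.60)  # terms correlated with "comput" at r >= .60
```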

If desired, terms which occur very infrequently (i.e., sparse terms) can be removed, leaving only the common terms. Below, the sparse argument refers to the MAXIMUM sparseness allowed for a term to remain in the returned matrix; in other words, the larger the value, the more terms will be retained (the smaller the value, the fewer, but more common, terms will be retained). We can review the terms returned at a specific sparseness level by using the inspect function on the TDMs produced at those sparseness rates. Below we see the 22 terms returned when sparseness is set to 0.60.
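A sketch of the sparse-term removal, using the two thresholds the article reports:

```r
library(tm)

tdm.60 <- removeSparseTerms(tdm, sparse = 0.60)  # looser threshold: 22 terms retained
inspect(tdm.60)

tdm.20 <- removeSparseTerms(tdm, sparse = 0.20)  # stricter threshold: 5 terms retained
inspect(tdm.20)
```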

Next, we see the 5 terms returned when sparseness is set to 0.20: fewer terms, which occur more frequently (than above).

Conclusions

As stated in the introduction to this article, the above functions provide only a cursory introduction to importing some text and parsing it. Those seeking more information may want to consider taking a look at the Natural Language Processing Task View at CRAN (Wild, 2013; link provided below). The

task view provides information on a number of packages and functions available for processing textual data, including an R-Commander plugin which new R users are likely to find easier to use (at first). For more information on what R can do, please visit the Research and Statistical Support Do-It-Yourself Introduction to R 2 course website. An Adobe .pdf version of this article can be found here 3. Until next time; no twerking while working!

References & Resources

Bouchet-Valat, M. (2013). Package SnowballC. Documentation available at: http://cran.r-project.org/web/packages/snowballc/index.html

Feinerer, I., & Hornik, K. (2013). Package tm. Documentation available at: http://cran.r-project.org/web/packages/tm/index.html

Wild, F. (2013). Natural Language Processing. A CRAN Task View, located at: http://cran.r-project.org/web/views/naturallanguageprocessing.html

Tutorial on text mining (author name not stated on page): http://www.exegetic.biz/blog/2013/09/text-mining-the-complete-works-of-will

This article was last updated on January 6, 2014. This document was created using LaTeX.

2 http://www.unt.edu/rss/class/jon/r_sc/
3 http://www.unt.edu/rss/rssmattersindex.htm