Distributed Text Mining with tm

Distributed Text Mining with tm
Stefan Theußl (1), Ingo Feinerer (2), Kurt Hornik (1)
(1) Institute for Statistics and Mathematics, WU Vienna
(2) Institute of Information Systems, DBAI Group, Technische Universität Wien
17.04.2010

Motivation
For illustration, assume that a party leader is facing an upcoming election. He or she might be interested in tracking the general sentiment about the parties in the target population by analyzing political media coverage, in order to adjust the campaign accordingly. (For the 2004 US presidential election, Scharl and Weichselbraun (2008) developed a web mining tool to analyze about half a million documents in weekly intervals.)
The New York Stock Exchange (NYSE) processes and stores a massive amount of data during trading days.
Many news agencies and publishers, such as Reuters and Bloomberg, provide news and stories partly in structured form through RSS feeds, or grant paying customers access to their news databases.
Text mining can help extract interesting patterns from news available on the Web.

Text Mining

Text Mining
A highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics.
Vast amounts of textual data are available in machine-readable form:
- Content of websites (Google, Yahoo, etc.)
- Scientific articles, abstracts, books, etc. (CiteSeerX project, e-pub repositories, Project Gutenberg, etc.)
- News feeds (Reuters and other news agencies)
- Memos, letters, etc.
- Blogs, forums, mailing lists, Twitter, etc.
Steady increase of text mining methods (both in academia and in industry) over the last decade.

Text Mining in R: The tm Package
Tailored for:
- Plain texts, articles, and papers
- Web documents (XML, SGML, etc.)
- Surveys
Available transformations: stemDocument(), stripWhitespace(), tmTolower(), etc.
Methods for:
- Clustering
- Classification
- Visualization
See Feinerer (2010) and Feinerer et al. (2008).

Text Mining in R
Components of a text mining framework, in particular tm:
- Sources, which abstract input locations (DirSource(), VectorSource(), etc.)
- Readers (readPDF(), readPlain(), readXML(), etc.)
- A (plain text) document contains the contents of the document and its meta data
- A corpus contains one or several documents and corpus-level meta data (abstract class in R)
Pre-constructed corpora are available from http://datacube.wu.ac.at, e.g., Reuters-21578:
install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")

Functions and Methods
Display: print() and summary() convert documents to a format that R can display; additional meta information can be shown via summary().
Length: length() returns the number of documents in the corpus.
Subset: the [[ operator extracts individual documents from a corpus.
Apply: tm_map(), which can conceptually be seen as an lapply(), applies a function to a range of documents.

Example: Handling Corpora in R
> library("tm")
> corpus <- Corpus(DirSource("Data/reuters"),
+                  list(reader = readReut21578XML))
> library("tm.corpus.Reuters21578")
> data(Reuters21578)
> Reuters21578
A corpus with 21578 text documents
> length(Reuters21578)
[1] 21578
> Reuters21578[[3]]
Texas Commerce Bancshares Inc's Texas Commerce Bank-Houston said it filed
an application with the Comptroller of the Currency in an effort to create
the largest banking network in Harris County. The bank said the network
would link 31 banks having 13.5 billion dlrs in assets and 7.5 billion
dlrs in deposits. Reuter

Preprocessing
Stemming: erasing word suffixes to retrieve their radicals
- Reduces complexity almost without loss of information
- Stemmers provided in packages Rstem (Duncan Temple Lang, version 0.3-1 on Omegahat) and Snowball (Kurt Hornik, version 0.0-7 on CRAN; preferred), based on Porter (1980)
- Function stemDocument()
Stopword removal: removing words with a very low entropy
- Based on the base function gsub()
- Function removeWords()
Removal of whitespace (stripWhitespace()) and punctuation (removePunctuation()) works similarly.

Example: Preprocessing
> stemmed <- tm_map(Reuters21578[1:5], stemDocument)
> stemmed[[3]]
Texa Commerc Bancshar Inc Texas Commerc Bank-Houston said it file an applic
with the Comptrol of the Currenc in an effort to creat the largest bank
network in Harri County. The bank said the network would link 31 bank
having 13.5 billion dlrs in asset and 7.5 billion dlrs in deposits. Reuter
> removed <- tm_map(stemmed, removeWords, stopwords("english"))
> removed[[3]]
Texa Commerc Bancshar Inc Texas Commerc Bank-Houston file applic Comptrol
Currenc effort creat largest bank network Harri County. The bank network
link 31 bank 13.5 billion dlrs asset 7.5 billion dlrs deposits. Reuter

Document-Term Matrices
A very common approach in text mining for actual computation on texts is to build a so-called document-term matrix (DTM) holding the frequencies of distinct terms, i.e., the term frequency (TF) tf_{i,j} of each term t_i in each document d_j. Its construction typically involves preprocessing and counting TFs for each document:

    tf_{i,j} = n_{i,j},

where n_{i,j} is the number of occurrences of term t_i in document d_j.
DTMs are stored using a simple sparse (triplet) representation implemented in package slam by Hornik et al. (2010); see the sketch below.
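To make the triplet representation concrete, here is a minimal sketch using slam directly; the documents, terms, and counts are made up for illustration.

library("slam")

## Hypothetical toy DTM: only the non-zero term frequencies are stored
## as (row index, column index, value) triplets.
dtm <- simple_triplet_matrix(
    i = c(1, 2, 2),          # document (row) indices
    j = c(1, 1, 2),          # term (column) indices
    v = c(2, 1, 1),          # term frequencies n_ij
    nrow = 2, ncol = 2,
    dimnames = list(Docs = c("d1", "d2"),
                    Terms = c("bank", "network")))
as.matrix(dtm)               # dense view, for illustration only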

Document-Term Matrices
Raw TF counts are problematic with regard to relevancy within the corpus: e.g., (stemmed) terms like "signatures" occur in almost all documents and thus carry little discriminating information. Typically, the inverse document frequency (IDF) is used to suitably modify the TF weight by a factor that shrinks as the document frequency df_i grows:

    idf_i = log(N / df_i),

where N is the number of documents in the corpus. Combining the TF and IDF weightings yields the term frequency-inverse document frequency (tf-idf):

    tf-idf_{i,j} = tf_{i,j} * idf_i
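The following is a minimal sketch of this weighting applied to a term-frequency DTM in slam's triplet form (documents as rows, terms as columns). It is not tm's implementation (tm provides weightTfIdf(), shown later); it merely makes the formulas explicit.

library("slam")

## tf-idf weighting of a simple_triplet_matrix holding raw TFs.
tfidf <- function(dtm) {
    N   <- nrow(dtm)                           # number of documents
    df  <- tabulate(dtm$j, nbins = ncol(dtm))  # df_i: documents containing term i
    idf <- log(N / df)                         # idf_i = log(N / df_i)
    dtm$v <- dtm$v * idf[dtm$j]                # tf-idf_ij = tf_ij * idf_i
    dtm
}

Note that a term occurring in every document gets idf = 0 and thus drops out entirely, which is exactly the intended down-weighting of uninformative terms.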

Example: Document-Term Matrices
> Reuters21578_DTM <- DocumentTermMatrix(Reuters21578,
+     list(stemming = TRUE, removePunctuation = TRUE))
> data(Reuters21578_DTM)
> Reuters21578_DTM
A document-term matrix (21578 documents, 33090 terms)
Non-/sparse entries: 877918/713138102
Sparsity           : 100%
Maximal term length: 30
Weighting          : term frequency (tf)
> inspect(Reuters21578_DTM[51:54, 51:54])
A document-term matrix (4 documents, 4 terms)
Non-/sparse entries: 0/16
Sparsity           : 100%
Maximal term length: 10
Weighting          : term frequency (tf)
      Terms
Docs   abdul abdulaziz abdulhadi abdulkadir
  51       0         0         0          0
  52       0         0         0          0
  53       0         0         0          0
  54       0         0         0          0

Challenges
- Data volumes (corpora) are becoming bigger and bigger
- Many tasks, i.e., we produce output data by processing lots of input data
- Processing large data sets on a single machine is limited by the available main memory (i.e., RAM)
- Text mining methods are becoming more complex and hence more compute-intensive
- We want to make use of many CPUs
- Typically this is not easy (parallelization, synchronization, I/O, debugging, etc.)
- Need for an integrated framework, preferably usable on large-scale distributed systems
Main motivation: large-scale data processing.

Distributed Text Mining in R
Data sets:
- Reuters-21578: one of the most widely used test collections for text categorization research (news from 1987)
- NSF Research Award Abstracts (NSF): 129,000 plain text abstracts describing NSF awards for basic research submitted between 1990 and 2003; divided into three parts
- Reuters Corpus Volume 1 (RCV1): more than 800,000 text documents
- New York Times Annotated Corpus (NYT): more than 1.8 million articles published by the New York Times between 1987-01-01 and 2007-06-19

                 # documents   corpus size [MB]*   size DTM [MB]
Reuters-21578         21,578                 87             16.4
NSF (Part 1)          51,760                236            101.4
RCV1                 806,791              3,800           1130.8
NYT                1,855,658             16,160               NA

* calculated with the Unix tool du

Opportunities
- Distributed computing environments are scalable in terms of the CPUs and memory (disk space and RAM) employed
- Multi-processor environments and large-scale compute clusters/clouds are available at a reasonable price
- Integrated frameworks for parallel/distributed computing are available (e.g., Hadoop)
Thus, parallel/distributed computing is now easier than ever, and R already offers extensions to use such software, e.g., via hive, RHIPE, nws, iterators, multicore, Rmpi, snow, etc.
Employing such systems with the right tools, we can significantly reduce the runtime for processing large data sets.

Distributed Text Mining in R

Distributed Text Mining in R
Difficulties:
- Large data sets
- The corpus is typically loaded into memory
- Operations on all elements of the corpus (so-called transformations)
Strategies:
- Text mining using tm and MapReduce/hive (Stefan Theußl, version 0.1-2)
- Text mining using tm and MPI/snow (Luke Tierney, version 0.3-3)

The MapReduce Programming Model
- Programming model inspired by functional language primitives
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Examples: document clustering, web access log analysis, search index construction, ...
Dean and Ghemawat (2004)
Hadoop (http://hadoop.apache.org/core/), developed by the Apache project, is an open source implementation of MapReduce.
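As a toy illustration of the contract (not Hadoop's API), the word-count example can be written in a few lines of plain R: Map emits <key, value> pairs, the pairs are grouped by key, and Reduce folds each group.

## Map: one document -> a list of <word, 1> pairs
map <- function(doc)
    lapply(strsplit(doc, "\\s+")[[1]],
           function(w) list(key = w, value = 1L))

## Reduce: one key plus all its values -> a single <word, count> pair
reduce <- function(key, values)
    list(key = key, value = sum(unlist(values)))

docs   <- c("to be or not to be", "to see or not to see")
pairs  <- unlist(lapply(docs, map), recursive = FALSE)        # Map phase
groups <- split(lapply(pairs, `[[`, "value"),                 # shuffle/group by key
                vapply(pairs, `[[`, "", "key"))
result <- mapply(reduce, names(groups), groups,
                 SIMPLIFY = FALSE)                            # Reduce phase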

Figure: Conceptual flow of MapReduce. Distributed local data is processed by Map tasks, yielding intermediate data; Reduce tasks combine the partial results into an aggregated result.

The MapReduce Programming Model
A MapReduce implementation like Hadoop typically provides a distributed file system (DFS; Ghemawat et al., 2003):
- Master/worker architecture (namenode/datanodes)
- Data locality: Map tasks are applied to partitioned data and are scheduled so that their input blocks reside on the same machine, so datanodes read input at local disk speed
- Data replication provides fault tolerance: the application does not need to care whether individual nodes fail

Hadoop Streaming
A utility allowing one to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputdir -output outputdir \
    -mapper ./mapper -reducer ./reducer

Data flow: local data is piped into the R Map script via stdin(), its stdout() becomes the intermediate data, which in turn is piped into the R Reduce script via stdin(); that script's stdout() yields the processed data.
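A word-count mapper/reducer pair in R might look as follows. This is a sketch (the file names and the word-count task are chosen for illustration), relying only on the streaming conventions above: line-oriented stdin/stdout, tab-separated <key, value> pairs, and the reducer seeing its input sorted by key.

#!/usr/bin/env Rscript
## mapper.R: emit a <word, 1> pair for every whitespace-separated token.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1L)) > 0L)
    for (word in strsplit(line, "\\s+")[[1]])
        if (nzchar(word)) cat(word, "\t1\n", sep = "")
close(con)

#!/usr/bin/env Rscript
## reducer.R: input arrives sorted by key, so equal words are adjacent
## and can be summed with a running counter.
con <- file("stdin", open = "r")
current <- NULL; count <- 0L
while (length(line <- readLines(con, n = 1L)) > 0L) {
    kv <- strsplit(line, "\t")[[1]]
    if (!identical(kv[1], current)) {
        if (!is.null(current)) cat(current, "\t", count, "\n", sep = "")
        current <- kv[1]; count <- 0L
    }
    count <- count + as.integer(kv[2])
}
if (!is.null(current)) cat(current, "\t", count, "\n", sep = "")
close(con)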

Hadoop InteractiVE (hive)
hive provides:
- An easy-to-use interface to Hadoop (currently, only Hadoop core, http://hadoop.apache.org/core/, is supported)
- High-level functions for handling the Hadoop framework (hive_start(), hive_create(), hive_is_available(), etc.)
- DFS accessor functions in R (DFS_put(), DFS_list(), DFS_cat(), etc.)
- Streaming via Hadoop (hive_stream())
Available on R-Forge in project RHadoop.

Example: Word Count
Data preparation:
> library("hive")
Loading required package: rJava
Loading required package: XML
> hive_start()
> hive_is_available()
[1] TRUE
> DFS_put("~/Data/Reuters/minimal", "/tmp/Reuters")
> DFS_list("/tmp/Reuters")
[1] "reut-00001.xml" "reut-00002.xml" "reut-00003.xml"
[4] "reut-00004.xml" "reut-00005.xml" "reut-00006.xml"
[7] "reut-00007.xml" "reut-00008.xml" "reut-00009.xml"
> head(DFS_read_lines("/tmp/Reuters/reut-00002.xml"))
[1] "<?xml version=\"1.0\"?>"
[2] "<REUTERS TOPICS=\"NO\" LEWISSPLIT=\"TRAIN\" [...]
[3] "<DATE>26-FEB-1987 15:03:27.51</DATE>"
[4] "<TOPICS/>"
[5] "<PLACES>"
[6] "<D>usa</D>"

Distributed Text Mining in R
Solution:
1. Distributed storage
- The data set is copied to the DFS (DistributedCorpus)
- Only meta information about the corpus remains in memory
2. Parallel computation
- Computational operations (Map) on all elements in parallel, following the MapReduce paradigm
- Workhorses: tm_map() and TermDocumentMatrix()
- Processed documents (revisions) can be retrieved on demand
Implemented in a plugin package to tm: tm.plugin.dc.

Distributed Text Mining in R
> library("tm.plugin.dc")
> dc <- DistributedCorpus(DirSource("Data/reuters"),
+                         list(reader = readReut21578XML))
> dc <- as.DistributedCorpus(Reuters21578)
> summary(dc)
A corpus with 21578 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
--- Distributed Corpus ---
Available revisions: 20100417144823
Active revision: 20100417144823
DistributedCorpus: Storage
- Description: Local Disk Storage
- Base directory on storage: /tmp/Rtmpuxx3w7/file5bd062c2
- Current chunk size [bytes]: 10485760
> dc <- tm_map(dc, stemDocument)

Distributed Text Mining in R
> print(object.size(Reuters21578), units = "Mb")
109.5 Mb
> dc
A corpus with 21578 text documents
> dc_storage(dc)
DistributedCorpus: Storage
- Description: Local Disk Storage
- Base directory on storage: /tmp/Rtmpuxx3w7/file5bd062c2
- Current chunk size [bytes]: 10485760
> dc[[3]]
Texas Commerce Bancshares Inc's Texas Commerce Bank-Houston said it filed
an application with the Comptroller of the Currency in an effort to create
the largest banking network in Harris County. The bank said the network
would link 31 banks having 13.5 billion dlrs in assets and 7.5 billion
dlrs in deposits. Reuter
> print(object.size(dc), units = "Mb")
0.6 Mb

Constructing DTMs via MapReduce
- Parallelization of transformations via tm_map()
- Parallelization of DTM construction by appropriate methods
- Via the Hadoop Streaming utility (R interface hive_stream())
- Key/value pairs: docid/tmdoc (document ID, serialized tm document)
- Differs from the MPI/snow approach, where an lapply() gets replaced by a parLapply()

Constructing DTMs via MapReduce
1. Input: <docid, tmdoc>
2. Preprocess (Map): <docid, tmdoc> -> <term, docid, tf>
3. Partial combine (Reduce): <term, docid, tf> -> <term, list(docid, tf)>
4. Collection: <term, list(docid, tf)> -> DTM
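A rough sketch of these steps in R (not the actual tm.plugin.dc internals) could use tm's termFreq() to turn one document into <term, (docid, tf)> pairs, which the Reduce step then groups per term:

library("tm")

## Map: <docid, tmdoc> -> list of <term, c(docid, tf)> pairs
map_dtm <- function(docid, tmdoc) {
    tf <- termFreq(tmdoc)                 # named TF vector for one document
    mapply(function(term, n) list(key = term, value = c(docid, n)),
           names(tf), tf, SIMPLIFY = FALSE)
}

## Reduce: <term, list(docid, tf)> -> one posting list per term; collecting
## all posting lists yields the columns of the DTM.
reduce_dtm <- function(term, values)
    list(term = term, postings = values)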

Distributed Text Mining in R
Infrastructure:
- bignode.q: 4 nodes, each with 2 dual-core Intel Xeon 5140 @ 2.33 GHz and 16 GB RAM
- node.q: 68 nodes, each with 1 Intel Core 2 Duo 6600 @ 2.4 GHz and 4 GB RAM
This is a total of 152 64-bit computation cores and 336 GB of RAM.
MapReduce framework:
- Hadoop 0.20.1 (implements MapReduce + DFS)
- R (2.10.1) with tm (0.5-2) and hive (0.1-2)
- Code implementing DistributedCorpus in tm.plugin.dc

Figure: Runtime benchmark for stemming, stopword removal, and whitespace removal on the full Reuters-21578 data set (above) and on a second data set (below); the panels plot speedup against the number of CPUs (up to 16).

Outlook and Conclusion

Outlook - Computing on Texts
> library("slam")
> cs <- col_sums(Reuters21578_DTM)
> (top20 <- head(sort(cs, decreasing = TRUE), n = 20))
    mln    dlrs  reuter     pct compani    bank billion   share     cts  market
  25513   20528   19973   17013   11396   11170   10240    9627    8847    7869
  price   trade     inc     net   stock    corp    loss    rate    sale    oper
   6944    6865    6695    6070    6050    6005    5719    5406    5138    4699
> DTM_tfidf <- weightTfIdf(Reuters21578_DTM)
> DTM_tfidf
A document-term matrix (21578 documents, 33090 terms)
Non-/sparse entries: 877918/713138102
Sparsity           : 100%
Maximal term length: 30
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
> cs_tfidf <- col_sums(DTM_tfidf)
> cs_tfidf[names(top20)]
      mln      dlrs    reuter       pct   compani      bank   billion     share
 780.2629  544.7084  103.8984  455.8033  347.4688  355.8040  388.7671  421.8325
      cts    market     price     trade       inc       net     stock      corp
1119.2683  217.8757  235.4840  216.8312  341.8051  572.6022  290.0943  292.2116
     loss      rate      sale      oper
 552.9568  197.7876  320.2611  253.8356

Outlook - Sentiment Analysis
- Compute sentiment scores based on, e.g., daily news
- Use (normalized) tf-idf scores
- Currently, General Inquirer tag categories are employed (provided in package tm.plugin.tags)
- Construct time series of tagged documents
> library("tm.plugin.tags")
> head(tm_get_tags("positiv", collection = "general_inquirer"))
[1] "abide"     "ability"   "able"      "abound"    "absolve"   "absorbent"
- Compare with a time series of interest (e.g., of a financial instrument)
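A minimal sketch of such a scoring function, assuming a tf-idf weighted DTM as constructed earlier; the function name and the aggregation by row sums are illustrative choices, not the authors' exact method.

library("tm")
library("slam")
library("tm.plugin.tags")

## Sum, per document, the tf-idf weights of all terms from a tag category.
## NB: if the DTM is stemmed, the tag terms need the same preprocessing.
sentiment_score <- function(dtm_tfidf, category) {
    terms <- intersect(Terms(dtm_tfidf),
                       tm_get_tags(category, collection = "general_inquirer"))
    row_sums(dtm_tfidf[, terms])          # one score per document
}

## e.g., a net sentiment series over documents:
## sentiment_score(DTM_tfidf, "positiv") - sentiment_score(DTM_tfidf, "negativ")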

Figure: Time series of scores (normalized TF) for barrel-crude-oil, sentiment scores (Positiv, Negativ, Rise, Fall, Stay), and the stock price of CBT, over February to October 1987.

Conclusion - tm and Hadoop
- Use of Hadoop, in particular the DFS, enhances the handling of large corpora
- Significant speedup in text mining applications
- Thus, MapReduce has proven to be a useful abstraction: it greatly simplifies distributed computing, developers can focus on the problem, and implementations like Hadoop deal with the messy details
- Different approaches to interfacing Hadoop's infrastructure exist; which to use is language- and use-case dependent

Conclusion - Text Mining in R
The complete text mining infrastructure consists of many components:
- tm, text mining package (0.5-3.2; Feinerer, 2010)
- slam, sparse lightweight arrays and matrices (0.1-11; Hornik et al., 2010)
- tm.plugin.dc, distributed corpus plugin (0.1-2; Theussl and Feinerer, 2010)
- tm.plugin.tags, tag category database (0.0-1; Theussl, 2010)
- hive, Hadoop/MapReduce interface (0.1-5; Theussl and Feinerer, 2010)
Two of them are released on CRAN (tm, slam), two are currently in an advanced development stage on R-Forge in project RHadoop (hive, tm.plugin.dc), and one will be released shortly (tm.plugin.tags).
Eventually, everything will be combined in a news mining package.

Thank You for Your Attention!
Stefan Theußl
Institute for Statistics and Mathematics
Email: Stefan.Theussl@wu.ac.at
URL: http://statmath.wu.ac.at/~theussl
WU Vienna, Augasse 2-6, A-1090 Wien

References
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation, pages 137-150, 2004. URL http://labs.google.com/papers/mapreduce.html.
I. Feinerer. tm: Text Mining Package, 2010. URL http://tm.r-forge.r-project.org/. R package version 0.5-3.
I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.
S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29-43, New York, NY, USA, October 2003. ACM Press. doi: 10.1145/1165389.945450.
K. Hornik, D. Meyer, and C. Buchta. slam: Sparse Lightweight Arrays and Matrices, 2010. URL http://cran.r-project.org/package=slam. R package version 0.1-9.
M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
A. Scharl and A. Weichselbraun. An automated approach to investigating the online media coverage of US presidential elections. Journal of Information Technology and Politics, 5(1):121-132, 2008.