Distributed Text Mining with tm
1 Distributed Text Mining with tm. Stefan Theußl (1), Ingo Feinerer (2), Kurt Hornik (1). (1) Institute for Statistics and Mathematics, WU Vienna; (2) Institute of Information Systems, DBAI Group, Technische Universität Wien
2 Motivation For illustration, assume that a party leader is facing an upcoming election. He or she might be interested in tracking the general sentiment about the parties in the target population by analyzing political media coverage, in order to improve the campaign accordingly (1). The New York Stock Exchange (NYSE) processes and stores a massive amount of data during trading days. Many news agencies and publishers, such as Reuters and Bloomberg, provide news and stories partly in structured form through RSS feeds, or grant paying customers access to their news databases. Text mining can help to extract interesting patterns from news available on the Web. (1) For the 2004 US Presidential Election, Scharl and Weichselbraun (2008) developed a web mining tool to analyze about half a million documents in weekly intervals.
3 Text Mining
4 Text Mining Highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics. A vast amount of textual data is available in machine-readable format: content of websites (Google, Yahoo, etc.); scientific articles, abstracts, books, etc. (CiteSeerX Project, Epub repositories, Gutenberg Project, etc.); news feeds (Reuters and other news agencies); memos, letters, etc.; blogs, forums, mailing lists, Twitter, etc. Steady increase of text mining methods (both in academia and in industry) within the last decade.
5 Text Mining in R tm Package Tailored for: plain texts, articles and papers; web documents (XML, SGML, etc.); surveys. Available transformations: stemDocument(), stripWhitespace(), tmTolower(), etc. Methods for clustering, classification, and visualization. Feinerer (2010) and Feinerer et al. (2008)
6 Text Mining in R Components of a text mining framework, in particular tm: Sources, which abstract input locations (DirSource(), VectorSource(), etc.). Readers (readPDF(), readPlain(), readXML(), etc.). A (PlainText-) Document contains the contents of the document and meta data. A corpus contains one or several documents and corpus-level meta data (abstract class in R). Pre-constructed corpora are available, e.g., Reuters-21578: install.packages("tm.corpus.Reuters21578", repos = "
7 Functions and Methods Display: the print() and summary() methods convert documents to a format that R can display; additional meta information can be shown via summary(). Length: the length() function returns the number of documents in the corpus. Subset: the [[ operator must be implemented so that individual documents can be extracted from a corpus. Apply: the tm_map() function, which can conceptually be seen as an lapply(), implements functionality to apply a function to a range of documents.
8 Example: Handling Corpora in R

> library("tm")
> corpus <- Corpus(DirSource("Data/reuters"), list(reader = readReut21578XML))
> library("tm.corpus.Reuters21578")
> data(Reuters21578)
> Reuters21578
A corpus with 21578 text documents
> length(Reuters21578)
[1] 21578
> Reuters21578[[3]]
Texas Commerce Bancshares Inc's Texas Commerce Bank-Houston said it filed an application with the Comptroller of the Currency in an effort to create the largest banking network in Harris County. The bank said the network would link 31 banks having 13.5 billion dlrs in assets and 7.5 billion dlrs in deposits. Reuter
9 Preprocessing Stemming: erasing word suffixes to retrieve their radicals. Reduces complexity almost without loss of information. Stemmers are provided in packages Rstem (1) and Snowball (2) (preferred), based on Porter (1980); function stemDocument(). Stopword removal: words with a very low entropy; based on the base function gsub(); function removeWords(). Removal of whitespace (stripWhitespace()) and punctuation (removePunctuation()) works similarly. (1) Duncan Temple Lang (version on Omegahat) (2) Kurt Hornik (version on CRAN)
10 Example: Preprocessing

> stemmed <- tm_map(Reuters21578[1:5], stemDocument)
> stemmed[[3]]
Texa Commerc Bancshar Inc Texas Commerc Bank-Houston said it file an applic with the Comptrol of the Currenc in an effort to creat the largest bank network in Harri County. The bank said the network would link 31 bank having 13.5 billion dlrs in asset and 7.5 billion dlrs in deposits. Reuter
> removed <- tm_map(stemmed, removeWords, stopwords("english"))
> removed[[3]]
Texa Commerc Bancshar Inc Texas Commerc Bank-Houston file applic Comptrol Currenc effort creat largest bank network Harri County. The bank network link 31 bank 13.5 billion dlrs asset 7.5 billion dlrs deposits. Reuter
11 Document-Term Matrices A very common approach in text mining for actual computation on texts is to build a so-called document-term matrix (DTM) holding the frequencies of distinct terms, i.e., the term frequency (TF) of each term t_i for each document d_j. Its construction typically involves pre-processing and counting TFs for each document: tf_{i,j} = n_{i,j}, where n_{i,j} is the number of occurrences of term i in document j. DTMs are stored using a simple sparse (triplet) representation implemented in package slam by Hornik et al. (2010).
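The sparse triplet representation can be sketched in a few lines of base R. The toy documents below are invented for illustration; package slam stores DTMs in essentially this (i, j, v) form via simple_triplet_matrix().

```r
# Toy corpus: three already-tokenized documents (invented for illustration).
docs <- list(
  d1 = c("bank", "network", "bank"),
  d2 = c("network", "assets"),
  d3 = c("bank", "assets", "deposits")
)

# Vocabulary: the distinct terms over all documents.
terms <- sort(unique(unlist(docs)))

# Sparse triplet representation of the DTM: each row (i, j, v) records
# that term number j occurs v times in document number i; zero entries
# are simply not stored.
triplets <- do.call(rbind, lapply(seq_along(docs), function(i) {
  tf <- table(docs[[i]])
  data.frame(i = i, j = match(names(tf), terms), v = as.integer(tf))
}))
```

With 21,578 documents and tens of thousands of terms, the dense matrix would be almost entirely zeros, which is why storing only the non-zero triplets pays off.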
12 Document-Term Matrices Counting TFs is problematic regarding relevancy in the corpus: e.g., (stemmed) terms like "signatures" occur in almost all documents in the corpus. Typically, the inverse document frequency (IDF) is used to suitably modify the TF weight by a factor that grows with the document frequency df_i: idf_i = log(N / df_i). Combining the TF and IDF weightings, we get the term frequency - inverse document frequency (tf-idf): tf-idf_{i,j} = tf_{i,j} * idf_i.
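The weighting scheme above can be reproduced on a toy term-frequency matrix in base R (documents and counts are invented for illustration):

```r
# Toy term-frequency matrix: rows = documents, columns = terms.
tf <- rbind(d1 = c(bank = 2, network = 1, assets = 0),
            d2 = c(bank = 1, network = 0, assets = 1),
            d3 = c(bank = 1, network = 0, assets = 1))

N     <- nrow(tf)          # number of documents in the corpus
df    <- colSums(tf > 0)   # document frequency of each term
idf   <- log(N / df)       # idf_i = log(N / df_i)
tfidf <- sweep(tf, 2, idf, "*")   # tf-idf_{i,j} = tf_{i,j} * idf_i
```

Note that "bank" occurs in every document, so its idf, and hence every tf-idf weight in that column, is zero: terms carried by all documents contribute nothing to discriminating between them.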
13 Example: Document-Term Matrices

> Reuters21578_DTM <- DocumentTermMatrix(Reuters21578, list(stemming = TRUE,
+     removePunctuation = TRUE))
> data(Reuters21578_DTM)
> Reuters21578_DTM
A document-term matrix (21578 documents, terms)
Non-/sparse entries: /
Sparsity : 100%
Maximal term length: 30
Weighting : term frequency (tf)
> inspect(Reuters21578_DTM[51:54, 51:54])
A document-term matrix (4 documents, 4 terms)
Non-/sparse entries: 0/16
Sparsity : 100%
Maximal term length: 10
Weighting : term frequency (tf)
Terms
Docs abdul abdulaziz abdulhadi abdulkadir
14 Challenges Data volumes (corpora) become bigger and bigger. Many tasks, i.e., we produce output data by processing lots of input data. Processing large data sets on a single machine is limited by the available main memory (i.e., RAM). Text mining methods are becoming more complex and hence compute intensive. We want to make use of many CPUs, but typically this is not easy (parallelization, synchronization, I/O, debugging, etc.). Need for an integrated framework, preferably usable on large scale distributed systems. Main motivation: large scale data processing.
15 Distributed Text Mining in R Data sets: Reuters-21578: one of the most widely used test collections for text categorization research (news from 1987). NSF Research Award Abstracts (NSF): consists of 129,000 plain text abstracts describing NSF awards for basic research submitted between 1990 and 2003; it is divided into three parts. Reuters Corpus Volume 1 (RCV1): > 800,000 text documents. New York Times Annotated Corpus (NYT): > 1.8 million articles published by the New York Times between 1987 and 2007.

  corpus          # documents   corpus size [MB] (1)   size DTM [MB]
  Reuters-21578        21,578
  NSF (Part 1)         51,...
  RCV1                806,791                  3,...
  NYT               1,855,...                                    NA

(1) calculated with the Unix tool du
16 Opportunities Distributed computing environments are scalable in terms of CPUs and memory (disk space and RAM) employed. Multi-processor environments and large scale compute clusters/clouds are available at a reasonable price. Integrated frameworks for parallel/distributed computing are available (e.g., Hadoop). Thus, parallel/distributed computing is now easier than ever. R already offers extensions to use such software, e.g., via hive, RHIPE, nws, iterators, multicore, Rmpi, snow, etc. Employing such systems with the right tools, we can significantly reduce the runtime for processing large data sets.
17 Distributed Text Mining in R
18 Distributed Text Mining in R Difficulties: large data sets; the corpus is typically loaded into memory; operations run on all elements of the corpus (so-called transformations). Strategies: text mining using tm and MapReduce/hive (1); text mining using tm and MPI/snow (2). (1) Stefan Theußl (version 0.1-2) (2) Luke Tierney (version 0.3-3)
19 The MapReduce Programming Model Programming model inspired by functional language primitives. Automatic parallelization and distribution. Fault tolerance. I/O scheduling. Examples: document clustering, web access log analysis, search index construction, ... (Dean and Ghemawat, 2004). Hadoop, developed by the Apache project, is an open source implementation of MapReduce.
20 The MapReduce Programming Model Figure: Conceptual flow. Distributed data is partitioned into local data chunks; Map tasks turn each chunk into intermediate data; Reduce tasks combine the partial results into the aggregated result.
21 The MapReduce Programming Model A MapReduce implementation like Hadoop typically provides a distributed file system (DFS, Ghemawat et al., 2003): Master/worker architecture (Namenode/Datanodes). Data locality: Map tasks are applied to partitioned data and scheduled so that their input blocks are on the same machine, so Datanodes read input at local disk speed. Data replication leads to fault tolerance: the application does not need to care whether individual nodes fail.
22 Hadoop Streaming A utility allowing one to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputdir -output outputdir \
    -mapper ./mapper -reducer ./reducer

Local data is fed via stdin() to the R map script, whose stdout() yields the intermediate data; this in turn is fed via stdin() to the R reduce script, whose stdout() yields the processed data.
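A word-count streaming job boils down to two R scripts reading stdin() and writing stdout(). The sketch below implements the map and reduce logic as plain functions so it can be run without a Hadoop installation; the function names are illustrative, not part of any package.

```r
# Map logic: tokenize each input line and emit one "<word>\t1" pair
# per token, as a streaming mapper would write to stdout().
map_fun <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  paste(words[nzchar(words)], 1L, sep = "\t")
}

# Reduce logic: sum the counts per key. In a real streaming job,
# Hadoop sorts the pairs by key before they reach the reducer's stdin().
reduce_fun <- function(pairs) {
  parts <- strsplit(pairs, "\t", fixed = TRUE)
  keys  <- vapply(parts, `[[`, "", 1L)
  vals  <- as.integer(vapply(parts, `[[`, "", 2L))
  tapply(vals, keys, sum)
}

counts <- reduce_fun(map_fun(c("the bank said the network", "the bank")))
```

Wrapping these two functions in scripts that read stdin() and write stdout() is exactly what the -mapper and -reducer arguments of the streaming command expect.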
23 Hadoop InteractiVE (hive) hive provides: An easy-to-use interface to Hadoop (currently, only Hadoop core is supported). High-level functions for handling the Hadoop framework (hive_start(), hive_create(), hive_is_available(), etc.). DFS accessor functions in R (DFS_put(), DFS_list(), DFS_cat(), etc.). Streaming via Hadoop (hive_stream()). Available on R-Forge in project RHadoop.
24 Example: Word Count Data preparation:

> library("hive")
Loading required package: rJava
Loading required package: XML
> hive_start()
> hive_is_available()
[1] TRUE
> DFS_put("~/Data/Reuters/minimal", "/tmp/Reuters")
> DFS_list("/tmp/Reuters")
[1] "reut-....xml" "reut-....xml" "reut-....xml"
[4] "reut-....xml" "reut-....xml" "reut-....xml"
[7] "reut-....xml" "reut-....xml" "reut-....xml"
> head(DFS_read_lines("/tmp/Reuters/reut-....xml"))
[1] "<?xml version=\"1.0\"?>"
[2] "<REUTERS TOPICS=\"NO\" LEWISSPLIT=\"TRAIN\" [...]
[3] "<DATE>26-FEB-1987 ...:03:27.51</DATE>"
[4] "<TOPICS/>"
[5] "<PLACES>"
[6] "<D>usa</D>"
25 Distributed Text Mining in R Solution: 1. Distributed storage: the data set is copied to the DFS (DistributedCorpus); only meta information about the corpus remains in memory. 2. Parallel computation: computational operations (Map) run on all elements in parallel, following the MapReduce paradigm; the workhorses are tm_map() and TermDocumentMatrix(). Processed documents (revisions) can be retrieved on demand. Implemented in a plugin package to tm: tm.plugin.dc.
26 Distributed Text Mining in R

> library("tm.plugin.dc")
> dc <- DistributedCorpus(DirSource("Data/reuters"),
+     list(reader = readReut21578XML))
> dc <- as.DistributedCorpus(Reuters21578)
> summary(dc)
A corpus with 21578 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are: create_date creator
Available variables in the data frame are: MetaID
--- Distributed Corpus ---
Available revisions:
Active revision:
DistributedCorpus: Storage
- Description: Local Disk Storage
- Base directory on storage: /tmp/RtmpUxx3w7/file5bd062c2
- Current chunk size [bytes]:
> dc <- tm_map(dc, stemDocument)
27 Distributed Text Mining in R

> print(object.size(Reuters21578), units = "Mb")
Mb
> dc
A corpus with 21578 text documents
> dc_storage(dc)
DistributedCorpus: Storage
- Description: Local Disk Storage
- Base directory on storage: /tmp/RtmpUxx3w7/file5bd062c2
- Current chunk size [bytes]:
> dc[[3]]
Texas Commerce Bancshares Inc's Texas Commerce Bank-Houston said it filed an application with the Comptroller of the Currency in an effort to create the largest banking network in Harris County. The bank said the network would link 31 banks having 13.5 billion dlrs in assets and 7.5 billion dlrs in deposits. Reuter
> print(object.size(dc), units = "Mb")
0.6 Mb
28 Constructing DTMs via MapReduce Parallelization of transformations via tm_map(). Parallelization of DTM construction by appropriate methods, via the Hadoop streaming utility (R interface hive_stream()). Key/value pairs: docid/tmdoc (document ID, serialized tm document). This differs from the MPI/snow approach, where an lapply() gets replaced by a parLapply().
29 Constructing DTMs via MapReduce
1. Input: <docid, tmdoc>
2. Preprocess (Map): <docid, tmdoc> -> <term, docid, tf>
3. Partial combine (Reduce): <term, docid, tf> -> <term, list(docid, tf)>
4. Collection: <term, list(docid, tf)> -> DTM
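The key/value flow of these steps can be simulated in plain R without Hadoop. The helper names below are hypothetical and documents are represented simply as token vectors, but the map and reduce outputs match the scheme above.

```r
# Map step: one <docid, tmdoc> pair in, a set of <term, docid, tf>
# triples out.
dtm_map <- function(docid, tokens) {
  tf <- table(tokens)
  data.frame(term = names(tf), docid = docid, tf = as.integer(tf),
             stringsAsFactors = FALSE)
}

# Reduce step: group the triples by term, yielding
# <term, list(docid, tf)>, i.e. one posting list per term.
dtm_reduce <- function(triples) {
  split(triples[c("docid", "tf")], triples$term)
}

docs <- list(d1 = c("bank", "bank", "network"), d2 = c("bank", "assets"))
triples  <- do.call(rbind, Map(dtm_map, names(docs), docs))
postings <- dtm_reduce(triples)
```

The final collection step only has to pour these posting lists into the sparse triplet DTM, since each entry already carries its row (docid), column (term), and value (tf).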
30 Distributed Text Mining in R Infrastructure: bignode.q: 4 nodes, each with 2 Dual Core Intel XEON 2.33 GHz CPUs and 16 GB RAM. node.q: 68 nodes, each with 1 Intel Core 2 Duo 2.4 GHz CPU and 4 GB RAM. This is a total of 72 64-bit computation nodes and a total of 336 gigabytes of RAM. MapReduce framework: Hadoop (implements MapReduce + DFS). R (2.10.1) with tm (0.5-2) and hive (0.1-2). Code implementing DistributedCorpus in tm.plugin.dc.
31 Benchmark Figure: Runtime in seconds for stemming, stopword removal, and whitespace removal on the full Reuters data set (panels show speedup as a function of the number of CPUs for Stemming, Stopwords, and Whitespace).
32 Outlook and Conclusion
33 Outlook - Computing on Texts

> library("slam")
> cs <- col_sums(Reuters21578_DTM)
> (top20 <- head(sort(cs, decreasing = TRUE), n = 20))
mln dlrs reuter pct compani bank billion share cts market price trade inc net stock corp loss rate sale oper
> DTM_tfidf <- weightTfIdf(Reuters21578_DTM)
> DTM_tfidf
A document-term matrix (21578 documents, terms)
Non-/sparse entries: /
Sparsity : 100%
Maximal term length: 30
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
> cs_tfidf <- col_sums(DTM_tfidf)
> cs_tfidf[names(top20)]
mln dlrs reuter pct compani bank billion share cts market price trade inc net stock corp loss rate sale oper
34 Outlook - Sentiment Analysis Compute sentiment scores based on, e.g., daily news. Use (normalized) TF-IDF scores. Currently, General Inquirer tag categories are employed (provided in package tm.plugin.tags). Construct time series of tagged documents:

> library("tm.plugin.tags")
> head(tm_get_tags("positiv", collection = "general_inquirer"))
[1] "abide" "ability" "able" "abound" "absolve" "absorbent"

Compare with time series of interest (e.g., of a financial instrument).
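A minimal sketch of such a score, with invented word lists standing in for the General Inquirer categories and invented normalized-TF values:

```r
# Hypothetical stand-ins for the General Inquirer "Positiv"/"Negativ"
# categories (the real lists come from tm.plugin.tags).
positiv <- c("gain", "rise", "profit")
negativ <- c("loss", "fall", "risk")

# Toy normalized-TF matrix: rows = one news document per day,
# columns = terms.
tf <- rbind(day1 = c(gain = 0.2, loss = 0.1, oil = 0.7),
            day2 = c(gain = 0.0, loss = 0.4, oil = 0.6))

# Daily sentiment score: weight mass on positive terms minus weight
# mass on negative terms; the resulting vector is a time series that
# can be compared with, e.g., a stock price.
score <- rowSums(tf[, colnames(tf) %in% positiv, drop = FALSE]) -
         rowSums(tf[, colnames(tf) %in% negativ, drop = FALSE])
```

Swapping the toy term-frequency matrix for the (normalized) TF-IDF rows of a real corpus yields one score per document, ready to be aggregated per day.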
35 Outlook - Sentiment Analysis Figure: Time series of scores (normalized TF) for barrel-crude-oil, sentiment scores (Positiv, Negativ, Rise, Fall, Stay), and the stock price of CBT (GSPC close, February through October).
36 Conclusion - tm and Hadoop Use of Hadoop, in particular the DFS, enhances the handling of large corpora and yields a significant speedup in text mining applications. Thus, MapReduce has proven to be a useful abstraction: it greatly simplifies distributed computing and lets developers focus on the problem, while implementations like Hadoop deal with the messy details. Different approaches to facilitating Hadoop's infrastructure are language- and use-case dependent.
37 Conclusion - Text Mining in R The complete text mining infrastructure consists of many components: tm, text mining package (0.5-2, Feinerer, 2010); slam, sparse lightweight arrays and matrices (0.1-11, Hornik et al., 2010); tm.plugin.dc, distributed corpus plugin (0.1-2, Theussl and Feinerer, 2010); tm.plugin.tags, tag category database (0.0-1, Theussl, 2010); hive, Hadoop/MapReduce interface (0.1-5, Theussl and Feinerer, 2010). Two of them are released on CRAN (tm, slam), two are currently in an advanced development stage on R-Forge in project RHadoop (hive, tm.plugin.dc), and one will be released shortly (tm.plugin.tags). Eventually, everything will be combined in a news mining package.
38 Thank You for Your Attention! Stefan Theußl Institute for Statistics and Mathematics URL: WU Vienna Augasse 2 6, A-1090 Wien
39 References

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation, 2004.
I. Feinerer. tm: Text Mining Package, 2010. R package version 0.5-2.
I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.
S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29-43, New York, NY, USA, October 2003. ACM Press.
K. Hornik, D. Meyer, and C. Buchta. slam: Sparse Lightweight Arrays and Matrices, 2010. R package version 0.1-11.
M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
A. Scharl and A. Weichselbraun. An automated approach to investigating the online media coverage of US presidential elections. Journal of Information Technology and Politics, 5(1), 2008.
marlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Storage and Retrieval of Data for Smart City using Hadoop
Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with
MapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
MapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams
2012 International Conference on Computer Technology and Science (ICCTS 2012) IPCSIT vol. XX (2012) (2012) IACSIT Press, Singapore Using Text and Data Mining Techniques to extract Stock Market Sentiment
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
Distributed File Systems
Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)
Natural Language Processing: A Model to Predict a Sequence of Words
Natural Language Processing: A Model to Predict a Sequence of Words Gerald R. Gendron, Jr. Confido Consulting Spot Analytic Chesapeake, VA [email protected]; LinkedIn: jaygendron ABSTRACT With the
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
A Study of Data Management Technology for Handling Big Data
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,
