Crunching Big Data with R And Hadoop!




Strata/Hadoop World NYC 2012, 23 October 2012

Flash drives with tutorial materials are near the door; please start downloading the tutorial materials onto your laptop. There is a PDF with instructions on the flash drives.

Today's Agenda
Part 1: Basics
- Introduction
- What is Hadoop
- What is RHadoop
- Basic R syntax for the uninitiated
- Word count example
Part 2: Scalability
- Scalability tradeoffs and tabular data
- Matrix multiplication example
- Break
- Dealing with Zipfian distribution of data
- Advanced word count example
Part 3: Trending Topics
- Intro to LDA
- Description of problem
- LDA trending topics example
Wrap-Up and Questions

Section 1: Introduction to RHadoop
- Discussion of Hadoop, R, and Map/Reduce
- R for Hadoop users
- RHadoop's take on Map/Reduce
- Word count exercise

What is RHadoop?
- Integration between the R statistical package and Hadoop's Distributed File System and Map/Reduce computation engine
- Moves algorithm execution to data
- Provides access to lots of high-quality statistical libraries
- Speeds work by processing in parallel

Use RHadoop:
- For data exploration
- To run a bunch of tasks in parallel
- To sort data
- To sample data
- To join data

This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

Don't Use RHadoop:
- To implement your next ultra-advanced, high-performance machine learning algorithm based on Map/Reduce

Don't Be Embarrassed to be Parallel (Or Serial)
Overall philosophy:
- Use simple methods first
- Use complicated methods only if simple ones fail
- All examples are reasonably straightforward

The (Big) Data Science Lifecycle
- Sample data: because there's not usually a reason to use the whole data set
- Model data: modeling will help you understand where interesting things are
- Interpret: find trends, iterate
Diagram labels: Sample/Summarize, Model, Interpret

Hadoop and its Components
- Runs over many machines
- Highly scalable architecture
- MapReduce: distributes processing to data
- HDFS: holds your data
Diagram labels: Hadoop consists of Map/Reduce and HDFS; parallel processing moves to data

What is MapReduce?
- Popularized in Google's 2004 paper
- Makes writing distributed algorithms easier
- Removes freedom (synchronization, locking) to achieve safety

MapReduce Introduction
- All data is represented as a key-value pair
- Two major phases:
  - Map: reads the input data and outputs intermediate key-value pairs
  - Reduce: values with the same key are sent to the same reducer and (optionally) summarized

Map Phase
Raw data: "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
The raw data is split across several mappers, and each mapper emits one (word, 1) pair per word:
- Mapper: how 1, much 1, wood 1
- Mapper: would 1, a 1, woodchuck 1
- Mapper: chuck 1, if 1, a 1, ...

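Reduce Phase
Raw data: "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
Mapper output:
- Mapper: how 1, much 1, wood 1
- Mapper: would 1, a 1, woodchuck 1
- Mapper: chuck 1, if 1, a 1, ...
Pairs with the same key are sent to the same reducer, which sums them:
- Reducer: a 2, how 1, woodchuck 2
- Reducer: wood 2, would 1
- Reducer: chuck 2, if 1, much 1, ...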
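
The two phases above can be sketched in a few lines of plain Python. This is only an illustration of the map, group-by-key, and reduce steps on the woodchuck sentence; it is not RHadoop or rmr code, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this sketch.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map phase: emit one (word, 1) pair for every word in the line."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: summarize all values that share a key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Each list element stands in for the input split handled by one mapper.
counts = word_count([
    "How much wood would a woodchuck chuck",
    "if a woodchuck could chuck wood?",
])
print(counts)
```

In a real cluster the grouping step is done by Hadoop itself, and each reducer only sees the keys routed to it.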

A Brief Introduction to R
We'll quickly walk through ~/Examples/Solutions/r-intro.R from your virtual machine.

MapReduce in RHadoop
Please refer to ~/Examples/Solutions/rmr-intro.r on your virtual machine.

Intro to RHadoop and Word Count
- A simple word count solution can be found in ~/Examples/Solutions/
- A partial word count solution is available in ~/Examples/Fill-In/

Recap
RHadoop advantages over Java MapReduce:
- Many fewer lines of code
- Makes use of existing R functions from within Map/Reduce

Section 2: Breaking Things Up
- Computing over tabular data: scalability challenges
- Matrix multiplication exercise
- MapReduce and Zipfian distribution
- Zipfian distribution exercise

Tabular Data
- Ubiquitous:
  - Financial information
  - Medical information
  - Surveys
  - Market reports
  - Anything you put in Excel
- Can get big quickly due to the denormalized nature of tables in analytics

Map/Reduce Techniques for Dealing with Tabular Big Data
- The situations where you need to do this are rare but painful when you arrive at them!
- Why might you do this? Exploratory analysis: the "kitchen sink" method using a covariance matrix

Example: Correlation Matrix for FRED Economic Time Series Data
Figure: correlation matrix, sorted by magnitude horizontally
Cool things:
- Random collection of economic variables
- Groupings are shown in the natural ordering
  - Relatedness to other things
  - Unrelatedness to other things
- We can quickly identify data that is unrelated to much of anything
- We can quickly discover outliers

FRED Data: General Outcomes
- Location is closely related to correlation
- Earnings and the civilian labor force variables are strongly negatively correlated with residence adjustments
- Government employment is correlated with the civilian labor force and with educational and health services employment
- Correlation between heating oil in New York Harbor and jet fuel on the Gulf Coast

FRED Data: General Outcomes (continued)
- The unemployment rate is negatively correlated with hours worked by manufacturing personnel, only in some regions
- Per-capita personal income and foreign travel services are strongly negatively correlated

Tradeoffs for Large Matrix Operations Like Correlation Matrix
- Tradeoffs must be made for different types of operations
- As you break the problem down and increase scalability, you incur more overhead/latency from structure metadata
- We'll use matrix multiplication as an illustrative example of trade-offs
- Map/Reduce actually isn't great at representing matrix multiplication, hence this is primarily illustrative

Representing Tabular Data
Two main ways: row (or column) oriented, and cell oriented.
- Row/column oriented
  - Column (or row) has to fit in memory
  - Often need to transpose data (a Map/Reduce job) to do tasks like multiplication
- Cell oriented
  - Can be used to accomplish any type of operation without need to change orientation
  - Operations are 100% disk based, therefore highly scalable
  - Tends to be slow for denser matrices because of the excess amounts of data involved

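An Illustrative Example: Matrix Multiplication
A is an m × p matrix and B is a p × n matrix, so their product AB is an m × n matrix. In the slide's example p = 4, and the first two cells of the first row of AB are:

$$AB_{1,1} = A_{1,1} B_{1,1} + A_{1,2} B_{2,1} + A_{1,3} B_{3,1} + A_{1,4} B_{4,1}$$

$$AB_{1,2} = A_{1,1} B_{1,2} + A_{1,2} B_{2,2} + A_{1,3} B_{3,2} + A_{1,4} B_{4,2}$$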

Matrix Multiplication: First Approach
- Map phase: for each output cell, emit one key-value pair
  - Key: the output cell, e.g. AB 1,1
  - Value: the matching row of A (A 1,1, A 1,2, A 1,3, A 1,4) and the matching column of B (B 1,1, B 2,1, B 3,1, B 4,1)
- Reduce phase: multiply the row and the column element by element and sum (Σ) the products to produce AB 1,1
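
A minimal Python sketch of this row/column-oriented approach follows. It is an in-memory illustration under the assumption that a whole row of A and a whole column of B fit in one value; it is not the tutorial's R exercise, and the function names are hypothetical.

```python
def map_rows_and_columns(a, b):
    """Map phase: emit ((i, j), (row i of A, column j of B)) per output cell."""
    columns = list(zip(*b))  # transpose B so each column is one tuple
    for i, row in enumerate(a):
        for j, col in enumerate(columns):
            yield (i, j), (row, col)


def reduce_dot(key, value):
    """Reduce phase: multiply row and column element-wise and sum."""
    row, col = value
    return key, sum(x * y for x, y in zip(row, col))


def multiply_row_oriented(a, b):
    """Multiply two matrices given as lists of rows."""
    result = [[0] * len(b[0]) for _ in a]
    for key, value in map_rows_and_columns(a, b):
        (i, j), total = reduce_dot(key, value)
        result[i][j] = total
    return result


# A is 1 x 4 and B is 4 x 2, mirroring the p = 4 example in the slides.
a = [[1, 2, 3, 4]]
b = [[1, 0], [0, 1], [2, 0], [0, 2]]
print(multiply_row_oriented(a, b))
```

Note the transpose of B before the map step: this is the extra orientation change that the row/column representation tends to require.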

Matrix Multiplication: Another Approach
- Map phase: emit every cell individually
  - Key: the output cell (e.g. AB 1,1) together with the position 1, 2, ... of the cell within the sum
  - Value: a single cell of A (A 1,1, A 1,2, ...) or a single cell of B (B 1,1, B 2,1, ...)
- Reduce phase: multiply the matching cells (A 1,1 × B 1,1, A 1,2 × B 2,1, A 1,3 × B 3,1, A 1,4 × B 4,1) and sum (Σ) the products to produce AB 1,1
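
One possible reading of the cell-oriented approach, sketched in plain Python: every cell is emitted on its own, tagged with the matrix it came from and its position `k` in the sum, and the reducer pairs cells with the same `k`. The tagging scheme and all names here are assumptions made for illustration, not the layout used in the tutorial's exercise.

```python
from collections import defaultdict


def map_cells(a, b):
    """Map phase: emit each cell once for every output cell it contributes to."""
    m, p, n = len(a), len(b), len(b[0])
    for i in range(m):
        for k in range(p):
            for j in range(n):
                yield (i, j), ("A", k, a[i][k])
    for k in range(p):
        for j in range(n):
            for i in range(m):
                yield (i, j), ("B", k, b[k][j])


def reduce_cell(values):
    """Reduce phase: pair A and B cells on position k, multiply, and sum."""
    a_vals, b_vals = {}, {}
    for name, k, value in values:
        (a_vals if name == "A" else b_vals)[k] = value
    return sum(a_vals[k] * b_vals[k] for k in a_vals)


def multiply_cell_oriented(a, b):
    """Multiply two matrices given as lists of rows, one cell at a time."""
    groups = defaultdict(list)
    for key, value in map_cells(a, b):
        groups[key].append(value)
    result = [[0] * len(b[0]) for _ in a]
    for (i, j), values in groups.items():
        result[i][j] = reduce_cell(values)
    return result


print(multiply_cell_oriented([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

No row or column ever has to fit in memory as a unit, but each cell is emitted many times, which is the excess data that makes this representation slow for dense matrices.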

Break
We'll take a quick break, then come back together for the matrix multiplication exercise.

Matrix Multiplication Walkthrough
See virtual machine materials.

Zipfian Distribution
- Similar to Pareto, power law, exponential
- In this case, one thing takes a lot longer than others
- Perfect example: word count

Zipfian Distributions are Everywhere
Types of data:
- Geographic data
- Economic data
- Space/telescope observations
Figure: Shanghai at Night (courtesy NASA)

Map/Reduce Task Completion Time
Figure: process completion time for each reduce task
Because rmr's design passes each reducer a list of values instead of streaming them, a very frequent key can actually make it impossible for a task to complete.

Strategies for Dealing with Zipfian Distributions
- Throw out things that occur frequently (stop words)
- Break things down further
  - Sample data
  - Model the distribution beforehand
  - For high-frequency keys, partition them out to multiple tasks

How to Sample Data
It depends: what is the distribution of your data? How good does your sample need to be?
Approaches:
- Read the first few entries of the first file
- Read the top few entries of each input file
- If the input is splittable, you can read the top few entries of each input split (in Java)
- If you can't do any of these, you have to read the whole file, which can be expensive
Remember: stuff in the head of a Zipfian distribution is REALLY prevalent, so just reading the first few records works 90% of the time.
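
As a rough sketch of the "read the first few entries" approach, the Python below takes the head of a record stream and reports which keys dominate it. The function names and the 30% threshold are arbitrary choices for illustration; a real job would read from HDFS rather than from a list.

```python
from collections import Counter
from itertools import islice


def sample_head(records, n):
    """Return the first n records of an iterable without reading the rest."""
    return list(islice(records, n))


def frequent_keys(sample, threshold):
    """Return the keys whose share of the sample is at least `threshold`."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {key for key, count in counts.items()
            if total and count / total >= threshold}


words = "the cat and the dog and the bird saw the fish".split()
sample = sample_head(iter(words), 8)  # only the first 8 words are read
head_keys = frequent_keys(sample, threshold=0.3)
print(head_keys)
```

Because the head of a Zipfian distribution is so prevalent, even this small sample picks out "the" as a high-frequency key.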

Breaking Up High-Frequency Tasks
- For head tasks, prefix the key with a random number between 0 and the number of reduce slots in your cluster
- Use the number 0 in the composite key for non-head tasks

Flow:
1. Start with the original key. Is the key a high-frequency key?
   - Yes: the composite key is (random number, original key)
   - No: the composite key is (0, original key)
2. First reduce: runs on the composite key and emits partial results under the original key
3. Second reduce: runs on the original key and produces the final output
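
The composite-key trick above can be sketched as a two-stage count in plain Python. This is a minimal illustration, not rmr code: `head_keys` would come from sampling, `reduce_slots` stands in for the number of reduce slots in the cluster, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def salted_key(key, head_keys, reduce_slots, rng=random):
    """Prefix head keys with a random slot number and all other keys with 0."""
    salt = rng.randrange(reduce_slots) if key in head_keys else 0
    return (salt, key)


def sum_by_key(pairs):
    """Stand-in for one reduce pass: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def two_stage_count(words, head_keys, reduce_slots, rng=random):
    """Count words with a first reduce on (salt, word) and a second on word."""
    # First reduce: a head word is spread over up to `reduce_slots` groups.
    partial = sum_by_key(
        (salted_key(word, head_keys, reduce_slots, rng), 1) for word in words
    )
    # Second reduce: drop the salt and combine the partial counts.
    return sum_by_key((word, count) for (_salt, word), count in partial.items())


words = ["the"] * 10 + ["cat"] * 2
print(two_stage_count(words, head_keys={"the"}, reduce_slots=4))
```

The final counts are the same as a single-pass word count; the only difference is that no single reducer in the first pass has to hold every value for a head key.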

Exercise: Zipfian Distribution
See virtual machine materials.

Section 3: Topic Modeling
- Overview of topic modeling and LDA
- Sliding window analysis
- Trending Twitter topics exercise

Topic Modeling
- Methods for discovering topics in a collection of documents
- Used for machine learning, natural language processing
- Several algorithms for topic modeling available

Latent Dirichlet Allocation (LDA)
- Common topic model
- Allows documents to have a mixture of topics
- Each topic is represented by a list of terms
- Available in R's topicmodels package

Trending Topics and Twitter
- Trending topics are a popular feature of Twitter
- Trending topics can be found using many methods; one simple method is term frequency
- LDA is an advanced method that allows topics to be determined by the content of a set of Tweets
- Computing LDA over millions of Tweets will not be possible on a single home computer

Implementing LDA using RHadoop
Solution 1: Parallelize LDA to split computations over multiple machines
- Computes one topic model for the whole data set
- Requires in-depth understanding of the LDA algorithm

Implementing LDA using RHadoop
Solution 2: Analysis of trending topics over time
- By computing LDA over Tweets according to time, we can identify changes in topics over time
- RHadoop will allow us to run LDA over multiple time slices in parallel
- We can make use of R's topicmodels package to run LDA on our data

Sample Tweet
    {
      "profile_image_url": "http:// profile_image.jpg",
      "from_user_name": "Donald Duck",
      "from_user_id_str": "111222333",
      "created_at": "Fri, 10 Aug 2012 11:30:23 +0000",
      "id_str": "123456789123456789",
      "from_user": "big_quack",
      "to_user_id": 0,
      "text": "What's the Big Idea!? @Mickey",
      "metadata": {"result_type": "recent"},
      "profile_image_url_https": "https:// profile_image.jpg",
      "id": 123456789123456789,
      "to_user": null,
      "geo": null,
      "from_user_id": 111222333,
      "to_user_name": null,
      "iso_language_code": "en",
      "to_user_id_str": "0",
      "source": "http://donalds_iphone "
    }

Sample Tweet (fields of interest)
    {
      ...
      "created_at": "Fri, 10 Aug 2012 11:30:23 +0000",
      ...
      "text": "What's the Big Idea!? @Mickey",
      ...
    }
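
Pulling just these two fields out of a raw Tweet record might look like the Python sketch below. It assumes one JSON object per record, as in the sample above; the helper name `extract_fields` is hypothetical and the record is abbreviated to a few of the sample's fields.

```python
import json


def extract_fields(raw_tweet):
    """Parse one JSON-encoded Tweet and keep only the timestamp and the text."""
    tweet = json.loads(raw_tweet)
    return tweet["created_at"], tweet["text"]


# An abbreviated version of the sample Tweet, serialized as one JSON record.
raw_tweet = json.dumps({
    "created_at": "Fri, 10 Aug 2012 11:30:23 +0000",
    "id_str": "123456789123456789",
    "from_user": "big_quack",
    "text": "What's the Big Idea!? @Mickey",
    "geo": None,
})

created_at, text = extract_fields(raw_tweet)
print(created_at, "|", text)
```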

Sliding Window Analysis
Tweets can be grouped according to what hour (or day, month, etc.) they were created.
- Window boundaries: 12:00, 13:00, 14:00
- Tweets: Tweet1 (12:23), Tweet2 (12:54), Tweet3 (13:02), Tweet4 (13:47), Tweet5 (14:17)

More overlap in time windows will smooth the change in trending topics over time.
- Window boundaries: 11:30, 12:00, 12:30, 13:00, 13:30, 14:00
- Each Tweet now appears in two windows: Tweet1 (12:23), Tweet2 (12:54), Tweet3 (13:02), Tweet4 (13:47), Tweet5 (14:17)

Sliding Window Times
Simple, non-overlapping hourly windows can be created by dividing the UTC time in seconds by 3600, rounding down, and using that as the key.
Example: Fri, 10 Aug 2012 11:30:23 +0000 = 1344598223 seconds, and 1344598223 / 3600 rounded down gives the key 373499.
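
The key computation can be sketched in Python as follows. `hourly_key` follows the rule stated above; `overlapping_keys` is an assumed extension for overlapping windows that uses a one-hour window advancing every 30 minutes, which matches the spacing of the boundaries on the previous slide but is not spelled out in the deck. All function names are hypothetical.

```python
from email.utils import parsedate_to_datetime


def epoch_seconds(created_at):
    """Convert a timestamp like 'Fri, 10 Aug 2012 11:30:23 +0000' to seconds."""
    return int(parsedate_to_datetime(created_at).timestamp())


def hourly_key(created_at):
    """Non-overlapping hourly window: UTC seconds divided by 3600, rounded down."""
    return epoch_seconds(created_at) // 3600


def overlapping_keys(created_at, length=3600, step=1800):
    """Start times (in seconds) of every window of `length` that contains the
    timestamp, when a new window starts every `step` seconds."""
    ts = epoch_seconds(created_at)
    latest_start = ts // step * step
    return [latest_start - k * step for k in range(length // step)]


stamp = "Fri, 10 Aug 2012 11:30:23 +0000"
print(epoch_seconds(stamp), hourly_key(stamp), overlapping_keys(stamp))
```

With overlapping windows a mapper would emit the same Tweet once per returned key, so each Tweet lands in `length // step` reducers.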

Using the LDA Libraries
Step 1: Create a document source: VectorSource( )
  Constructs a source for a vector
Step 2: Create a corpus: Corpus( )
  Takes a Source object and constructs a corpus
Step 3: Create a document-term matrix: DocumentTermMatrix( )
  Takes a Corpus and constructs a document-term matrix
Step 4: Create a topic model with n topics: LDA( , n)
  Takes a DocumentTermMatrix and a number of topics and computes an LDA model
Step 5: Retrieve topics and terms using the respective convenience functions: topics( ) and terms( )

Exercise: Trending Tweets
See virtual machine materials.

Recap
RHadoop provides tools for scalable big data analytics that:
- Can be quickly prototyped in fewer lines of code (Word Count)
- Fit our (Big) Data Science Lifecycle (Word Count + Zipfian Distribution)
- Make use of R's extensive statistical and data modeling libraries (Trending Twitter Topics Analysis with LDA)

Acknowledgments
Special thanks to Revolution Analytics and Antonio Piccolboni for developing and sharing RHadoop and fielding questions!
Thanks to our Booz Allen helpers: Josh Sullivan, Doug Gartner, Jay Owen, Brian Bende, Drew Farris, Charles Glover, Paul Tamburello, Michael Kim, and Jay Shipper

For more information
- Office hours at 2:30 pm on Wednesday
- Come visit our exhibitor booth: Sponsor Pavilion #59
- Visit boozallen.com
- Ed Kohlwey: E-mail: kohlwey_edmund@bah.com, Twitter: @ekohlwey
- Stephanie Beben: E-mail: beben_stephanie@bah.com, Twitter: @StephanieBeben

Appendix

Installation (The Old Fashioned Way)

    # Fetch the rmr-2.0 branch of RHadoop
    git clone -b rmr-2.0 https://github.com/RevolutionAnalytics/RHadoop.git RHadoop-src

    # Install the R packages that rmr depends on
    echo 'dir.create(.libPaths()[1],recursive=T);install.packages(c("digest","itertools","functional","RJSONIO","Rcpp"),repos="http://cran.us.r-project.org")' | R --no-save --no-restore

    # Install quickcheck and rmr from the checked-out sources
    R CMD INSTALL RHadoop-src/quickcheck/ RHadoop-src/rmr/pkg/