Crunching Big Data with R And Hadoop!

Transcription

1 Crunching Big Data with R and Hadoop!
Strata/Hadoop World NYC 2012, 23 October 2012
Flash drives with tutorial materials are near the door; please start downloading the tutorial materials onto your laptop. There is a PDF with instructions on the flash drives.

2 Today's Agenda
Part 1: Basics
  Introduction
  What is Hadoop
  What is RHadoop
  Basic R syntax for the uninitiated
  Word count example
Part 2: Scalability
  Scalability Tradeoffs and Tabular Data
  Matrix Multiplication Example
  Break
  Dealing with Zipfian distribution of data
  Advanced word count example
Part 3: Trending Topics
  Intro to LDA
  Description of Problem
  LDA Trending Topics Example
Wrap-Up and Questions

3 Section 1: Introduction to RHadoop
Discussion of Hadoop, R, and Map/Reduce
R for Hadoop Users
RHadoop's take on Map/Reduce
Word Count Exercise

4 What is RHadoop?
Integration between the R statistical package and Hadoop's Distributed File System and Map/Reduce computation engine
Moves algorithm execution to the data
Provides access to lots of high-quality statistical libraries
Speeds work by processing in parallel

5 Use RHadoop:
For data exploration
To run a bunch of tasks in parallel
To sort data
To sample data
To join data
This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

6 Don't Use RHadoop:
To implement your next ultra-advanced, high-performance machine learning algorithm based on Map/Reduce

7 Don't Be Embarrassed to be Parallel (Or Serial)
Overall philosophy
  Use simple methods first
  Use complicated methods only if simple ones fail
All examples are reasonably straightforward

8 The (Big) Data Science Lifecycle
Sample data
  Because there's not usually a reason to use the whole data set
Model data
  Modeling will help you understand where interesting things are
Interpret
  Find trends, iterate
[Diagram: Sample/Summarize -> Model -> Interpret]

9 Hadoop and its Components
Runs over many machines
Highly scalable architecture
MapReduce
  Distributes processing to data
HDFS
  Holds your data
[Diagram: Hadoop consists of Map/Reduce and HDFS; parallel processing moves to the data]

10 What is MapReduce?
Popularized in Google's 2004 paper
Makes writing distributed algorithms easier
Removes freedom (synchronization, locking) to achieve safety

11 MapReduce Introduction
All data is represented as key-value pairs
Two major phases
  Map
    Reads the input data and outputs intermediate key-value pairs
  Reduce
    Values with the same key are sent to the same reducer and (optionally) summarized

12 Map Phase
Raw data: How much wood would a woodchuck chuck if a woodchuck could chuck wood?
[Diagram: the raw data is split across three mappers, each emitting a (word, 1) pair per word]
  Mapper: how 1, much 1, wood 1
  Mapper: would 1, a 1, woodchuck 1
  Mapper: chuck 1, if 1, a 1, ...

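13 Reduce Phase
Raw data: How much wood would a woodchuck chuck if a woodchuck could chuck wood?
[Diagram: mapper output is grouped by key and sent to three reducers, which sum the counts]
  Mapper: how 1, much 1, wood 1
  Mapper: would 1, a 1, woodchuck 1
  Mapper: chuck 1, if 1, a 1, ...
  Reducer: a 2, how 1, woodchuck 2
  Reducer: wood 2, would 1
  Reducer: chuck 2, if 1, much ...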
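The two phases above can be sketched without Hadoop at all. The following is a minimal plain-Python illustration, not RHadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the in-memory `shuffle` step only stands in for the grouping that the framework performs between the map and reduce phases.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        word = word.strip("?.,!")
        if word:
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key (done by the framework in real Map/Reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: summarize all values that share a key, here by summing them."""
    return (key, sum(values))

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())

# Each list element plays the role of one mapper's share of the raw data.
counts = word_count([
    "How much wood would a woodchuck chuck",
    "if a woodchuck could chuck wood?",
])
print(counts)
```

Because every mapper only emits pairs and every reducer only sees one key's values, neither side needs locks or shared state, which is the safety-for-freedom trade described earlier.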

14 A Brief Introduction to R
We'll quickly walk through ~/Examples/Solutions/r-intro.R from your virtual machine

15 MapReduce in RHadoop
Please refer to ~/Examples/Solutions/rmr-intro.r on your virtual machine

16 Intro to RHadoop and Word Count
Simple word count solution can be found in ~/Examples/Solutions/
Simple word count partial solution is available in ~/Examples/Fill-In/

17 Recap
RHadoop advantages over Java MapReduce
  Many fewer lines of code
  Make use of existing R functions from within Map/Reduce

18 Section 2: Breaking Things Up
Computing over tabular data: scalability challenges
Matrix Multiplication Exercise
MapReduce and Zipfian Distribution
Zipfian Distribution Exercise

19 Tabular Data
Ubiquitous
  Financial information
  Medical information
  Surveys
  Market reports
  Anything you put in Excel
Can get big quickly due to the denormalized nature of tables in analytics

20 Map/Reduce Techniques for Dealing with Tabular Big Data
The situations where you need to do this are rare, but painful when you arrive at them!
Why might you do this?
  Exploratory analysis, "kitchen sink" method using covariance matrix

21 Example: Correlation Matrix for FRED Economic Time Series Data
[Figure: Correlation Matrix, Sorted by Magnitude Horizontally]
Cool Things:
  Random collection of economic variables
  Groupings are shown in the natural ordering
    Relatedness to other things
    Un-relatedness to other things
  We can quickly identify data that is unrelated to much of anything
  We can quickly discover outliers

22 FRED Data: General Outcomes
Location is closely related to correlation
Earnings and the civilian labor force variables are strongly negatively correlated with residence adjustments
Government employment is correlated with the civilian labor force and with educational and health services employment
Correlation between heating oil in New York Harbor and jet fuel on the Gulf Coast

23 FRED Data: General Outcomes
The unemployment rate is negatively correlated with hours worked by manufacturing personnel, only in some regions
Per-capita personal income and foreign travel services are strongly negatively correlated

24 Tradeoffs for Large Matrix Operations Like Correlation Matrix
Tradeoffs must be made for different types of operations
As you break the problem down and increase scalability, you incur more overhead/latency from structure metadata
We'll use matrix multiplication as an illustrative example of tradeoffs
  Map/reduce actually isn't great at representing matrix multiplication
  Hence this is primarily illustrative

25 Representing Tabular Data
Two main ways
  Row (or column) oriented
  Cell oriented
Row/Column oriented
  Column (or row) has to fit in memory
  Often need to transpose data (a map/reduce job) to do tasks like multiplication
Cell oriented
  Can be used to accomplish any type of operation without need to change orientation
  Operations are 100% disk based, therefore highly scalable
  Tends to be slow for denser matrices because of the excess amounts of data involved

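26 An Illustrative Example: Matrix Multiplication
$A$ is an $m \times p$ matrix and $B$ is a $p \times n$ matrix, so the product $AB$ is $m \times n$ (the slide shows $p = 4$):
$$
\begin{pmatrix} A_{1,1} & A_{1,2} & A_{1,3} & A_{1,4} \\ \vdots & & & \end{pmatrix}
\times
\begin{pmatrix} B_{1,1} & \cdots \\ B_{2,1} & \\ B_{3,1} & \\ B_{4,1} & \end{pmatrix}
=
\begin{pmatrix} AB_{1,1} & AB_{1,2} & \cdots \\ \vdots & & \end{pmatrix}
$$
$$
AB_{1,1} = A_{1,1} B_{1,1} + A_{1,2} B_{2,1} + A_{1,3} B_{3,1} + A_{1,4} B_{4,1}
$$
$$
AB_{1,2} = A_{1,1} B_{1,2} + A_{1,2} B_{2,2} + A_{1,3} B_{3,2} + A_{1,4} B_{4,2}
$$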

27 Matrix Multiplication: First Approach
Map Phase
  Key: $AB_{1,1}$; Value: the row $(A_{1,1}, A_{1,2}, A_{1,3}, A_{1,4})$ of $A$
  Key: $AB_{1,1}$; Value: the column $(B_{1,1}, B_{2,1}, B_{3,1}, B_{4,1})$ of $B$
Reduce Phase
  For key $AB_{1,1}$, multiply the row of $A$ by the column of $B$ element by element and sum ($\Sigma$) the products to produce $AB_{1,1}$

28 Matrix Multiplication: Another Approach
Map Phase
  Key: $AB_{1,1}$; Value: a single cell of $A$ ($A_{1,1}$, $A_{1,2}$, ...)
  Key: $AB_{1,1}$; Value: a single cell of $B$ ($B_{1,1}$, $B_{2,1}$, ...)
Reduce Phase
  For key $AB_{1,1}$, pair up the cells and sum ($\Sigma$) the products $A_{1,1} B_{1,1}$, $A_{1,2} B_{2,1}$, $A_{1,3} B_{3,1}$, $A_{1,4} B_{4,1}$ to produce $AB_{1,1}$
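A cell-oriented multiplication along the lines of this second approach can be sketched in plain Python. This is an illustrative in-memory simulation, not the tutorial's RHadoop solution: the names `map_cells`, `reduce_cell`, and `multiply` are hypothetical, matrices are assumed to be lists of rows, and the tagging of each value with its source matrix and inner index is one possible encoding.

```python
from collections import defaultdict

def map_cells(A, B):
    """Map: send each cell of A and B to every output cell (i, j) that needs it."""
    m, p, n = len(A), len(B), len(B[0])
    for i in range(m):
        for k in range(p):
            for j in range(n):
                # A[i][k] contributes to every cell in row i of the product.
                yield ((i, j), ("A", k, A[i][k]))
    for k in range(p):
        for j in range(n):
            for i in range(m):
                # B[k][j] contributes to every cell in column j of the product.
                yield ((i, j), ("B", k, B[k][j]))

def reduce_cell(values):
    """Reduce: pair cells of A and B by inner index k and sum the products."""
    a = {k: v for tag, k, v in values if tag == "A"}
    b = {k: v for tag, k, v in values if tag == "B"}
    return sum(a[k] * b[k] for k in a)

def multiply(A, B):
    """Group mapper output by output cell, then reduce each group."""
    groups = defaultdict(list)
    for key, value in map_cells(A, B):
        groups[key].append(value)
    m, n = len(A), len(B[0])
    return [[reduce_cell(groups[(i, j)]) for j in range(n)] for i in range(m)]

A = [[1, 2, 3, 4]]                      # 1 x 4, matching p = 4 on the slide
B = [[1, 0], [0, 1], [2, 0], [0, 2]]    # 4 x 2
product = multiply(A, B)
print(product)
```

Note that every cell of A is emitted n times and every cell of B is emitted m times, which illustrates why the cell-oriented layout moves an excess amount of data for dense matrices.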

29 Break
We'll take a quick break, then come back together for the matrix multiplication exercise

30 Matrix Multiplication Walkthrough
See virtual machine materials

31 Zipfian Distribution
Similar to Pareto, power law, exponential
In this case, one thing takes a lot longer than others
Perfect example: word count

32 Zipfian Distributions are Everywhere
Types of Data
  Geographic data
  Economic data
  Space/Telescope observations
[Image: Shanghai at Night (Courtesy NASA)]

33 Map/Reduce Task Completion Time
[Figure: process completion time for each reduce task]
Because of RMR's design, which passes each reducer a list of values instead of streaming them, this skew can actually make it impossible for a task to complete

34 Strategies for Dealing with Zipfian Distributions
Throw out things that occur frequently: stop words
Break things down further
  Sample data
  Model distribution beforehand
  For high-frequency keys, partition them out to multiple tasks

35 How to Sample Data
It depends: what is the distribution of your data? How good does your sample need to be?
Approaches
  Read the first few entries of the first file
  Read the top few entries of each input file
  If input is splittable, then you can read the top few entries of each input split (in Java)
  If you can't do any of these, you have to read the whole file, which can be expensive
Remember: stuff in the head of a Zipfian distribution is REALLY prevalent, so just reading the first few records works 90% of the time

36 Breaking Up High-Frequency Tasks
For head (high-frequency) keys, prefix the key with a random number between 0 and the number of reduce slots in your cluster
Use the number 0 in the composite key for non-head keys

37 Breaking Up High-Frequency Tasks
[Flowchart]
original key -> Is the key a high-frequency key?
  Yes: composite key (random, original key)
  No: composite key (0, original key)
-> First Reduce -> original key -> Second Reduce -> final output
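The flow above can be sketched as a two-stage count in plain Python. This is a minimal simulation under stated assumptions, not RHadoop code: `salted_count`, `group_by_key`, `head_keys`, and `reduce_slots` are hypothetical names, the set of high-frequency keys is assumed to be known in advance (for example from a sample), and the salt is drawn from 0 up to but not including the number of reduce slots.

```python
import random
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def salted_count(words, head_keys, reduce_slots, rng):
    """Count words in two reduce passes so no single reducer gets a whole head key."""
    # Map: build a composite key (salt, word). Head keys get a random salt,
    # all other keys get salt 0.
    pairs = []
    for word in words:
        salt = rng.randrange(reduce_slots) if word in head_keys else 0
        pairs.append(((salt, word), 1))
    # First reduce: partial sums per composite key, re-keyed by the original key.
    partial = [(word, sum(values))
               for (salt, word), values in group_by_key(pairs).items()]
    # Second reduce: combine the (at most reduce_slots) partial sums per key.
    return {word: sum(values) for word, values in group_by_key(partial).items()}

words = ["the"] * 1000 + ["zebra"] * 3 + ["aardvark"]
totals = salted_count(words, head_keys={"the"}, reduce_slots=4, rng=random.Random(0))
print(totals)
```

The second reduce sees at most `reduce_slots` values for a head key instead of every occurrence, so its value list stays small regardless of how skewed the input is.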

38 Exercise: Zipfian Distribution
See virtual machine materials

39 Section 3: Topic Modeling
Overview of Topic Modeling and LDA
Sliding Window Analysis
Trending Twitter Topics Exercise

40 Topic Modeling
Methods for discovering topics in a collection of documents
Used for machine learning, natural language processing
Several algorithms for topic modeling available

41 Latent Dirichlet Allocation (LDA)
Common topic model
Allows documents to have a mixture of topics
Each topic represented by a list of terms
Available in R's topicmodels package

42 Trending Topics and Twitter
Trending topics are a popular feature of Twitter
Trending topics can be found using many methods; one simple method is term frequency
LDA is an advanced method that allows topics to be determined by the content of a set of Tweets
Computing LDA over millions of Tweets will not be possible on a single home computer

43 Implementing LDA using RHadoop
Solution 1: Parallelize LDA to split computations over multiple machines
  Computes one topic model for the whole data set
  Requires in-depth understanding of the LDA algorithm

44 Implementing LDA using RHadoop
Solution 2: Analysis of trending topics over time
  Computing LDA over Tweets according to time, we can identify changes in topics over time
  RHadoop will allow us to run LDA over multiple time slices in parallel
  We can make use of R's topicmodels package to run LDA on our data

45 Sample Tweet
{
  "profile_image_url": " profile_image.jpg",
  "from_user_name": "Donald Duck",
  "from_user_id_str": " ",
  "created_at": "Fri, 10 Aug :30: ",
  "id_str": " ",
  "from_user": "big_quack",
  "to_user_id": 0,
  "text": "What's the Big ",
  "metadata": {"result_type": "recent"},
  "profile_image_url_https": " profile_image.jpg",
  "id": ,
  "to_user": null,
  "geo": null,
  "from_user_id": ,
  "to_user_name": null,
  "iso_language_code": "en",
  "to_user_id_str": "0",
  "source": " "
}

46 Sample Tweet
{
  "created_at": "Fri, 10 Aug :30: ",
  "text": "What's the Big "
}

47 Sliding Window Analysis
Tweets can be grouped according to the hour (or day, month, etc.) they were created
[Diagram: non-overlapping windows marked at 12:00, 13:00, 14:00]
  12:00: Tweet1 (12:23), Tweet2 (12:54)
  13:00: Tweet3 (13:02), Tweet4 (13:47)
  14:00: Tweet5 (14:17)
More overlap in time windows will smooth the change in trending topics over time
[Diagram: overlapping windows marked at 11:30, 12:00, 12:30, 13:00, 13:30, 14:00; each of Tweet1 (12:23), Tweet2 (12:54), Tweet3 (13:02), Tweet4 (13:47), and Tweet5 (14:17) appears in two windows]
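One way to produce overlapping windows is to emit each tweet once per window that contains it. The sketch below is plain Python rather than RHadoop code, and it assumes one-hour windows that start every 30 minutes; the slide does not state the window width, so `width`, `step`, and the helper names `window_starts`, `to_seconds`, and `hhmm` are all illustrative choices. Times are seconds since midnight to keep the example small.

```python
from collections import defaultdict

def window_starts(seconds, width=3600, step=1800):
    """Return the start of every window [start, start + width) containing the time.

    Windows start at multiples of `step`; `width` is assumed to be a multiple of `step`.
    """
    last_start = (seconds // step) * step
    first_start = last_start - width + step
    return list(range(first_start, last_start + 1, step))

def to_seconds(clock):
    """Convert 'HH:MM' to seconds since midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60

def hhmm(seconds):
    """Convert seconds since midnight back to 'HH:MM'."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}"

# Tweet times taken from the slide.
tweets = {"Tweet1": "12:23", "Tweet2": "12:54", "Tweet3": "13:02",
          "Tweet4": "13:47", "Tweet5": "14:17"}

# Map step: one (window, tweet) pair per containing window; the grouping
# below stands in for the shuffle that would deliver each window to a reducer.
windows = defaultdict(list)
for name, clock in tweets.items():
    for start in window_starts(to_seconds(clock)):
        windows[hhmm(start)].append(name)

for start in sorted(windows):
    print(start, windows[start])
```

Setting `step` equal to `width` reduces this to the non-overlapping case, where each tweet lands in exactly one window.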

48 Sliding Window Times
Simple, non-overlapping hourly windows can be created by dividing the UTC time in seconds by 3600, rounding down, and using that as the key
Ex: Fri, 10 Aug :30: =
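The key computation described here is a single integer division. The following plain-Python sketch assumes the `created_at` field is an RFC 2822 style date string with an explicit UTC offset; the function name `hour_key` and the timestamps used are hypothetical and chosen only for illustration, not taken from the tutorial data.

```python
from email.utils import parsedate_to_datetime

def hour_key(created_at):
    """Map a tweet timestamp to its hourly window: UTC seconds // 3600."""
    moment = parsedate_to_datetime(created_at)   # timezone-aware datetime
    return int(moment.timestamp()) // 3600       # floor division rounds down

# Hypothetical timestamps: the first two share an hour, the third does not.
first = hour_key("Tue, 23 Oct 2012 09:15:00 +0000")
second = hour_key("Tue, 23 Oct 2012 09:59:59 +0000")
third = hour_key("Tue, 23 Oct 2012 10:00:00 +0000")
print(first, second, third)
```

Tweets whose keys are equal fall in the same hourly window, so the key can be used directly as the map output key, and multiplying it by 3600 recovers the window's start time in UTC seconds.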

49 Using the LDA Libraries
Step 1: Create a document source: VectorSource( )
  Constructs a source for a vector
Step 2: Create a corpus: Corpus( )
  Takes a Source object and constructs a corpus
Step 3: Create a document-term matrix: DocumentTermMatrix( )
  Takes a Corpus and constructs a document-term matrix
Step 4: Create a topic model with n topics: LDA( , n)
  Takes a DocumentTermMatrix and a number of topics and computes an LDA model
Step 5: Retrieve topics and terms using the respective convenience functions: topics( ), terms( )

50 Exercise: Trending Tweets
See virtual machine materials

51 Recap
RHadoop provides tools for scalable big data analytics that
  Can be quickly prototyped in fewer lines of code
    Word Count
  Fit our (Big) Data Science Lifecycle
    Word Count + Zipfian Distribution
  Make use of R's extensive statistical and data modeling libraries
    Trending Twitter Topics Analysis with LDA

52 Acknowledgments
Special thanks to Revolution Analytics and Antonio Piccolboni for developing and sharing RHadoop and fielding questions!
Thanks to our Booz Allen Helpers: Josh Sullivan, Doug Gartner, Jay Owen, Brian Bende, Drew Farris, Charles Glover, Paul Tamburello, Michael Kim, and Jay Shipper

53 For more information
Office hours at 2:30 pm on Wednesday
Come visit our exhibitor booth: Sponsor Pavilion #59
Visit boozallen.com
Ed Kohlwey: E-mail: kohlwey_edmund@bah.com
Stephanie Beben: E-mail: beben_stephanie@bah.com

54 Appendix

55 Installation (The Old Fashioned Way)
git clone -b rmr RevolutionAnalytics/RHadoop.git RHadoop-src
echo 'dir.create(.libPaths()[1],recursive=T);install.packages(c("digest","itertools","functional","RJSONIO","Rcpp"),repos="http://cran.us.r-project.org")' | R --no-save --no-resume
R CMD INSTALL RHadoop-src/quickcheck/ RHadoop-src/rmr/pkg/