Massive scale analytics with Stratosphere using R

1 Massive scale analytics with Stratosphere using R
Jose Luis Lopez Pino, jllopezpino@gmail.com
Database Systems and Information Management, Technische Universität Berlin
Supervised by Volker Markl. Advised by Marcus Leich and Kostas Tzoumas.
August 28, 2014

2 Introduction

3 Data analysis to the masses
- Deep analytics [1]: sophisticated statistical methods, such as linear models, clustering or classification, that are frequently used to extract knowledge from data.
- Data warehousing and BI can't answer all the questions.
- The ever-growing number of new data sources and tools makes it worse.
- There is demand for answering these questions.
- At small scale: data pipelining tools (RapidMiner) and numerical computing environments (R, Matlab or SPSS).
- Big data brings new opportunities to the market but also presents unfamiliar challenges.
[1] Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.

4 Options
- R: a numerical computing environment and DSL for statistics. Unlike SQL, it is not a query language. Successful at small scale (in combination with CRAN packages).
- MapReduce/Hadoop: highly parallel programs, but lacking in expressivity.
- HDFS: a de facto standard for storing large amounts of data.
- Stratosphere: platform for massively parallel computing and big data analytics.
- PACT: MapReduce + new operators + iterations.
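To make the MapReduce model listed above more concrete, the following is a minimal single-process Python sketch of a term-frequency job. It only simulates the map, shuffle and reduce phases in memory; it does not use the Hadoop or Stratosphere APIs, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (term, 1) pair for every whitespace-separated term."""
    return [(term.lower(), 1) for term in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, counts):
    """Reduce: sum the partial counts of a single term."""
    return term, sum(counts)


def term_frequencies(lines):
    """Run the three phases sequentially (illustration only, no parallelism)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(term, counts)
                for term, counts in shuffle(pairs).items())
```

Every job has to be squeezed into this one map/reduce shape, which is one way to read the "lacking in expressivity" remark; operators beyond map and reduce, and native iterations, are what the PACT model adds.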

5 Basic terms and definitions
KDD is composed of nine steps: understanding the domain and the goals, creating the target data set, cleaning and preprocessing the data, data reduction and projection, choosing a data mining method, choosing the data mining algorithm, mining the data, interpreting the patterns, and using the discovered knowledge.
Figure: Overview of the process [2]
[2] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27-34, November 1996.

6 Motivation

7 Clustering
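The content of this slide did not survive extraction. As a reminder of what a clustering task involves, here is a minimal Python sketch of k-means (Lloyd's algorithm), the algorithm named in the performance figures. It is an illustration under simplifying assumptions (2-D points, caller-supplied initial centroids, a hypothetical function name), not the implementation used in the presented library.

```python
def kmeans(points, centroids, iterations=100):
    """Lloyd's algorithm on 2-D points; returns the final centroids.

    Assumption: initial centroids are supplied by the caller, so the
    result is deterministic.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append((
                    sum(px for px, _ in members) / len(members),
                    sum(py for _, py in members) / len(members),
                ))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:  # converged before the iteration limit
            break
        centroids = updated
    return centroids
```

Each iteration depends on the centroids produced by the previous one, which is why iterative tasks of this kind benefit from a platform with native support for iterations.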

8 Classification

9 Frequent Terms

10 Writing massively parallel programs
- It is a cumbersome and onerous process.
- We need a single tool.
- We need tools that can process anything from a small amount of data up to very large volumes.
- The majority of data researchers are highly skilled in R and statistics but poorly skilled in Big Data systems and in implementing machine learning algorithms [3] [4].
- Although Stratosphere offers a more expressive interface, writing a parallel program is still not a trivial job.
[3] Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O'Reilly Media, Inc.
[4] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12), 2012.

11 Relation with the KDD process
- Data extraction is covered by other solutions.
- Pre-processing and transformation seem difficult.
- Data mining: where we have a competitive advantage.
- Data visualization is a different problem.

12 Design goals
- Ease of use: ready-to-use algorithms.
- Design a library.
- Facilitate working with data.
- Easy to distribute.
- Focus on algorithms that scale.

13 Our approach

14 Architecture

15 Architecture

16 Library: Goals
- Classification, clustering and regression.
- No Free Lunch Theorem: more than one algorithm.
- Presence in other ML libraries.
- Large-scale.
- Ensemble scenarios.
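The "more than one algorithm" and "ensemble scenarios" goals above can be illustrated with majority voting over several classifiers. The Python sketch below is a hedged illustration only: the classifiers are hypothetical stand-ins written as plain functions, not algorithms from the presented library.

```python
from collections import Counter


def majority_vote(classifiers, sample):
    """Return the label predicted by most classifiers.

    Ties are resolved in favour of the label that was voted first,
    because Counter.most_common keeps first-encountered order among
    equal counts.
    """
    votes = [classify(sample) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Three toy classifiers (assumptions made up for this example).
def by_threshold(x):
    return "large" if x > 10 else "small"


def by_parity(x):
    return "large" if x % 2 == 0 else "small"


def by_digits(x):
    return "large" if len(str(x)) > 1 else "small"
```

For instance, `majority_vote([by_threshold, by_parity, by_digits], 12)` collects three "large" votes, while the sample `8` gets one "large" and two "small" votes and is labelled "small".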

17 Library: Example

18 R package
- Easy to distribute.
- Organized in namespaces.
- Submitting jobs to the cluster.
- Working with files.
- Mining.
- Configuration.

19 Example: Code

20 Example: Non-parallel classification example

21 Example: Parallel classification example

22 Example: Parallel clustering example

23 Performance
- For parallelizable programs, competitive with and even faster than native R programs in the same (small) file size range, thanks to pipelining.
- Competitive with R for data mining tasks with many iterations in the same file size range.
- Able to process files of a size that is out of reach for R.
- Able to scale to the gigabyte level without significant loss.

24 Performance: Frequent Terms example

25 Performance: Most favorable case for R
Figure: KMeans, 100 iterations

26 Performance: Breakdown example
Figure: Clustering example, non-parallel breakdown (time in seconds)

27 Performance: Scalability example
Figure: Frequent Terms, parallel scalability

28 Related work

29 Data mining libraries
- Don't scale: Weka and scikit-learn.
- Large-scale:
  - Mahout: limited set of problems.
  - MLlib: also facilitates implementation of new algorithms.
  - Oryx.
- In-database: MADlib and PivotalR.

30 Data intensive computation with R
- External memory. Don't scale out: biglm, bigmemory, ff, foreach.
- RevoScaleR: xdf files and Hadoop.
- Divide and recombine: it's necessary to use the MR model.
- Query languages: limited expressivity. Good for the first step of the KDD process.
- Distributed collection manipulation: limited set of operators. Presto and SparkR.
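The divide-and-recombine approach listed above can be sketched in a few lines of Python: the data is divided into subsets, a small summary is computed independently for each subset, and the summaries are recombined into a global result. This is a simplified in-memory illustration with hypothetical function names, assuming a non-empty input; real divide-and-recombine tools run the per-subset step as MapReduce jobs.

```python
def divide(values, n_chunks):
    """Divide: split a non-empty list into at most n_chunks contiguous subsets."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def summarize(chunk):
    """Per-subset step: a (count, sum) summary that is cheap to ship around."""
    return len(chunk), sum(chunk)


def recombine(summaries):
    """Recombine: merge the per-subset summaries into the global mean."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(subtotal for _, subtotal in summaries)
    return total_sum / total_count
```

The per-subset step corresponds to the map side and the recombination to the reduce side, which is why this approach ties the analyst to the MR model.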

31 Conclusions and Future Work

32 Conclusion
- Contributions:
  - Library definition.
  - File manipulation and cluster interaction.
  - Scenarios that prove the concept.
- Code very similar to the original one.
- Promising performance evaluation.

33 Future work
- Improvements in the library.
- Hybrid approaches.
- Distributed evaluation.
- Improvements in the architecture.

34 Essential bibliography
- Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
- Hadley Wickham. Advanced R Programming. CRC Press, to appear.
- Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. The Stratosphere platform for big data analytics. The VLDB Journal, pages 1-26, 2014.
- Hai Qian. PivotalR: A package for machine learning on big data. The R Journal, 6(1):57-67, June 2014.
- Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27-34, November 1996.

35 Recap
1 Introduction: Data analysis to the masses; Options; Basic terms and definitions
2 Motivation: Motivating problems; Writing massively parallel programs; Relation with the KDD process; Design goals
3 Our approach: Architecture; Library; R package; Example; Performance
4 Related work: Data mining libraries; Data intensive computation with R
5 Conclusions and Future Work: Conclusion; Future work; Essential bibliography
