Massive scale analytics with Stratosphere using R




Jose Luis Lopez Pino (jllopezpino@gmail.com)
Database Systems and Information Management, Technische Universität Berlin
Supervised by Volker Markl. Advised by Marcus Leich and Kostas Tzoumas.
August 28, 2014

Introduction

Data analysis to the masses
Deep analytics [1]: sophisticated statistical methods, such as linear models, clustering, or classification, that are frequently used to extract knowledge from data.
Data warehousing and BI can't answer all the questions, and the ever-growing number of new data sources and tools makes this worse.
There is demand for answers to these questions.
At small scale: data pipelining tools (RapidMiner) and numerical computing environments (R, Matlab, or SPSS).
Big data brings new opportunities to the market but also presents unfamiliar challenges.
[1] Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987-998. ACM, 2010.

Options
R: a numerical computing environment and DSL for statistics. Unlike SQL, it is not a query language. Successful at small scale (in combination with CRAN packages).
MapReduce/Hadoop: highly parallel programs, but the programming model lacks expressivity.
HDFS: a de facto standard for storing large amounts of data.
Stratosphere: a platform for massively parallel computing and big data analytics.
PACT: MapReduce + new operators + iterations.
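To make the MapReduce model mentioned above concrete, here is a minimal single-process sketch of a frequent-terms count written as explicit map, shuffle, and reduce steps. It is plain Python for illustration only, not the Hadoop or Stratosphere API, and the function names (map_phase, shuffle, reduce_phase, frequent_terms) are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (term, 1) pair for every term in one input line."""
    return [(term.lower(), 1) for term in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, counts):
    """Reduce: sum all counts emitted for one term."""
    return term, sum(counts)

def frequent_terms(lines, top_n):
    """Run map, shuffle, and reduce in sequence and keep the top_n terms."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    reduced = [reduce_phase(t, c) for t, c in shuffle(pairs).items()]
    # Sort by descending count, then alphabetically for a deterministic order.
    return sorted(reduced, key=lambda tc: (-tc[1], tc[0]))[:top_n]

lines = ["big data needs big tools", "R is a tool for data analysis"]
top = frequent_terms(lines, 2)
print(top)  # [('big', 2), ('data', 2)]
```

Even this small task has to be forced into two fixed phases. That rigidity is the expressivity limit the slide refers to, and an iterative algorithm would need one such job per iteration.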

Basic terms and definitions
The KDD process consists of nine steps: understanding the domain and the goals, creating the target data set, cleaning and preprocessing the data, data reduction and projection, choosing a data mining method, choosing the data mining algorithm, mining the data, interpreting the patterns, and using the discovered knowledge.
Figure: Overview of the process [2]
[2] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27-34, November 1996.

Motivation

Clustering

Classification

Frequent Terms

Writing massively parallel programs
It is a cumbersome and onerous process.
We need a single tool that can process anything from a small amount of data up to very large volumes.
Most data researchers are highly skilled in R and statistics but have little experience with big data systems and with implementing machine learning algorithms. [3] [4]
Although Stratosphere offers a more expressive interface, writing a parallel program is still not a trivial job.
[3] Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O'Reilly Media, Inc., 2013.
[4] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917-2926, 2012.

Relation with the KDD process
Data extraction is covered by other solutions.
Pre-processing and transformation seem difficult.
Data mining: where we have a competitive advantage.
Data visualization is a different problem.

Design goals
Ease of use: ready-to-use algorithms.
Design a library.
Facilitate working with data.
Easy to distribute.
Focus on algorithms that scale.

Our approach

Architecture

Library: Goals
Classification, clustering, and regression.
No Free Lunch theorem: offer more than one algorithm.
Presence in other ML libraries.
Large-scale.
Ensemble scenarios.
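The "more than one algorithm" and "ensemble scenarios" goals can be illustrated with a majority vote over several classifiers. This is a minimal Python sketch, not the library described in these slides. The three rule-based classifiers and the (height_cm, weight_kg) sample layout are hypothetical and stand in for real trained models.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the label predicted by most classifiers (ties go to the first label seen)."""
    votes = [clf(sample) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three deliberately simple, hypothetical rules over a (height_cm, weight_kg) sample.
def by_height(s):
    return "tall" if s[0] > 175 else "short"

def by_weight(s):
    return "tall" if s[1] > 70 else "short"

def by_ratio(s):
    return "tall" if s[0] / s[1] < 2.5 else "short"

ensemble = [by_height, by_weight, by_ratio]
label = majority_vote(ensemble, (180, 65))
print(label)  # short: by_weight and by_ratio outvote by_height
```

Because no single algorithm is best on every problem, a library that offers several classifiers behind one calling convention makes this kind of combination a few lines of code.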

Library: Example

R package
Easy to distribute.
Organized in namespaces.
Submitting jobs to the cluster.
Working with files.
Mining.
Configuration.

Example: Code

Example: Non-parallel classification example

Example: Parallel classification example

Example: Parallel clustering example

Performance
Competitive with, and even faster than, native R programs for every parallelizable program in the same (small) file size range, thanks to pipelining.
Competitive with R for data mining tasks with many iterations in the same file size range.
Able to process files whose volume is out of reach for R.
Able to scale to the gigabyte level without significant loss.

Performance: Frequent Terms example

Performance: Most favorable case for R
Figure: KMeans, 100 iterations
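K-means with a fixed iteration count is a useful stress case because every iteration is a full pass over the data. The sketch below is a minimal single-machine version of Lloyd's algorithm in Python. It is not the benchmarked code, and the sample points and starting centroids are made up for illustration.

```python
def kmeans(points, centroids, iterations):
    """Lloyd's algorithm on 2-D points with a fixed number of iterations."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c  # an empty cluster keeps its previous centroid
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)], iterations=100)
print(final)  # two centroids, near (1.33, 1.33) and (8.33, 8.33)
```

In a data-parallel setting the assignment step maps over the points and the update step aggregates per cluster, so 100 iterations mean 100 rounds of that pattern. On small files an in-memory loop like this one has no scheduling overhead, which is why this scenario favors native R.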

Performance: Breakdown example
Figure: Clustering example, non-parallel breakdown (time in seconds)

Performance: Scalability example
Figure: Frequent Terms, parallel scalability

Related work

Data mining libraries
Don't scale: Weka and scikit-learn.
Large-scale: Mahout (limited set of problems), MLlib (also facilitates implementation of new algorithms), and Oryx.
In-database: MADlib and PivotalR.

Data intensive computation with R
External memory. Don't scale out: biglm, bigmemory, ff, foreach.
RevoScaleR: xdf files and Hadoop.
Divide and recombine: it's necessary to use the MR model.
Query languages: limited expressivity. Good for the first step of the KDD process.
Distributed collection manipulation: limited set of operators. Presto and SparkR.
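The divide-and-recombine idea listed above can be sketched in a few lines: split the data into subsets, apply the same analysis to each subset independently, and combine the per-subset results. This Python sketch fits a no-intercept least-squares slope per subset and averages the estimates. It is an illustration under those assumptions, not the API of any R package named on this slide, and the function names are hypothetical.

```python
def split(records, n_subsets):
    """Divide: deal records round-robin into n_subsets independent subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]

def fit_slope(subset):
    """Apply: least-squares slope of y = b * x (no intercept) on one subset."""
    sxy = sum(x * y for x, y in subset)
    sxx = sum(x * x for x, _ in subset)
    return sxy / sxx

def recombine(estimates):
    """Recombine: average the per-subset estimates."""
    return sum(estimates) / len(estimates)

# Synthetic, noise-free data on the line y = 2x, so every subset yields slope 2.
records = [(float(x), 2.0 * x) for x in range(1, 13)]
slope = recombine([fit_slope(s) for s in split(records, 3)])
print(slope)  # 2.0
```

The apply step corresponds to a map over subsets and the recombine step to a reduce. On noisy data the averaged estimate only approximates the fit on the full data set.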

Conclusions and Future Work

Conclusion
Contributions:
Library definition.
File manipulation and cluster interaction.
Scenarios that prove the concept.
Code very similar to the original one.
Promising performance evaluation.

Future work
Improvements in the library.
Hybrid approaches.
Distributed evaluation.
Improvements in the architecture.

Essential bibliography
Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987-998. ACM, 2010.
Hadley Wickham. Advanced R Programming. CRC Press, 2014. To appear.
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. The Stratosphere platform for big data analytics. The VLDB Journal, pages 1-26, 2014.
Hai Qian. PivotalR: A package for machine learning on big data. The R Journal, 6(1):57-67, June 2014.
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27-34, November 1996.

Recap
1 Introduction: Data analysis to the masses; Options; Basic terms and definitions
2 Motivation: Motivating problems; Writing massively parallel programs; Relation with the KDD process; Design goals
3 Our approach: Architecture; Library; R package; Example; Performance
4 Related work: Data mining libraries; Data intensive computation with R
5 Conclusions and Future Work: Conclusion; Future work; Essential bibliography