Machine- Learning Summer School - 2015



Similar documents
CS 378 Big Data Programming. Lecture 2 Map- Reduce

CS 378 Big Data Programming

CS 378 Big Data Programming. Lecture 1 Introduc:on

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Unified Big Data Processing with Apache Spark. Matei

Big Data With Hadoop

Big Data Analytics. Lucas Rego Drumond

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Unified Big Data Analytics Pipeline. 连 城

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce

Streaming items through a cluster with Spark Streaming

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Chapter 7. Using Hadoop Cluster and MapReduce

Introduction to Spark

Big Data and Apache Hadoop s MapReduce

Spark: Making Big Data Interactive & Real-Time

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Large scale processing using Hadoop. Ján Vaňo

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

How To Scale Out Of A Nosql Database

Hadoop Ecosystem B Y R A H I M A.

Architectures for massive data management

How To Create A Data Visualization With Apache Spark And Zeppelin

Beyond Hadoop with Apache Spark and BDAS

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Analytics Hadoop and Spark

Hadoop. Sunday, November 25, 12

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Ali Ghodsi Head of PM and Engineering Databricks

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Apache Flink Next-gen data analysis. Kostas

CS 378 Big Data Programming. Lecture 24 RDDs

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data Processing. Patrick Wendell Databricks

HiBench Introduction. Carson Wang Software & Services Group

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Spark Application Carousel. Spark Summit East 2015

Apache Spark and Distributed Programming

GraySort on Apache Spark by Databricks

Introduction to Big Data with Apache Spark UC BERKELEY

Map Reduce / Hadoop / HDFS

CSE-E5430 Scalable Cloud Computing Lecture 2

Distributed Computing and Big Data: Hadoop and MapReduce

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop & Spark Using Amazon EMR

Brave New World: Hadoop vs. Spark

CSE-E5430 Scalable Cloud Computing Lecture 11

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

How Companies are! Using Spark

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Moving From Hadoop to Spark

Spark and the Big Data Library

Big Data and Scripting Systems build on top of Hadoop

Implement Hadoop jobs to extract business value from large and varied data sets

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Map Reduce & Hadoop Recommended Text:

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Scaling Out With Apache Spark. DTL Meeting Slides based on

Next-Gen Big Data Analytics using the Spark stack

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Introduction to Hadoop

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

What s next for the Berkeley Data Analytics Stack?

Cloud Computing at Google. Architecture

Can the Elephants Handle the NoSQL Onslaught?

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Dell In-Memory Appliance for Cloudera Enterprise

The Hadoop Framework

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Improving MapReduce Performance in Heterogeneous Environments

BIG DATA What it is and how to use?

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Introduction to Hadoop

Intro to Map/Reduce a.k.a. Hadoop

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Lecture Data Warehouse Systems

Open source Google-style large scale data analysis with Hadoop

Hadoop Parallel Data Processing

Big Data Research in the AMPLab: BDAS and Beyond

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Introduc8on to Apache Spark

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Big Data Frameworks: Scala and Spark Tutorial

Kafka & Redis for Big Data Solutions

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

NoSQL and Hadoop Technologies On Oracle Cloud

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

Transcription:

Machine- Learning Summer School - 2015 Big Data Programming David Franke Vast.com hbp://www.cs.utexas.edu/~dfranke/

Goals for Today Issues to address when you have big data Understand two popular big data tools/systems How they work, how they are used Vocabulary of these ecosystems Some directon on how you can get started IdenTfy some ML work using big data MLSS 2015 Big Data Programming 2

Learning from Data What can we do when the data gets big? Too big for the CPU memory of any single machine Larger than the disk storage of a single machine Recent data point: Facebook has ~800 petabyte data cluster (Hadoop) 1 petabyte = 10 15 bytes Big data is spread across a network of machines MLSS 2015 Big Data Programming 3

Learning from Big Data Need to bring distributed storage and distributed processing to bear to handle big data Issues: DistribuTng computaton across many machines Maximizing performance Minimize I/O to disk, minimize transfers across the network Combining the results of distributed computaton Recovering from failures MLSS 2015 Big Data Programming 4

Managing Big Data We ll look at two popular tools/systems One well established Hadoop One up and coming Spark Basic concepts of each How they address the aforementoned issues MLSS 2015 Big Data Programming 5

Managing Big Data When writng a program with these tools You don t know the size of the data You don t know the extent of the parallelism Both try to collocate the computaton with the data Parallelize the I/O Make the I/O local (versus across network) Data is oeen unstructured (vs. relatonal model) MLSS 2015 Big Data Programming 6

Big Data vs. RelaTonal RDBMS normalizaton Goal is to remove redundancy and retain/insure integrity Big data apps want reads to be local Send the code to the data, as it much smaller (Jim Gray) NormalizaTon makes read non- local Processing examines one input record at a Tme Minimal state in programs it s in the data MLSS 2015 Big Data Programming 7

Big Data Tools This all sounds great. What are the issues? CoordinaTng the distributed computaton Handling partal failures Combining the results of distributed computaton Tools offer a programming model that abstracts Disk read and write ParallelizaTon (computaton and I/O) Combining data (keys and values) MLSS 2015 Big Data Programming 8

Hadoop Open source project, supported by Yahoo Based on work done inside Google MapReduce, GFS Paper: hbp://research.google.com/archive/mapreduce.html Hadoop implements the MapReduce programming paradigm Provides a template for programs HDFS Hadoop Distributed File System MLSS 2015 Big Data Programming 9

Unstructured Data Tom White, in Hadoop: The Defini/ve Guide MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing /me. In other words, the input keys and values for MapReduce are not intrinsic proper/es of the data, but they are chosen by the persona analyzing the data. MLSS 2015 Big Data Programming 10

MapReduce What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments one input line/record For the most part, map and reduce tasks are stateless Write once, read multple Tmes Data Warehouse has this intended usage (write once) Unstructured data vs. structured/normalized Data pipelines (chains of map- reduce tasks) are common MLSS 2015 Big Data Programming 11

MapReduce Table 1-1, Hadoop The DefiniTve Guide Tradi&onal RDBMS MapReduce Data Size Gigabytes Petabytes Access InteracTve and batch Batch Updates Read and write many Tmes Write once, read many Structure StaTc schema Dynamic schema Integrity High Low Scaling Nonlinear Linear MLSS 2015 Big Data Programming 12

Structure of MapReduce HDFS stores a data file into multple parts (parttons) Each partton can be located on a different machine A map- reduce program defines Map task Reduce task A map task reads and processes one or more parttons A reduce task writes one partton of a file MLSS 2015 Big Data Programming 13

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as the shuffle, as each reduce task is fed by many map tasks. The MapReduce in Hadoop Figure 2.4, Hadoop - The DefiniTve Guide shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort on page 208. Figure 2-4. MapReduce data flow with multiple reduce tasks MLSS 2015 Big Data Programming 14

Map FuncTon Map input is a stream of key/value pairs Web logs: Server name (key), log entry (value) Sensor reading: sensor ID (key), sensed values (value) Record number (key), record (value) Map functon processes each input pair in turn For each input pair, the map functon can (but isn t required to) emit one or more key/value pairs MLSS 2015 Big Data Programming 15

Reduce FuncTon Reduce input is a stream of key/value- list pairs These are the key value pairs emibed by the map functon Reduce functon processes each input pair in turn All key/value pairs with the same key sent to same reducer For each input pair, the reduce functon can (but isn t required) to emit a key/value pair MLSS 2015 Big Data Programming 16

WordCount Example MulTple text files of arbitrary size, or An arbitrary number of documents Count the occurrences of all the words in the input Output: word1, count word2, count MLSS 2015 Big Data Programming 17

WordCount Example - Map Map task input is a stream of key/value pairs Document ID (key), document (value) Map task processes each input pair in turn Extract each word from the document Emits a key/value pair for each word: <the- word, 1> Each map task emits multple key/value pairs For all words in the documents received as input MLSS 2015 Big Data Programming 18

WordCount Example - Reduce Reduce task input is a stream of key/value- list pairs These are the key value pairs emibed by map tasks Key is a text string (the word), value is a list of counts Reduce task processes each input pair in turn Sums the values in the value- list For each input, the reduce task emits a key/value pair Key is a text string (the word), value is the total count MLSS 2015 Big Data Programming 19

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as the shuffle, as each reduce task is fed by many map tasks. The MapReduce in Hadoop Figure 2.4, Hadoop - The DefiniTve Guide shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort on page 208. Figure 2-4. MapReduce data flow with multiple reduce tasks MLSS 2015 Big Data Programming 20

WordCount Example Code Hadoop map- reduce code example MLSS 2015 Big Data Programming 21

MapReduce Design PaBerns SummarizaTon Filtering Data OrganizaTon ParTToning/binning, sortng, shuffle Joins Merging data sets Meta- paberns OpTmizing map- reduce chains (data pipelines) MLSS 2015 Big Data Programming 22

Issues with MapReduce One template : map, then reduce HDFS is its own file system In a data pipeline, each map- reduce step Reads all input data from disk Writes all output data to disk Even if output is just an intermediate result Addresses failure handling with replicated data Can help performance though MLSS 2015 Big Data Programming 23

Apache Spark Open source project out of AMPLab at UC Berkeley A Spark program defines: TransformaTons and actons on data sets Data flow, or lineage graph among data sets, induced by the transformatons Data sets in Spark are called RDDs Resilient Distributed Datasets MLSS 2015 Big Data Programming 24

Spark Features Provide domain specific libraries Example: map- reduce library Promotes functonal programming model Access to multple data (file) systems Local, HDFS, Cassandra, S3, database tables, Lazy evaluaton, and caching for performance Reduce or eliminate disk I/O Support mult- stage and iteratve apps MLSS 2015 Big Data Programming 25

Spark RDDs Resilient Distributed Dataset One RDD has one or more partton ParTTons are distributed across machines Rebuilt from base data on failure (versus replicaton) Lazy evaluaton created on demand RDD types offer various functons map, reduce groupby, reducebykey joins (inner, leeouterjoin, rightouterjoin) filter, sample MLSS 2015 Big Data Programming 26

Spark Provides a higher level of abstracton for coding MulT- stage map- reduce pipeline in Hadoop Can be composed functons in Spark RDD support and libraries Spark SQL RDD representng relatonal table Streaming data D- Stream, TwiBer stream Graph data GraphX MLSS 2015 Big Data Programming 27

Spark Stack Learning Spark, Figure 1-1. MLSS 2015 Big Data Programming 28

Spark Programs A Spark program defines: RDDs Input from external sources Produced by a transformaton TransformaTons Produce a new RDD from the input RDD AcTons Compute something from the input RDD Return non- RDD objects (e.g., number) Write an RDD to external storage MLSS 2015 Big Data Programming 29

Spark TransformaTons TransformaTons take functons as arguments This functon defines the details of the transform FuncTon is applied to each element of the RDD Filter rdd2 = rdd1.filter(fn) fn is a boolean functon applied to elements of rdd1 Evaluates to true, element is in rdd2 Union rdd3 = rdd1.union(rdd2) rdd3 has all elements of rdd1 and rdd2 MLSS 2015 Big Data Programming 30

Spark TransformaTons Map rdd2 = rdd1.map(fn) fn is applied to each element of rdd1 Flat map rdd2 = rdd1.flatmap(fn). fn returns a list for each element of rdd1 rdd2 is the concatenaton of these lists MLSS 2015 Big Data Programming 31

Spark AcTons TransformaTons do not initate computaton (lazy eval) AcTons on RDDs initate computaton CounTng rdd.count() Extract some elements rdd1.take(10) Get all elements rdd.collect() Careful, as this assembles the entre RDD. Filter first. MLSS 2015 Big Data Programming 32

Spark AcTons Each acton on an RDD causes it to be recomputed To use an RDD for more than one acton We tell Spark to persist the RDD MulTple levels of persistence Memory only Memory and disk Disk only Can result in big performance wins over map- reduce MLSS 2015 Big Data Programming 33

Spark Code Example WordCount again MLSS 2015 Big Data Programming 34

Spark Other Stuff TransformaTons allow the program to control the amount of parallelism Specify the parttons in the RDD By default, parttons in output RDD same as input RDD The Spark PairRDD (key/value) Provides map- reduce functonality MLSS 2015 Big Data Programming 35

Machine Learning on Big Data Mahout Algorithms implemented using map- reduce As of last year (April 2014), moving to Spark See: hbp://mahout.apache.org/ Spark library for machine learning: MLlib hbp://spark.apache.org/docs/1.1.1/mllib- guide.html H 2 O: Another Spark based machine learning tool hbp://oxdata.com MenTon of possible integraton with Mahout MLSS 2015 Big Data Programming 36

Hadoop Ecosystem thebigdatablog.weebly.com MLSS 2015 Big Data Programming 37

Resources for Hadoop Hadoop: The Defini/ve Guide, 3rd Edi/on, by Tom White O Reilly Media Print ISBN: 978-1- 4493-1152- 0 ISBN 10: 1-4493- 1152-0 Ebook ISBN: 978-1- 4493-1151- 3 ISBN 10: 1-4493- 1151-2 MapReduce Design PaLerns, by Donald Miner and Adam Shook O Reilly Media Print ISBN: 978-1- 4493-2717- 0 ISBN 10: 1-4493- 2717-6 Ebook ISBN: 978-1- 4493-4197- 8 ISBN 10: 1-4493- 4197-7 hbp://hadoop.apache.org/ Several vendors provide Hadoop distributons Amazon Web Services ElasTcMapReduce MLSS 2015 Big Data Programming 38

Resources for Spark Learning Spark, (early release) by Holden Karau, Andy Konwinsky, Patrick Wendell, Matei Zaharia O Reilly Media Print ISBN: 978-1- 4493-5862- 4 ISBN 10: 1-4493- 5862-4 Ebook ISBN: 978-1- 4493-5860- 0 ISBN 10: 1-4493- 5860-8 hbp://spark.apache.org/ Can download a version that runs on your local machine Cloud services Spark on AWS DataBricks offers a cloud service Others will join the party MLSS 2015 Big Data Programming 39