
Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science, University of Hildesheim, Germany

Apache Spark

Outline
1. What is Spark
2. Working with Spark
2.1 Spark Context
2.2 Resilient Distributed Datasets
3. MLLib: Machine Learning with Spark

1. What is Spark

Spark Overview

Apache Spark is an open source framework for large scale data processing and analysis.

Main ideas:
- Processing occurs where the data resides
- Avoid moving data over the network
- Works with the data in memory

Technical details:
- Written in Scala
- Works seamlessly with Java and Python
- Developed at UC Berkeley

Apache Spark Stack

Data platform: distributed file system / database
  Ex.: HDFS, HBase, Cassandra
Execution environment: a single machine or a cluster
  Standalone, EC2, YARN, Mesos
Spark Core: the Spark API
Spark Ecosystem: libraries of common algorithms
  MLLib, GraphX, Spark Streaming

Apache Spark Ecosystem (diagram)

How to use Spark

Spark can be used through:
- The Spark Shell: available in Python and Scala; useful for learning the framework
- Spark Applications: available in Python, Java and Scala; for serious large scale processing

2. Working with Spark

Working with Spark

Working with Spark requires accessing a Spark Context:
- Main entry point to the Spark API
- Already preconfigured in the Shell

Most of the work in Spark is a set of operations on Resilient Distributed Datasets (RDDs):
- Main data abstraction
- The data used and generated by the application is stored as RDDs

Spark Interactive Shell (Python)

$ ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkContext available as sc, SQLContext available as sqlContext.
>>>

2.1 Spark Context

The Spark Context is the main entry point for Spark functionality. It represents the connection to a Spark cluster and:
- allows creating RDDs
- allows broadcasting variables on the cluster
- allows creating Accumulators

2.2 Resilient Distributed Datasets

Resilient Distributed Datasets (RDDs)

A Spark application stores its data as RDDs:
- Resilient: if data in memory is lost, it can be recreated (fault tolerance)
- Distributed: stored in memory across different machines
- Dataset: data coming from a file or generated by the application

A Spark program is about operations on RDDs. RDDs are immutable: operations on RDDs may create new RDDs but never change them.

(diagram: an RDD and its elements)

RDD elements can be stored on different machines (transparent to the developer) and can have various types.

RDD Data Types

An element of an RDD can be of any type as long as it is serializable. Examples:
- Primitive types: integers, characters, strings, floating point numbers, ...
- Sequences: lists, arrays, tuples, ...
- Pair RDDs: key-value pairs
- Serializable Scala/Java objects

A single RDD may have elements of different types. Some specific element types have additional functionality.

Example: Text File to RDD

File mydiary.txt:
I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

The RDD "my" holds one element per line of the file.

RDD Operations

There are two types of RDD operations:

Actions: return a value based on the RDD. Examples:
- count(): returns the number of elements in the RDD
- first(): returns the first element in the RDD
- take(n): returns an array with the first n elements in the RDD
- reduce(f): aggregates the elements of the RDD using a binary function f

Transformations: create a new RDD based on the current one. Examples:
- filter: returns the elements of an RDD which match a given criterion
- map: applies a particular function to each RDD element
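These semantics can be mimicked locally with a plain Python list standing in for an RDD. The sketch below is only an analogy (the variable names are made up here, and real RDDs are partitioned across machines and evaluated lazily):

```python
# Local analogy: a Python list stands in for an RDD.
# Real RDDs are distributed and lazy; this only illustrates
# the action vs. transformation semantics.
lines = [
    "I had breakfast this morning.",
    "The coffee was really good.",
    "I didn't like the bread though.",
    "But I had cheese.",
    "Oh I love cheese.",
]

# Actions return plain values:
count = len(lines)      # ~ my.count()
first = lines[0]        # ~ my.first()
take2 = lines[:2]       # ~ my.take(2)

# Transformations build a new dataset and leave the old one untouched:
filtered = [l for l in lines if "I " in l]  # ~ my.filter(...)

print(count)          # 5
print(len(filtered))  # 4
```

Note that `filtered` is a new collection; `lines` is unchanged, mirroring RDD immutability.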

Actions vs. Transformations (diagram): an action maps an RDD to a value; a transformation maps a base RDD to a new RDD.

Actions Examples

RDD "my" (from mydiary.txt):
I had breakfast this morning.
The coffee was really good.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

>>> my = sc.textFile("mydiary.txt")
>>> my.count()
5
>>> my.first()
u'I had breakfast this morning.'
>>> my.take(2)
[u'I had breakfast this morning.', u'The coffee was really good.']

Transformation Examples

filter(lambda line: "I " in line) applied to "my" yields the RDD "filtered":
I had breakfast this morning.
I didn't like the bread though.
But I had cheese.
Oh I love cheese.

map(lambda line: line.upper()) applied to "filtered" yields the RDD "filterMap":
I HAD BREAKFAST THIS MORNING.
I DIDN'T LIKE THE BREAD THOUGH.
BUT I HAD CHEESE.
OH I LOVE CHEESE.

Transformations Examples

>>> filtered = my.filter(lambda line: "I " in line)
>>> filtered.count()
4
>>> filtered.take(4)
[u'I had breakfast this morning.', u"I didn't like the bread though.", u'But I had cheese.', u'Oh I love cheese.']
>>> filterMap = filtered.map(lambda line: line.upper())
>>> filterMap.count()
4
>>> filterMap.take(4)
[u'I HAD BREAKFAST THIS MORNING.', u"I DIDN'T LIKE THE BREAD THOUGH.", u'BUT I HAD CHEESE.', u'OH I LOVE CHEESE.']

Operations on Specific Types

Numeric RDDs have special operations: mean(), min(), max(), stdev(), ...

>>> linelens = my.map(lambda line: len(line))
>>> linelens.collect()
[29, 27, 31, 17, 17]
>>> linelens.mean()
24.2
>>> linelens.min()
17
>>> linelens.max()
31
>>> linelens.stdev()
6.0133185513491636

Operations on Key-Value Pairs

Pair RDDs contain two-element tuples: (K, V). Keys and values can be of any type. They are extremely useful for implementing MapReduce algorithms.

Examples of operations:
- groupByKey
- reduceByKey
- aggregateByKey
- sortByKey
- ...
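As a rough local analogue (plain Python over hypothetical pairs, not the Spark API): groupByKey collects all values per key into a list, reduceByKey combines them two at a time into a single value, and sortByKey orders the pairs by key:

```python
from collections import defaultdict

# Hypothetical (key, value) pairs; in Spark these would live in a pair RDD.
pairs = [("love", 1), ("rain", 1), ("love", 1), ("crying", 1), ("love", 1)]

# ~ groupByKey: key -> list of all its values
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# ~ reduceByKey(lambda x, y: x + y): key -> values combined two at a time
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v

# ~ sortByKey: pairs ordered by key
by_key = sorted(reduced.items())
```

The difference matters at scale: reduceByKey can combine values locally on each machine before shuffling, while groupByKey must ship every value across the network.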

Word Count Example

Map:
Input: document-word list pairs
Output: word-count pairs
(d_k, ⟨w_1, ..., w_m⟩) → [(w_i, c_i)]

Reduce:
Input: word-(count list) pairs
Output: word-count pairs
(w_i, [c_i]) → (w_i, Σ_{c ∈ [c_i]} c)
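The two phases can be written out in plain Python (a sketch with made-up sample documents, not Spark code):

```python
from collections import defaultdict

# (d_k, word list) input pairs -- sample documents for illustration only.
docs = [("d1", "love ain't no stranger"), ("d2", "crying in the rain")]

# Map phase: (d_k, <w_1, ..., w_m>) -> [(w_i, 1)]
mapped = [(w, 1) for _, text in docs for w in text.split()]

# Shuffle: group the counts by word, giving (w_i, [c_i])
groups = defaultdict(list)
for w, c in mapped:
    groups[w].append(c)

# Reduce phase: (w_i, [c_i]) -> (w_i, sum of the c_i)
counts = {w: sum(cs) for w, cs in groups.items()}
```

In a real MapReduce system the shuffle step is done by the framework, which routes all pairs with the same key to the same reducer.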

Word Count Example (diagram)

Input documents:
(d1, "love ain't no stranger")
(d2, "crying in the rain")
(d3, "looking for love")
(d4, "I'm crying")
(d5, "the deeper the love")
(d6, "is this love")
(d7, "Ain't no love")

Mappers emit (word, 1) pairs; reducers sum the counts per word, e.g. (love, 5), (crying, 2), (ain't, 2), (stranger, 1), (rain, 1), (looking, 1), (deeper, 1), (this, 1).

Word Count on Spark

flatMap splits each line of "my" into words, map emits a (word, 1) pair per word, and reduceByKey sums the counts per word:

>>> counts = my.flatMap(lambda line: line.split()) \
...            .map(lambda word: (word, 1)) \
...            .reduceByKey(lambda x, y: x + y)

ReduceByKey

.reduceByKey(lambda x, y: x + y)

reduceByKey works a little differently from the MapReduce reduce function. The function f it takes has two arguments and combines two values at a time associated with the same key. This function:
- must be commutative: f(x, y) = f(y, x)
- must be associative: f(f(x, y), z) = f(x, f(y, z))

Spark does not guarantee in which order the combining function is applied!
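Why these properties matter can be checked locally with functools.reduce (plain Python, illustrative values): addition yields the same result for every combination order, while subtraction, which is neither commutative nor associative, yields order-dependent results and would therefore be unsafe to pass to reduceByKey:

```python
from functools import reduce
from itertools import permutations

values = [3, 1, 4, 1, 5]  # illustrative values for a single key

# Addition is commutative and associative:
# every combination order yields the same result, 14.
sums = {reduce(lambda x, y: x + y, p) for p in permutations(values)}

# Subtraction is neither: the result depends on the order in which
# the values are combined, so it must never be used with reduceByKey.
diffs = {reduce(lambda x, y: x - y, p) for p in permutations(values)}

print(sums)        # {14}
print(len(diffs))  # several distinct results
```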

Considerations

Spark provides a much more efficient MapReduce implementation than Hadoop:
- Higher level API
- In-memory storage (less I/O overhead)
- Chaining MapReduce operations is simplified: a sequence of MapReduce passes can be done in one job

(chart: Spark vs. Hadoop running time for training a logistic regression model; source: Apache Spark, https://spark.apache.org/)

3. MLLib: Machine Learning with Spark

Overview

MLLib is a Spark machine learning library containing implementations for:
- Computing basic statistics from datasets
- Classification and regression
- Collaborative filtering
- Clustering
- Feature extraction and dimensionality reduction
- Frequent pattern mining
- Optimization algorithms

Logistic Regression with MLLib

Import the necessary packages:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD

Read the data (LibSVM format):

dataset = MLUtils.loadLibSVMFile(sc, "/mllib/sample_libsvm_data.txt")

Logistic Regression with MLLib

Train the model:

model = LogisticRegressionWithSGD.train(dataset)

Evaluate:

labelsAndPreds = dataset.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(dataset.count())
print("Training Error = " + str(trainErr))