Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Similar documents

Spark and the Big Data Library

Introduction to Big Data with Apache Spark UC BERKELEY

Unified Big Data Processing with Apache Spark. Matei

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Architectures for massive data management

MLlib: Scalable Machine Learning on Spark

Big Data Analytics with Cassandra, Spark & MLLib

Big Data Analytics. Lucas Rego Drumond

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Moving From Hadoop to Spark

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Unified Big Data Analytics Pipeline. 连城

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop Ecosystem B Y R A H I M A.

Big Data Frameworks: Scala and Spark Tutorial

Scaling Out With Apache Spark. DTL Meeting Slides based on

Beyond Hadoop with Apache Spark and BDAS

Apache Spark and Distributed Programming

Introduction to Spark

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Real Time Data Processing using Spark Streaming

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data Analytics Hadoop and Spark

Spark. Fast, Interactive, Language- Integrated Cluster Computing

HiBench Introduction. Carson Wang Software & Services Group

Apache Flink Next-gen data analysis. Kostas

How To Create A Data Visualization With Apache Spark And Zeppelin

Streaming items through a cluster with Spark Streaming

Data Science in the Wild

Spark: Cluster Computing with Working Sets

Hadoop & Spark Using Amazon EMR

Big Data and Scripting Systems beyond Hadoop

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

How Companies are! Using Spark

Introduc8on to Apache Spark

Big Data and Data Science: Behind the Buzz Words

Spark: Making Big Data Interactive & Real-Time

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Bayesian networks - Time-series models - Apache Spark & Scala

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

What s Cooking in KNIME

Big Data and Scripting Systems build on top of Hadoop

The Internet of Things and Big Data: Intro

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Workshop on Hadoop with Big Data

Ali Ghodsi Head of PM and Engineering Databricks

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

SEIZE THE DATA SEIZE THE DATA. 2015

BIG DATA APPLICATIONS

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

BIG DATA What it is and how to use?

Databricks. A Primer

Spark Application Carousel. Spark Summit East 2015

WHITE PAPER. Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets INTELLIGENT BUSINESS STRATEGIES

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

HADOOP. Revised 10/19/2015

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

Databricks. A Primer

This is a brief tutorial that explains the basics of Spark SQL programming.

Shark Installation Guide Week 3 Report. Ankush Arora

Brave New World: Hadoop vs. Spark

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Hadoop Job Oriented Training Agenda

Certification Study Guide. MapR Certified Spark Developer Study Guide

Advanced Big Data Analytics with R and Hadoop

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Dell In-Memory Appliance for Cloudera Enterprise

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

COMP9321 Web Application Engineering

TRAINING PROGRAM ON BIGDATA/HADOOP

Conquering Big Data with BDAS (Berkeley Data Analytics)

Write Once, Run Anywhere Pat McDonough

Big Data and Analytics: Challenges and Opportunities

From Spark to Ignition:

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Map-Reduce for Machine Learning on Multicore

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

ANALYTICS CENTER LEARNING PROGRAM

Big Data Research in the AMPLab: BDAS and Beyond

Using Apache Spark Pat McDonough - Databricks

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Apache Hama Design Document v0.6

Managing large clusters resources

Hadoop: The Definitive Guide

Big Data With Hadoop

CSE-E5430 Scalable Cloud Computing Lecture 11

Transcription:

Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic

About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the last few years Did PhD at UCL in Recommender Systems

Overview Introduction to Spark Architecture and internals Setup and basic config Spark SQL Spart MLLIB Spark Streaming Oscarbao Analytics platform Data Visualisation

What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop Efficient Usable Up to 10x faster, 100x in memory General execution graphs In-memory storage 2-5x less code Rich API in Java, Scala and Python Interactive shell

Apache Spark Spark SQL Spark Streaming MLLIB GraphX Apache Spark

Components

Spark Internals Spark client Spark worker Spark Context RDD Graph Scheduler Block Tracker Shuffle Tracker Task Threads Block Manager HDFS, HBase, Hive

Cluster managers Standalone Ideal for running jobs locally Uses Spark s own resource manager Apache Mesos Dynamic partitioning between Spark and others Scalable partitioning between multiple instances of Spark Hadoop YARN Integrate well with other components of the Hadoop ecosystem Supported by all Hadoop vendors (e.g Cloudera, Hortonworks)

Spark Setup Locally Get prebuild version from (http://spark.apache.org/) Setup configuration file (in./conf/spark-env.sh) Start spark master and worker (./sbin/start-all.sh) Start spark interactive shell Scala: (./bin/spark-shell -master spark:// ) Python (./bin/pyspark -master spark://..) EC2 start-up scripts are available (clone https://github.com/apache/spark)

Example Job sc = SparkContext( spark://, myjob, home, jars) file = sc.textfile( hdfs:// ) Resilient distributed datasets (RDD) errors = file.filter(lambda line: line.startswith( Error )) errors.cache() errors.count() Action

Basic Operations - Transformations map(func) data.map(lambda a: a+1) filter(func) Data.filter(lambda a: a>1) mappartitions(func) Runs separately on each partition (block) of the RDD groupbykey() When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs reducebykey(func) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func join(otherdataset) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key

Basic Operations - Actions reduce(func) data.reduce(lambda a,b : a+b) collect() take(n) Return all the elements of the dataset as an array at the driver program. Return an array with the first n elements of the dataset. saveastextfile(path) Write the elements of the dataset as a text file.

Broadcast Variables and Accumulators Broadcast variables To Keep a read-only variable cached on each machine rather than shipping a copy of it with tasks broadcastvar = sc.broadcast([1,2,3]) Accumulators Variables that are only added to through an associative operation and can therefore be efficiently supported in parallel accum = sc.accumulator(0) sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))

Spark SQL Enables loading and querying structured data in Spark From Hive: c = HiveContext(sc) rows = c.sql( select date, vale from hivetable ) rows.filter(lambda r: r.value > 2013).collect() From JSON: c.jsonfile( account_info.json ).registerastable( accounts ) c.sql( select surename, address.postcode from accounts ) account_info.json: { first_name : Tamas surname : Jambor address : { postcode : N7 9UP } }

Spark SQL Integrates closely with Spark s language APIs c.registerfunction( hasspark, lambda text: Spark in text) c.sql( select * from accounts where hasspark(text) ) Uniform interface for data access Python Scala Java Spark SQL Hive Parquet JSON Cassandra

Spark MLLIB Standard library of distributed machine learning algorithms Provides some of the most common machine learning algorithms Basic Statistics Classification and regression Linear models Naïve Bayes Collaborative Filtering Clustering Dimensionality reduction Optimisers

Spark MLLIB Now includes 15+ algorithms New in 1.0 Decision trees SVD, PCA, L-BFGS (limited memory parameter estimation method) New in 1.1: Non-negative matrix factorization, ALS Support for common statistical functions Sampling, correlations Statistical hypothesis testing New in 1.2 (to be released): Random forest

Spark MLLIB Example (Classification) from pyspark.mllib.classification import LogisticRegressionWithSGD from pyspark.mllib.regression import LabeledPoint from numpy import array # Load and parse the data def parsepoint(line): values = [float(x) for x in line.split(' ')] return LabeledPoint(values[0], values[1:]) data = sc.textfile("data/mllib/sample_svm_data.txt") parseddata = data.map(parsepoint) # Build the model model = LogisticRegressionWithSGD.train(parsedData) # Evaluating the model on training data labelsandpreds = parseddata.map(lambda p: (p.label, model.predict(p.features))) trainerr = labelsandpreds.filter(lambda (v, p): v!= p).count() / (parseddata.count()) print("training Error = " + (trainerr))

Spark MLLIB Example (Clustering) from pyspark.mllib.clustering import Kmeans from numpy import array from math import sqrt # Load and parse the data data = sc.textfile("data/mllib/kmeans_data.txt") parseddata = data.map(lambda line: array([(x) for x in line.split(' ')])) # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxiterations=10, runs=10, initializationmode="random") # Evaluate clustering by computing Within Set Sum of Squared Errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(([x**2 for x in (point - center)])) WSSSE = parseddata.map(lambda point: error(point)).reduce(lambda x, y: x + y) print("within Set Sum of Squared Error = " + (WSSSE))

Spark Streaming Many big-data applications need to process large data streams in real-time Web-site monitoring Fraud detection Batch model for iterative ML algorithms and data processing Extends Spark for doing big data stream processing Efficiently recover from failures

Integration with Batch Processing Many environments require processing same data in live streaming as well as batch post-processing Existing frameworks cannot do both Either, stream processing of 100s of MB/s with low latency Or, batch processing of TBs of data with high latency

Spark Streaming Workflow

Spark Streaming Workflow

Spark Streaming Example (Scala) import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.streamingcontext._ val conf = new SparkConf().setMaster( spark:// ").setappname("networkwordcount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination()

Spark Streaming Word count

Demo

Oscar BAO Business analytics optimisation Automates certain data science processes Runs on Spark Visualisation Interface

Oscarbao Architecture

Modules Understanding Data exploration Rules engine Cognition Predicting Predictive analytics Text mining Personalisation Optimising Algorithm selection engine Recommended algorithm Optimisation and control

Big Data Workflow for Predictive Analytics Data Ingestion Parsing Cleansing Data Preprocessing Transformation Statistics Model Training Parameter tuning Performance Algorithm Selection Evaluation Best model for the purpose Visualisation Representing results Insights Reuse Store model Apply to new data

Demo