Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com


Challenges of numerical computation over big data
When applying any algorithm to big data, watch for:
1. Correctness
2. Performance
3. Trade-off between accuracy and performance

Three Practical Examples
- Point estimation (variance)
- Approximate estimation (cardinality)
- Matrix operations (PageRank)
We use these examples to demonstrate Spark internals, data flow, and the challenges of implementing algorithms for big data.

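1. Big Data Variance
The plain variance formula requires two passes over the data:
$$\mathrm{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
First pass: compute the mean $\mu$. Second pass: sum the squared deviations from $\mu$.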

Fast but inaccurate solution
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2$$
This can be computed in a single pass, but it subtracts two very close and large numbers!
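To make the cancellation problem concrete, here is a minimal sketch in plain Scala (no Spark) comparing the two formulas. The object name `VarianceCancellation` and the sample data are illustrative assumptions, not part of the slides; the data has a small spread around a large offset, which is the case where the single-pass shortcut loses precision.

```scala
object VarianceCancellation {
  // Two-pass population variance: first pass for the mean,
  // second pass for the squared deviations from that mean.
  def twoPass(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // Single-pass shortcut E[X^2] - E[X]^2. Both terms are huge and
  // nearly equal, so their difference keeps very few correct digits.
  def naiveSinglePass(xs: Seq[Double]): Double = {
    val n = xs.size
    val sumOfSquares = xs.map(x => x * x).sum
    val mean = xs.sum / n
    sumOfSquares / n - mean * mean
  }

  // Illustrative data (an assumption): true population variance is 22.5.
  val sample: Seq[Double] = Seq(4.0, 7.0, 13.0, 16.0).map(_ + 1e9)
}
```

On this sample the two-pass result matches the true variance of 22.5, while the shortcut works with squares near 10^18, where adjacent doubles are more than a hundred apart, so it cannot represent 22.5 at all.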

Accumulator Pattern
An object that incrementally tracks the variance

    class RunningVar {
      var variance: Double = 0.0

      // Compute initial variance for numbers
      def this(numbers: Iterator[Double]) = {
        this() // an auxiliary constructor must call the primary constructor first
        numbers.foreach(this.add(_))
      }

      // Update variance for a single value
      def add(value: Double): Unit = { ... }
    }

Parallelize for performance
- Distribute adding values in the map phase
- Merge partial results in the reduce phase

    class RunningVar {
      ...
      // Merge another RunningVar object
      // and update variance
      def merge(other: RunningVar) = { ... }
    }
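The slides leave the bodies of `add` and `merge` elided. One possible way to fill them in, offered only as a sketch, is Welford's incremental update for `add` and the standard pairwise combination of partial results for `merge`. The fields `n`, `mean`, and `m2` and the `RunningVarDemo` object are assumptions made for illustration. Plain Scala collections stand in for RDD partitions here.

```scala
// Hypothetical completion of RunningVar; not necessarily the author's version.
class RunningVar {
  private var n: Long = 0L        // number of values seen so far
  private var mean: Double = 0.0  // running mean
  private var m2: Double = 0.0    // running sum of squared deviations from the mean

  // Build the accumulator from all values of one partition.
  def this(numbers: Iterator[Double]) = {
    this()
    numbers.foreach(add)
  }

  // Welford's update: adjust the mean, then accumulate the squared deviation.
  def add(value: Double): Unit = {
    n += 1
    val delta = value - mean
    mean += delta / n
    m2 += delta * (value - mean)
  }

  // Combine two partial results; returns this so it can be used in reduce.
  def merge(other: RunningVar): RunningVar = {
    if (other.n > 0) {
      if (n == 0) {
        n = other.n; mean = other.mean; m2 = other.m2
      } else {
        val delta = other.mean - mean
        val total = n + other.n
        m2 += other.m2 + delta * delta * n * other.n / total
        mean += delta * other.n / total
        n = total
      }
    }
    this
  }

  // Population variance (divides by N, as in the formula on the slides).
  def variance: Double = if (n == 0) Double.NaN else m2 / n
}

object RunningVarDemo {
  val data: Seq[Double] = Seq(4.0, 7.0, 13.0, 16.0).map(_ + 1e9)

  // Each group of two values plays the role of one partition.
  val merged: RunningVar =
    data.grouped(2).map(part => new RunningVar(part.iterator)).reduce((a, b) => a.merge(b))
}
```

Because `merge` returns the accumulator itself, the same class fits the `mapPartitions(...).reduce(...)` shape used with Spark: one `RunningVar` per partition, then pairwise merging.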

Computing Variance in Spark
Use the RunningVar in Spark:

    doubleRDD.mapPartitions(v => Iterator(new RunningVar(v)))
             .reduce((a, b) => a.merge(b))

Or simply use the Spark API:

    doubleRDD.variance()

2. Approximate Estimations
Often an approximate estimate is good enough, especially if it can be computed faster or more cheaply.
1. Trade accuracy for memory
2. Trade accuracy for running time
We especially like cases where the error has a bound that can be controlled.

Cardinality Problem
Example: count the number of unique words in Shakespeare's works.
- Using a HashSet requires ~10GB of memory
- This can be much worse in many real-world applications involving large strings, such as counting web visitors

Linear Probabilistic Counting
1. Allocate a bitmap of size m and initialize it to zero.
   A. Hash each value to a position in the bitmap
   B. Set the corresponding bit to 1
2. Count the number of empty bit entries: v
$$\text{count} \approx -m \ln \frac{v}{m}$$
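The steps above can be sketched in plain Scala as follows. The class name `LinearCounter`, the choice of `MurmurHash3` as the hash function, and the `LinearCounterDemo` helper are assumptions for illustration; this is not the `LPCounter` class referenced in the slides. The `merge` method assumes both counters were created with the same bitmap size `m`.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical linear probabilistic counter over strings.
class LinearCounter(m: Int) {
  // Step 1: a bitmap of size m, initially all zero.
  private val bits = new java.util.BitSet(m)

  // Steps 1A and 1B: hash the value to a position and set that bit.
  def add(value: String): Unit =
    bits.set(Math.floorMod(MurmurHash3.stringHash(value), m))

  // Two counters of the same size merge by OR-ing their bitmaps.
  def merge(other: LinearCounter): LinearCounter = {
    bits.or(other.bits)
    this
  }

  // Step 2: count the empty bits v and estimate -m * ln(v / m).
  def cardinality: Double = {
    val v = m - bits.cardinality()
    if (v == 0) Double.PositiveInfinity // bitmap saturated: m was too small
    else -m * math.log(v.toDouble / m)
  }
}

object LinearCounterDemo {
  // Feed all words through one counter and return the estimate.
  def estimate(words: Seq[String], m: Int): Double = {
    val counter = new LinearCounter(m)
    words.foreach(counter.add)
    counter.cardinality
  }
}
```

Memory use is fixed at m bits regardless of how long the strings are, and duplicates never change the bitmap. Accuracy degrades as the bitmap fills up, so m has to be chosen with the expected cardinality in mind.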

The Spark API
Use a linear probabilistic counter (LPCounter) in Spark:

    rdd.mapPartitions(v => Iterator(new LPCounter(v)))
       .reduce((a, b) => a.merge(b))
       .getCardinality

Or simply use the Spark API:

    myRDD.countApproxDistinct(0.01)

3. Google PageRank
A popular algorithm originally introduced by Google

PageRank Algorithm
Start each page with a rank of 1. On each iteration:
A. $\text{contrib} = \dfrac{\text{curRank}}{|\text{neighbors}|}$
B. $\text{curRank} = 0.15 + 0.85 \sum_{i \in \text{neighbors}} \text{contrib}_i$
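The iteration above can be sketched on ordinary Scala maps before moving to RDDs. The object name `PageRankSketch` and the three-page graph are illustrative assumptions. The sketch also assumes every page appears as a key in `links`, and it gives a page with no incoming links the base rank 0.15.

```scala
object PageRankSketch {
  // One iteration: every page sends rank / |neighbors| to each neighbor (step A),
  // then each page's rank becomes 0.15 + 0.85 * sum of received contributions (step B).
  def step(links: Map[String, Seq[String]],
           ranks: Map[String, Double]): Map[String, Double] = {
    val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
      case (page, neighbors) =>
        neighbors.map(dest => (dest, ranks(page) / neighbors.size))
    }
    val received: Map[String, Double] =
      contribs.groupBy(_._1).map { case (page, cs) => page -> cs.map(_._2).sum }
    ranks.map { case (page, _) => page -> (0.15 + 0.85 * received.getOrElse(page, 0.0)) }
  }

  // Start every page at rank 1.0 and apply the step a fixed number of times.
  def run(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    val initial = links.keys.map(_ -> 1.0).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => step(links, ranks))
  }

  // Hypothetical graph: a links to b and c, b links to c, c links to a.
  val links: Map[String, Seq[String]] =
    Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
}
```

Starting from all ranks at 1.0, one step on this graph gives roughly a = 1.0, b = 0.575, and c = 1.425. Since every page here has outgoing links, the ranks keep summing to the number of pages.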

PageRank Example
[Figure sequence: a graph of four pages. Every page starts with rank 1.0, and each page's contributions are shown on its outgoing edges. After the first iteration the ranks are 1.85, 0.58, 1.0, and 0.58; after the second they are 1.31, 0.39, 1.72, and 0.58; eventually they reach 1.44, 0.46, 1.37, and 0.73.]

PageRank as Matrix Multiplication
- The rank of each page is the probability that a random surfer on the web lands on that page
- The probabilities of visiting all pages after k steps are $V_k = A^k V^t$
  V: the initial rank vector
  A: the link structure (sparse matrix)
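To connect the matrix view with the per-page update rule, the following is a short worked form. It assumes, since the slides do not spell out the entries of A, that $A_{ij} = 1/|\text{neighbors}(j)|$ when page j links to page i and $A_{ij} = 0$ otherwise, and it writes $V_0$ for the initial rank vector.

$$(A V_k)_i = \sum_{j \,:\, j \to i} \frac{(V_k)_j}{|\text{neighbors}(j)|}$$

$$V_{k+1} = 0.15 \cdot \mathbf{1} + 0.85 \, A V_k$$

Under this assumption, the first line is exactly the sum of contributions a page receives in one iteration, and the second line is the damped update with constants 0.15 and 0.85. Dropping the damping term leaves $V_{k+1} = A V_k$, so k steps give $V_k = A^k V_0$, which is the repeated sparse matrix-vector multiplication the slide describes.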

Data Representation in Spark
- Each page is identified by its unique URL rather than an index
- Rank vector (V): RDD[(URL, Double)]
- Links matrix (A): RDD[(URL, List[URL])]

Spark Implementation

    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    for (i <- 1 to ITERATIONS) {
      // Each page sends rank / |neighbors| to every neighbor
      val contribs = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }
      // Sum contributions per page and apply the damping constants
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

Matrix Multiplication
- Repeatedly multiply a sparse matrix and a vector
- The same file is read over and over
[Figure: Links (url, neighbors) is re-read and combined with Ranks (url, rank) in iteration 1, iteration 2, and iteration 3.]

Spark can do much better
- Using cache(), keep neighbors in memory
- Do not write intermediate results to disk
[Figure: Links (url, neighbors) is joined with Ranks (url, rank) in every iteration; callout: grouping the same RDD over and over.]

Spark can do much better
- Do not partition neighbors every time: use partitionBy
[Figure: Links (url, neighbors) and Ranks (url, rank) are partitioned so that matching keys sit on the same node; the join is repeated in every iteration.]
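Why co-partitioning helps can be illustrated without Spark. In the sketch below, `partitionOf` follows the same rule as Spark's `HashPartitioner` (a non-negative `hashCode` modulo the number of partitions). The object name `CoPartitionSketch` and its helpers are assumptions for illustration, and in-memory maps stand in for nodes.

```scala
object CoPartitionSketch {
  // Non-negative hashCode modulo the partition count.
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Split keyed records into partitions using the shared hash rule.
  def partition[V](records: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    records.groupBy { case (key, _) => partitionOf(key, numPartitions) }

  // Join partition p of the left side only with partition p of the right side.
  // Because both sides used the same rule, no record has to move between partitions.
  def localJoin[A, B](left: Map[Int, Seq[(String, A)]],
                      right: Map[Int, Seq[(String, B)]]): Seq[(String, (A, B))] =
    left.toSeq.flatMap { case (p, leftRecords) =>
      val rightByKey = right.getOrElse(p, Seq.empty[(String, B)]).toMap
      leftRecords.flatMap { case (key, a) => rightByKey.get(key).map(b => (key, (a, b))) }
    }
}
```

When the links data is partitioned once and kept in memory, each iteration's join only needs the ranks to be arranged the same way; the larger links data stays where it is.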

Spark Implementation

    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    // partitionBy returns a new RDD, so keep the result and reuse it;
    // hashFunction must be a Partitioner
    val partitionedLinks = links.partitionBy(hashFunction).cache()

    for (i <- 1 to ITERATIONS) {
      val contribs = partitionedLinks.join(ranks).flatMap {
        case (url, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

Conclusions
When applying any algorithm to big data, watch for:
1. Correctness
2. Performance
   - Cache RDDs to avoid I/O
   - Avoid unnecessary computation
3. Trade-off between accuracy and performance