Distributed Computing" with Open-Source Software

Similar documents

Spark and the Big Data Library

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Unified Big Data Processing with Apache Spark. Matei

Spark: Making Big Data Interactive & Real-Time

Introduc8on to Apache Spark

Spark: Cluster Computing with Working Sets

Scaling Out With Apache Spark. DTL Meeting Slides based on

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Unified Big Data Analytics Pipeline. 连城

Data Science in the Wild

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Brave New World: Hadoop vs. Spark

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Beyond Hadoop with Apache Spark and BDAS

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Big Data Analytics Hadoop and Spark

Real Time Data Processing using Spark Streaming

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Architectures for massive data management

Big Data Analytics. Lucas Rego Drumond

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

How Companies are! Using Spark

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Big Data Processing. Patrick Wendell Databricks

MLlib: Scalable Machine Learning on Spark

Introduction to Spark

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Hadoop Ecosystem B Y R A H I M A.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Moving From Hadoop to Spark

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Conquering Big Data with BDAS (Berkeley Data Analytics)

This is a brief tutorial that explains the basics of Spark SQL programming.

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Conquering Big Data with Apache Spark

Big Data With Hadoop

Shark Installation Guide Week 3 Report. Ankush Arora

Using Apache Spark Pat McDonough - Databricks

How To Create A Data Visualization With Apache Spark And Zeppelin

The Stratosphere Big Data Analytics Platform

Hybrid Software Architectures for Big

Apache Spark and Distributed Programming

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

The Internet of Things and Big Data: Intro

Ali Ghodsi Head of PM and Engineering Databricks

Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses

CSE-E5430 Scalable Cloud Computing Lecture 11

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Streaming items through a cluster with Spark Streaming

BIG DATA APPLICATIONS

Apache Flink Next-gen data analysis. Kostas

Spark and Shark: High-speed In-memory Analytics over Hadoop Data

HiBench Introduction. Carson Wang Software & Services Group

Big Data and Scripting Systems beyond Hadoop

Spark Application Carousel. Spark Summit East 2015

Big Data Frameworks: Scala and Spark Tutorial

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Oracle Big Data Fundamentals Ed 1 NEW

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks

Hadoop & Spark Using Amazon EMR

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

From GWS to MapReduce: Google s Cloud Technology in the Early Days

WHITE PAPER. Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets INTELLIGENT BUSINESS STRATEGIES

Databricks. A Primer

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data and Scripting Systems build on top of Hadoop

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Project DALKIT (informal working title)

Outils informatiques pour le Big Data en astronomie

Databricks. A Primer

Journée Thématique Big Data 13/03/2015

In Memory Accelerator for MongoDB

Introduction to Big Data with Apache Spark UC BERKELEY

Large scale processing using Hadoop. Ján Vaňo

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Energy Efficient MapReduce

Apache Hadoop. Alexandru Costan

Unlocking the True Value of Hadoop with Open Data Science

Hadoop Architecture. Part 1

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Cloud Computing and Big Data Processing

Dell In-Memory Appliance for Cloudera Enterprise

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Transcription:

Distributed Computing" with Open-Source Software Reza Zadeh Presented at Infosys OSSmosis

Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and web industry How do we program these things?

Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Machine Learning Example Current State of Spark Ecosystem

Data flow vs. Traditional network programming

Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How to split problem across nodes? Must consider network & data locality» How to deal with failures? (inevitable at scale)» Even worse: stragglers» Ethernet networking not fast» Have to write programs per machine Rarely used in commodity datacenters

Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators» System picks how to split each operator into tasks and where to run each task» Run parts twice fault recovery Biggest example: MapReduce Map Map Map Reduce Reduce

Example MapReduce Algorithms Matrix-vector multiplication Power iteration (e.g. PageRank) Gradient descent methods Stochastic SVD Tall skinny QR Many others!

Why Use a Data Flow Engine? Ease of programming» High-level functions instead of message passing Wide deployment» More common than MPI, especially near data Scalability to very largest clusters Examples: " Pig, Hive,Scalding, Storm

Limitations of MapReduce

Limitations of MapReduce MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms No efficient primitives for data sharing» State between steps goes to distributed file system» Slow due to replication & disk storage

Example: Iterative Apps Input file system" read file system" write file system" read file system" write iter. 1 iter. 2... file system" read query 1 query 2 result 1 result 2 Input query 3... result 3 Commonly spend 90% of time doing I/O

Result While MapReduce is simple, it can require asymptotically more communication or I/O

Spark computing engine

Spark Computing Engine Extends a programming language with a distributed collection data-structure» Resilient distributed datasets (RDD) Open source at Apache» Most active community in big data, " with 50+ companies contributing Clean APIs in Java, Scala, Python Community: SparkR

!! Resilient Distributed Datasets (RDDs) Main idea: Resilient Distributed Datasets» Immutable collections of objects, spread across cluster» Statically typed: RDD[T] has objects of type T val sc = new SparkContext()! val lines = sc.textfile("log.txt") // RDD[String]! // Transform using standard collection operations! val errors = lines.filter(_.startswith("error"))! val messages = errors.map(_.split( \t )(2))! messages.saveastextfile("errors.txt")! lazily evaluated kicks off a computation

Key Idea Resilient Distributed Datasets (RDDs)» Collections of objects across a cluster with user controlled partitioning & storage (memory, disk,...)» Built via parallel transformations (map, filter, )» The world only lets you make make RDDs such that they can be: Automatically rebuilt on failure

Python, Java, Scala, R // Scala: val lines = sc.textfile(...) lines.filter(x => x.contains( ERROR )).count() // Java (better in java8!): JavaRDD<String> lines = sc.textfile(...); lines.filter(new Function<String, Boolean>() { Boolean call(string s) { return s.contains( error ); } }).count();

Fault Tolerance RDDs track lineage info to rebuild lost data file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y).filter(lambda (type, count): count > 10) map reduce Input file

Fault Tolerance RDDs track lineage info to rebuild lost data file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y).filter(lambda (type, count): count > 10) map reduce Input file

Machine Learning example

Logistic Regression data = spark.textfile(...).map(readpoint).cache() w = numpy.random.rand(d) for i in range(iterations): gradient = data.map(lambda p: (1 / (1 + exp(- p.y * w.dot(p.x)))) * p.y * p.x ).reduce(lambda a, b: a + b) w - = gradient print Final w: %s % w

Logistic Regression Results Running Time (s) 4000 3500 3000 2500 2000 1500 1000 500 0 1 5 10 20 30 Number of Iterations first iteration 80 s further iterations 1 s 110 s / iteration Hadoop Spark 100 GB of data on 50 m1.xlarge EC2 machines

Behavior with Less RAM 100 Iteration time (s) 80 60 40 20 68.8 58.1 40.7 29.7 11.5 0 0% 25% 50% 75% 100% % of working set in memory

Benefit for Users Same engine performs data extraction, model training and interactive queries Separate engines DFS read parse DFS write DFS read train DFS write DFS read query DFS write Spark DFS read parse train query DFS

State of the Spark ecosystem

Spark Community Most active open source " community in big data 200+ developers, " 50+ companies contributing 150 100 50 Contributors in past year 0

Project Activity 1600 1400 1200 1000 Spark 800 600 400 200 MapReduce YARN HDFS Storm 0 Commits Activity in past 6 months

Continuing Growth Contributors per month to Spark source: ohloh.net

A General Platform Standard libraries included with Spark Spark SQL structured Spark Streaming" real-time GraphX graph MLlib machine learning Spark Core

Conclusion Data flow engines are becoming an important platform for numerical algorithms While early models like MapReduce were inefficient, new ones like Spark close this gap More info: spark.apache.org