Tachyon: memory-speed data sharing




Ali Ghodsi, Haoyuan (HY) Li, Matei Zaharia, Scott Shenker, Ion Stoica
UC Berkeley

Memory trumps everything else
- RAM throughput increasing exponentially
- Disk throughput increasing slowly
- Memory locality is key to interactive response time

Realized by many
- Frameworks already leverage memory, e.g. Spark, Piccolo, GraphX

Example: fast in-memory data processing within a job
- Keep only one in-memory copy
- Track the lineage of operations used to derive data
- Upon failure, use lineage to recompute data
[Figure: lineage tracking inside the JVM across map, join, reduce, and filter operations]
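The lineage idea on this slide can be sketched in a few lines of plain Scala. This is a minimal illustration, not Spark's or Tachyon's implementation: the names `Dataset`, `Source`, `Derived`, and `evict` are hypothetical, and eviction stands in for a failure that loses the cached copy.

```scala
// Minimal sketch of lineage-based recovery (hypothetical names, not Spark's API).
// Each dataset remembers its parent and the operation that derived it, so a lost
// in-memory copy can be rebuilt instead of being restored from a replica.
trait Dataset[A] {
  private var cache: Option[Vector[A]] = None   // the single in-memory copy

  protected def compute(): Vector[A]            // replays one lineage step

  // Serve the cached copy, or rebuild it from lineage and cache the result.
  def data: Vector[A] = cache.getOrElse {
    val rebuilt = compute()
    cache = Some(rebuilt)
    rebuilt
  }

  def evict(): Unit = { cache = None }          // simulate losing the cached copy

  def map[B](f: A => B): Dataset[B] = new Derived[A, B](this, _.map(f))
  def filter(p: A => Boolean): Dataset[A] = new Derived[A, A](this, _.filter(p))
}

// Root of a lineage chain: knows how to load its data from stable storage.
final class Source[A](load: () => Vector[A]) extends Dataset[A] {
  protected def compute(): Vector[A] = load()
}

// A derived dataset: the parent plus the operation form its lineage.
final class Derived[P, A](parent: Dataset[P], op: Vector[P] => Vector[A]) extends Dataset[A] {
  protected def compute(): Vector[A] = op(parent.data)
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val source = new Source(() => Vector(1, 2, 3, 4))
    val derived = source.map(_ * 10).filter(_ > 20)
    println(derived.data)   // first access computes and caches the result
    derived.evict()         // lose the cached copy
    println(derived.data)   // rebuilt by replaying the filter over its parent
  }
}
```

Only one copy of each dataset is held in memory; fault tolerance comes from replaying the recorded operations rather than from replication.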

Challenge 1: JVM crash loses all cache
- Execution engine and storage engine run in the same JVM process
- When the JVM crashes, the blocks cached by the Spark memory block manager are lost; only the copies on HDFS/disk remain
[Figure, three builds: a Spark task and the Spark memory block manager cache block 3 in one JVM process above the blocks on HDFS/disk; the task crashes; the cached block is lost]

Challenge 2: JVM heap overhead
- GC and duplicate memory per job
- Execution engine and storage engine run in the same JVM process
[Figure: two Spark tasks, each with its own Spark memory block manager caching its own copy of block 3, above the blocks on HDFS/disk]

Challenge 3: sharing data means slow writes to disk
- Different jobs share data through slow writes to disk
- Different frameworks share data through slow writes to disk
- Storage engine and execution engine run in the same JVM process
[Figure, two builds: two Spark tasks, then a Spark task and Hadoop MR on YARN, sharing block 3 through HDFS/disk]

Tachyon
Reliable data sharing at memory speed within and across cluster frameworks and jobs

Challenge 1 revisited: JVM crash keeps the memory cache
- Blocks are cached in Tachyon's in-memory layer, on top of HDFS/disk
- When the Spark task crashes, the blocks held by Tachyon stay in memory
[Figure, three builds: a Spark task and the Spark memory block manager above Tachyon (in-memory, holding blocks 2, 3, and 4) and HDFS (disk); the task crashes; Tachyon still holds the blocks]

Challenge 2 revisited: off-heap memory storage
- No GC and one memory copy
[Figure: two Spark tasks, each with a small Spark memory region, above Tachyon (in-memory, holding blocks 2, 3, and 4) and HDFS (disk)]
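To make "off-heap" concrete, the sketch below stores serialized blocks in direct buffers obtained from `java.nio.ByteBuffer.allocateDirect`, whose contents live outside the garbage-collected heap. The class `OffHeapBlockStore` is a hypothetical illustration; it runs inside a single process and is not how Tachyon itself manages memory.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical block store: payloads are kept in direct (off-heap) buffers,
// so the garbage collector never scans or copies the block contents.
// Only the small index map and buffer handles live on the JVM heap.
final class OffHeapBlockStore {
  private val blocks = scala.collection.mutable.Map.empty[Int, ByteBuffer]

  def put(id: Int, bytes: Array[Byte]): Unit = {
    val buf = ByteBuffer.allocateDirect(bytes.length) // memory outside the GC-managed heap
    buf.put(bytes)
    buf.flip()                                        // switch the buffer from writing to reading
    blocks(id) = buf
  }

  def get(id: Int): Option[Array[Byte]] = blocks.get(id).map { buf =>
    val view = buf.duplicate()                        // own position, same underlying memory
    val out = new Array[Byte](view.remaining())
    view.get(out)                                     // copy the serialized bytes back on-heap
    out
  }
}

object OffHeapDemo {
  def main(args: Array[String]): Unit = {
    val store = new OffHeapBlockStore
    store.put(3, "block 3".getBytes(UTF_8))
    println(store.get(3).map(b => new String(b, UTF_8)))
  }
}
```

Reading a block copies it back onto the heap, which mirrors the serialized in-memory form mentioned elsewhere in the deck: the data is stored as bytes and deserialized on access.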

Challenge 3 revisited: different frameworks share at memory speed
- Fast writes instead of slow writes to disk
[Figure: a Spark task and Hadoop MR on YARN above Tachyon (in-memory, holding blocks 2, 3, and 4) and HDFS (disk)]

Tachyon and Spark
Spark's off-JVM-heap RDD store
- In-memory RDDs (serialized)
- Fault-tolerant cache
Enables
- Avoiding GC overhead
- Fine-grained executors
- Fast RDD sharing

Tachyon research vision
Vision
- Push lineage down to the storage layer
- Use memory aggressively
Approach
- One in-memory copy
- Rely on recomputation for fault tolerance

Architecture

Comparison with in-memory HDFS

Further Improve Spark's Performance: Grep

Master Faster Recovery

Open Source Status
- New release: V0.4.0 (July 2014)
- 20 developers (7 from Berkeley, 13 from outside)
- 11 companies
- Writes go synchronously to the under filesystem (no lineage information in the Developer Preview release)
- MapReduce and Spark can run without any code change (ser/de becomes the new bottleneck)
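The synchronous-write behavior described above is a write-through policy. A rough sketch, assuming a local directory as a stand-in for the under filesystem: `WriteThroughStore`, `underFs`, and `dropMemory` are hypothetical names, not Tachyon's API.

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Hypothetical write-through store: `underFs` stands in for HDFS or another
// under filesystem. Every write is persisted before it is acknowledged, so no
// lineage is needed to recover data that falls out of memory.
final class WriteThroughStore(underFs: Path) {
  private val memory = mutable.Map.empty[String, Array[Byte]]

  def write(name: String, bytes: Array[Byte]): Unit = {
    Files.write(underFs.resolve(name), bytes) // synchronous: returns after the persistent write
    memory(name) = bytes.clone()              // then keep a copy at memory speed
  }

  // Serve from memory when possible; otherwise reload from the under filesystem.
  def read(name: String): Array[Byte] =
    memory.getOrElseUpdate(name, Files.readAllBytes(underFs.resolve(name)))

  def dropMemory(): Unit = memory.clear()     // simulate losing the in-memory tier
}

object WriteThroughDemo {
  def main(args: Array[String]): Unit = {
    val store = new WriteThroughStore(Files.createTempDirectory("underfs"))
    store.write("part-0", Array[Byte](1, 2, 3))
    store.dropMemory()
    println(store.read("part-0").mkString(","))
  }
}
```

The trade-off is that writes run at the speed of the under filesystem while reads of cached data run at memory speed. The lineage-based approach in the research vision aims to remove that write cost.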

Using HDFS vs Tachyon
Spark
    val file = sc.textFile("hdfs://ip:port/path")
Shark
    CREATE TABLE orders_cached AS SELECT * FROM orders;
Hadoop MapReduce
    hadoop jar examples.jar wordcount hdfs://localhost/input hdfs://localhost/output

Using HDFS vs Tachyon
Spark
    val file = sc.textFile("tachyon://ip:port/path")
Shark
    CREATE TABLE orders_tachyon AS SELECT * FROM orders;
Hadoop MapReduce
    hadoop jar examples.jar wordcount tachyon://localhost/input tachyon://localhost/output
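The two slides differ only in the URI scheme and address, which is what lets existing jobs run unchanged. As an illustration, a small helper can rewrite an `hdfs://` location to point at Tachyon. `Locations.toTachyon` is a hypothetical helper built on `java.net.URI`, and the host names and ports in the example are assumed values for a test setup, not taken from the slides.

```scala
import java.net.URI

object Locations {
  // Hypothetical helper: keep the path, swap the scheme to tachyon://, and point
  // the authority at the Tachyon master instead of the HDFS NameNode.
  def toTachyon(location: String, tachyonMaster: String): String = {
    val uri = new URI(location)
    require(uri.getScheme == "hdfs", s"expected an hdfs:// URI, got: $location")
    new URI("tachyon", tachyonMaster, uri.getPath, uri.getQuery, uri.getFragment).toString
  }
}

object LocationsDemo {
  def main(args: Array[String]): Unit = {
    // Assumed addresses, for illustration only.
    println(Locations.toTachyon("hdfs://namenode:9000/input", "master:19998"))
  }
}
```

The resulting string can be passed wherever the job previously took the HDFS path, for example as the argument to `sc.textFile`.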

Thanks to Red Hat!

Future Research Focus
- Integration with HDFS caching
- Memory fair sharing
- Random access abstraction
- Mutable data support

Acknowledgments
Calvin Jia, Nick Lanham, Grace Huang, Mark Hamstra, Bill Zhao, Rong Gu, Hobin Yoon, Vamsi Chitters, Joseph Jin-Chuan Tang, Xi Liu, Qifan Pu, Aslan Bekirov, Reynold Xin, Xiaomin Zhang, Achal Soni, Xiang Zhong, Dilip Joseph, Srinivas Parayya, Tim St. Clair, Shivaram Venkataraman, Andrew Ash

Tachyon Summary
- As more workloads move into memory, data sharing across frameworks will become a bottleneck
- Tachyon provides in-memory, fault-tolerant data sharing across frameworks

Thanks! More: https://github.com/amplab/tachyon