An Open Source Memory-Centric Distributed Storage System

Similar documents
Tachyon: memory-speed data sharing

Tachyon: A Reliable Memory Centric Storage for Big Data Analytics

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks

How To Create A Data Visualization With Apache Spark And Zeppelin

Ali Ghodsi Head of PM and Engineering Databricks

Next-Gen Big Data Analytics using the Spark stack

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Unified Big Data Processing with Apache Spark. Matei

Mambo Running Analytics on Enterprise Storage

Moving From Hadoop to Spark

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Conquering Big Data with BDAS (Berkeley Data Analytics)

What s next for the Berkeley Data Analytics Stack?

HDFS 2015: Past, Present, and Future

Hadoop & Spark Using Amazon EMR

Accelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Processing NGS Data with Hadoop-BAM and SeqPig

DataStax Enterprise, powered by Apache Cassandra (TM)

Cisco Data Preparation

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Hadoop: Embracing future hardware

How Companies are! Using Spark

Big Data Performance Growth on the Rise

Big Data Processing. Patrick Wendell Databricks

The Future of Data Management

Significantly Speed up real world big data Applications using Apache Spark

Big Data Research in the AMPLab: BDAS and Beyond

Spark: Making Big Data Interactive & Real-Time

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Berkeley Data Analytics Stack:! Experience and Lesson Learned

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Hadoop & its Usage at Facebook

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

DataStax Enterprise Reference Architecture

Upcoming Announcements

Hierarchy storage in Tachyon.

The Berkeley AMPLab - Collaborative Big Data Research

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Apache Spark and the future of big data applica5ons. Eric Baldeschwieler

The Flash Transformed Data Center & the Unlimited Future of Flash John Scaramuzzo Sr. Vice President & General Manager, Enterprise Storage Solutions

HDP Enabling the Modern Data Architecture

Big Data Analytics - Accelerated. stream-horizon.com

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Big Fast Data Hadoop acceleration with Flash. June 2013

Platfora Big Data Analytics

Big Data Trends and HDFS Evolution

Scaling Out With Apache Spark. DTL Meeting Slides based on

Beyond Hadoop with Apache Spark and BDAS

Hadoop & its Usage at Facebook

Apache Hadoop FileSystem and its Usage in Facebook

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Distributed File Systems

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Workshop on Hadoop with Big Data

The Flash-Transformed Financial Data Center. Jean S. Bozman Enterprise Solutions Manager, Enterprise Storage Solutions Corporation August 6, 2014

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Application Development. A Paradigm Shift

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

vrops Microsoft SQL Server MANAGEMENT PACK OVERVIEW

INTRODUCING APACHE IGNITE An Apache Incubator Project

Oracle Big Data SQL Technical Update

NextGen Infrastructure for Big DATA Analytics.

Hadoop-BAM and SeqPig

How To Choose A Data Flow Pipeline From A Data Processing Platform

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Can t We All Just Get Along? Spark and Resource Management on Hadoop

Extended Attributes and Transparent Encryption in Apache Hadoop

Hadoop Architecture. Part 1

Private Cloud Storage for Media Applications. Bang Chang Vice President, Broadcast Servers and Storage

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

HDP Hadoop From concept to deployment.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

THE JOURNEY TO A DATA LAKE

Large scale processing using Hadoop. Ján Vaňo

Big Data - Infrastructure Considerations

Hadoop Size does Hadoop Summit 2013

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

CS 294: Big Data System Research: Trends and Challenges

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Transcription:

An Open Source Memory-Centric Distributed Storage System Haoyuan Li, Tachyon Nexus haoyuan@tachyonnexus.com September 30, 2015 @ Strata and Hadoop World NYC 2015

Outline Open Source Introduction to Tachyon New Features Getting Involved 2

Outline Open Source Introduction to Tachyon New Features Getting Involved 3

History Started at UC Berkeley AMPLab From summer 2012 Same lab produced Apache Spark and Apache Mesos Open sourced April 2013 Apache License 2.0 Latest Release: Version 0.7.1 (August 2015) Deployed at > 100 companies 4

111 Contributors Growth 70 46 30 15 1 3 v0.1 Dec 12 v0.2 Apr 13 v0.3! Oct 13 v0.4! Feb 14 v0.5! Jul 14 v0.6! Mar 15 v0.7! Jul 15 5

Contributors Growth > 150 Contributors (3x increment over the last Strata NYC) > 50 Organizations 6

Contributors Growth One of the Fastest Growing Big Data Open Source Project 7

Thanks to Contributors and Users! 8

One Tachyon Production Deployment Example Baidu (Dominant Search Engine in China, ~ 50 Billion USD Market Cap) Framework: SparkSQL Under Storage: Baidu s File System Storage Media: MEM + HDD 100+ nodes deployment 1PB+ managed space 30x Performance Improvement 9

Outline Open Source Introduction to Tachyon New Features Getting Involved 10

Tachyon is an Open Source Memory-centric Distributed Storage System 11

Why Tachyon? 12

Performance Trend: Memory is Fast RAM throughput increasing exponentially Disk throughput increasing slowly Memory-locality key to interactive response times 13

Price Trend: Memory is Cheaper source: jcmit.com 14

Realized by many 15

Is the Problem Solved? 16

Missing a Solution for the Storage Layer 17

A Use Case Example with - Fast, in-memory data processing framework Keep one in-memory copy inside JVM Track lineage of operations used to derive data Upon failure, use lineage to recompute data map Lineage Tracking join reduce filter map 18

Issue 1 Data Sharing is the bottleneck in analytics pipeline: Slow writes to disk storage engine & execution engine same process (slow writes) Spark Job1 block 1 block 3 Spark mem block manager Spark Job2 block 3 block 1 Spark mem block manager block 1 block 3 block 2 block 4 HDFS / Amazon S3 19

Issue 1 Data Sharing is the bottleneck in analytics pipeline: Slow writes to disk storage engine & execution engine same process (slow writes) block 1 block 3 Spark Job Spark mem block manager Hadoop MR Job YARN block 1 block 3 block 2 block 4 HDFS / Amazon S3 20

Issue 1 resolved with Tachyon Memory-speed data sharing among jobs in different execution engine & storage engine same process (fast writes) frameworks Spark Job Spark mem Hadoop MR Job YARN block 11 block 33 block 1 block 3 block 2 block 44 block 2 block 4 Tachyon! HDFS disk in-memory HDFS / Amazon S3 21

Issue 2 Cache loss when process crashes execution engine & storage engine same process block 1 block 3 Spark Task Spark memory block manager block 1 block 3 block 2 block 4 HDFS / Amazon S3 22

Issue 2 Cache loss when process crashes execution engine & storage engine same process block 1 block 3 crash Spark memory block manager block 1 block 3 block 2 block 4 HDFS / Amazon S3 23

Issue 2 Cache loss when process crashes execution engine & storage engine same process crash block 1 block 3 block 2 block 4 HDFS / Amazon S3 24

Issue 2 resolved with Tachyon Keep in-memory data safe, even when a job crashes. execution engine & storage engine same process Spark Task Spark memory block manager block 1 block 3 block 2 block 4 Tachyon! HDFS / Amazon S3 in-memory 25

Issue 2 resolved with Tachyon Keep in-memory data safe, even when a job crashes. execution engine & storage engine same process crash block 11 block 33 block 2 block 44 Tachyon! HDFS in-memory disk block 1 block 3 block 2 block 4 HDFS / Amazon S3 26

Issue 3 In-memory Data Duplication & Java Garbage Collection execution engine & storage engine same process (duplication & GC) Spark Job1 block 1 block 3 Spark mem block manager Spark Job2 block 3 block 1 Spark mem block manager block 1 block 3 block 2 block 4 HDFS / Amazon S3 27

Issue 3 resolved with Tachyon No in-memory data duplication, much less GC execution engine & storage engine same process (no duplication & GC) Spark Job1 Spark mem Spark Job2 Spark mem block 11 block 33 block 1 block 3 block 2 block 44 block 2 block 4 Tachyon! HDFS disk in-memory HDFS / Amazon S3 28

Previously Mentioned A memory-centric storage architecture Push lineage down to storage layer 29

Tachyon Memory-Centric Architecture 30

Tachyon Memory-Centric Architecture 31

Lineage in Tachyon 32

Outline Open Source Introduction to Tachyon New Features Getting Involved 33

1) Eco-system: Enable new workload in any storage; Work with the framework of your choice; 34

2) Tachyon running in production environment, both in the Cloud and on Premise. 35

Use Case: Baidu Framework: SparkSQL Under Storage: Baidu s File System Storage Media: MEM + HDD 100+ nodes deployment 1PB+ managed space 30x Performance Improvement 36

Use Case: a SAAS Company Framework: Impala Under Storage: S3 Storage Media: MEM + SSD 15x Performance Improvement 37

Use Case: an Oil Company Framework: Spark Under Storage: GlusterFS Storage Media: MEM only Analyzing data in traditional storage 38

Use Case: a SAAS Company Framework: Spark Under Storage: S3 Storage Media: SSD only Elastic Tachyon deployment 39

What if data size exceeds memory capacity? 40

3) Tiered Storage: Tachyon Manages More Than DRAM Faster MEM SSD HDD Higher Capacity 41

Configurable Storage Tiers MEM only MEM + HHD SSD only 42

4) Pluggable Data Management Policy Promote hot data to upper tier Evict stale data to lower tier 43

Pin Data in Memory 44

5) Transparent Naming 45

6) Unified Namespace 46

More Features 7) Remote Write Support 8) Easy deployment with Mesos and Yarn 9) Initial Security Support 10) One Command Cluster Deployment 11) Metrics Reporting for Clients, Workers, and Master 47

12) More Under Storage Supports 48

Reported Tachyon Usage 49

Outline Open Source Introduction to Tachyon New Features Getting Involved 50

Memory-Centric Distributed Storage Welcome to try, contact, and collaborate! JIRA New Contributor Tasks 51

Team consists of Tachyon creators, top contributors Series A ($7.5 million) from Andreessen Horowitz Committed to Tachyon Open Source 52

53

Strata NYC 2015 Welcome to visit us at our booth #P18. Check out other Tachyon related talks. First-ever scalable, distributed deep learning architecture using Spark and Tachyon Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc) 2:05pm 2:45pm Thursday, 10/01/2015 Faster time to insight using Spark, Tachyon, and Zeppelin Nirmal Ranganathan (Rackspace Hosting) 2:05pm 2:45pm Thursday, 10/01/2015 54

Try Tachyon: http://tachyon-project.org Develop Tachyon: https://github.com/amplab/tachyon Meet Friends: http://www.meetup.com/tachyon Get News: http://goo.gl/mwb2sx Tachyon Nexus: http://www.tachyonnexus.com Contact us: haoyuan@tachyonnexus.com 55