An Open Source Memory-Centric Distributed Storage System

Haoyuan Li, Tachyon Nexus, haoyuan@tachyonnexus.com. September 30, 2015 @ Strata and Hadoop World NYC 2015

Outline: Open Source; Introduction to Tachyon; New Features; Getting Involved

History: started at UC Berkeley AMPLab in summer 2012 (the same lab produced Apache Spark and Apache Mesos); open sourced in April 2013 under Apache License 2.0; latest release: version 0.7.1 (August 2015); deployed at > 100 companies

Contributors Growth (contributors per release): v0.1 (Dec 2012): 1; v0.2 (Apr 2013): 3; v0.3 (Oct 2013): 15; v0.4 (Feb 2014): 30; v0.5 (Jul 2014): 46; v0.6 (Mar 2015): 70; v0.7 (Jul 2015): 111

Contributors Growth: > 150 contributors (a 3x increase since the last Strata NYC); > 50 organizations

Contributors Growth: one of the fastest growing big data open source projects

Thanks to Contributors and Users!

One Tachyon Production Deployment Example: Baidu (dominant search engine in China, ~50 billion USD market cap). Framework: SparkSQL; under storage: Baidu's File System; storage media: MEM + HDD; 100+ nodes deployment; 1PB+ managed space; 30x performance improvement

Tachyon is an Open Source Memory-centric Distributed Storage System

Why Tachyon?

Performance Trend: Memory is Fast. RAM throughput is increasing exponentially; disk throughput is increasing slowly; memory locality is key to interactive response times

Price Trend: Memory is Cheaper (source: jcmit.com)

Realized by many

Is the Problem Solved?

Missing a Solution for the Storage Layer

A Use Case Example with Spark: a fast, in-memory data processing framework. Keep one in-memory copy inside the JVM; track the lineage of operations used to derive data; upon failure, use lineage to recompute data. [Figure: lineage tracking across map, join, reduce, and filter operations]
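
The lineage idea on this slide can be illustrated with a small sketch. This is a toy model written for this transcript, not Spark or Tachyon code: the names `Dataset`, `LineageCache`, `get`, and `lose` are hypothetical. Each dataset records the operation and inputs that produced it, so a lost in-memory copy can be recomputed instead of being restored from a replica.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Dataset:
    """A dataset that remembers how it was derived (its lineage). Hypothetical type."""
    name: str
    op: Optional[Callable[..., list]] = None   # function that derives this dataset
    parents: tuple = ()                         # input datasets of `op`
    source: Optional[list] = None               # raw data, for root datasets only


class LineageCache:
    """Keeps one in-memory copy per dataset; rebuilds lost copies from lineage."""

    def __init__(self) -> None:
        self.memory: Dict[str, list] = {}
        self.computed: List[str] = []           # log of (re)computations, for illustration

    def get(self, ds: Dataset) -> list:
        if ds.name in self.memory:              # fast path: the in-memory copy exists
            return self.memory[ds.name]
        if ds.op is None:                       # root dataset: reload from its source
            data = list(ds.source)
        else:                                   # derived dataset: replay op on its parents
            data = ds.op(*(self.get(p) for p in ds.parents))
        self.computed.append(ds.name)
        self.memory[ds.name] = data
        return data

    def lose(self, ds: Dataset) -> None:
        """Simulate a failure that drops the in-memory copy of `ds`."""
        self.memory.pop(ds.name, None)


raw = Dataset("raw", source=[1, 2, 3, 4, 5, 6])
doubled = Dataset("doubled", op=lambda xs: [x * 2 for x in xs], parents=(raw,))
big = Dataset("big", op=lambda xs: [x for x in xs if x > 6], parents=(doubled,))

cache = LineageCache()
first = cache.get(big)     # computes raw, then doubled, then big
cache.lose(big)            # failure: the cached result is gone
second = cache.get(big)    # recomputed from `doubled`, which is still in memory
```

Only the lost dataset is recomputed here; its parent is still cached, so the replay stops one step up the lineage chain.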

Issue 1: Data sharing is the bottleneck in analytics pipelines: slow writes to disk. [Figure: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory block manager (storage engine and execution engine in the same process); blocks 1-4 reside in HDFS / Amazon S3; sharing requires slow writes]

Issue 1: Data sharing is the bottleneck in analytics pipelines: slow writes to disk. [Figure: a Spark job (blocks 1 and 3 in its Spark memory block manager; storage engine and execution engine in the same process) and a Hadoop MR job on YARN; blocks 1-4 reside in HDFS / Amazon S3; sharing requires slow writes]

Issue 1 resolved with Tachyon: memory-speed data sharing among jobs in different frameworks. [Figure: a Spark job (Spark mem) and a Hadoop MR job (YARN) share blocks 1-4 held in memory by Tachyon, backed by HDFS / Amazon S3 on disk; fast writes]

Issue 2: Cache loss when the process crashes. [Figure: a Spark task holds blocks 1 and 3 in its Spark memory block manager (execution engine and storage engine in the same process); blocks 1-4 reside in HDFS / Amazon S3]

Issue 2: Cache loss when the process crashes. [Figure: the Spark task crashes while blocks 1 and 3 are in its Spark memory block manager; blocks 1-4 reside in HDFS / Amazon S3]

Issue 2: Cache loss when the process crashes. [Figure: after the crash, only blocks 1-4 in HDFS / Amazon S3 remain]

Issue 2 resolved with Tachyon: keep in-memory data safe, even when a job crashes. [Figure: a Spark task with its Spark memory block manager; blocks 1-4 are held in memory by Tachyon, backed by HDFS / Amazon S3]

Issue 2 resolved with Tachyon: keep in-memory data safe, even when a job crashes. [Figure: the Spark task has crashed; blocks 1-4 remain in memory in Tachyon, backed by HDFS / Amazon S3 on disk]
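
A toy simulation can make the difference between the two designs concrete. It is only a sketch under stated assumptions: `UnderStorage`, `SharedMemoryStore`, `Job`, and `run_with_crash` are hypothetical names, the "crash" is modeled by discarding the job object, and the shared store is an ordinary Python object standing in for a separate memory-centric storage process. The count of slow reads shows what a restarted job has to fetch again.

```python
from typing import Dict, Optional


class UnderStorage:
    """Stand-in for slow, durable storage (HDFS / Amazon S3 on the slides)."""

    def __init__(self, blocks: Dict[str, bytes]) -> None:
        self.blocks = dict(blocks)
        self.reads = 0                           # number of slow reads served

    def read(self, block_id: str) -> bytes:
        self.reads += 1
        return self.blocks[block_id]


class SharedMemoryStore:
    """Stand-in for an out-of-process memory store that outlives any one job."""

    def __init__(self, under: UnderStorage) -> None:
        self.under = under
        self.cache: Dict[str, bytes] = {}

    def read(self, block_id: str) -> bytes:
        if block_id not in self.cache:
            self.cache[block_id] = self.under.read(block_id)
        return self.cache[block_id]


class Job:
    """Reads blocks, caching either in-process or through a shared store."""

    def __init__(self, under: UnderStorage,
                 shared: Optional[SharedMemoryStore] = None) -> None:
        self.under = under
        self.shared = shared
        self.local_cache: Dict[str, bytes] = {}  # lost when the job crashes

    def read(self, block_id: str) -> bytes:
        if self.shared is not None:
            return self.shared.read(block_id)
        if block_id not in self.local_cache:
            self.local_cache[block_id] = self.under.read(block_id)
        return self.local_cache[block_id]


def run_with_crash(use_shared: bool) -> int:
    """Read two blocks, crash, restart, read again; return the slow-read count."""
    under = UnderStorage({"block1": b"a", "block3": b"c"})
    shared = SharedMemoryStore(under) if use_shared else None
    job = Job(under, shared)
    job.read("block1")
    job.read("block3")
    del job                                      # crash: in-process cache is gone
    restarted = Job(under, shared)               # fresh local cache, same shared store
    restarted.read("block1")
    restarted.read("block3")
    return under.reads


reads_in_process = run_with_crash(use_shared=False)
reads_shared = run_with_crash(use_shared=True)
```

With the in-process cache, the restarted job reads both blocks from under storage again; with the shared store, the second round is served from memory.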

Issue 3: In-memory data duplication and Java garbage collection. [Figure: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in their Spark memory block manager (execution engine and storage engine in the same process; duplication and GC); blocks 1-4 reside in HDFS / Amazon S3]

Issue 3 resolved with Tachyon: no in-memory data duplication, much less GC. [Figure: Spark Job1 and Spark Job2 (Spark mem) share a single in-memory copy of blocks 1-4 held by Tachyon, backed by HDFS / Amazon S3 on disk]

Previously Mentioned: a memory-centric storage architecture; push lineage down to the storage layer

Tachyon Memory-Centric Architecture

Lineage in Tachyon

1) Ecosystem: enable new workloads in any storage; work with the framework of your choice

2) Tachyon is running in production environments, both in the cloud and on premises.

Use Case: Baidu. Framework: SparkSQL; under storage: Baidu's File System; storage media: MEM + HDD; 100+ nodes deployment; 1PB+ managed space; 30x performance improvement

Use Case: a SaaS Company. Framework: Impala; under storage: S3; storage media: MEM + SSD; 15x performance improvement

Use Case: an Oil Company. Framework: Spark; under storage: GlusterFS; storage media: MEM only; analyzing data in traditional storage

Use Case: a SaaS Company. Framework: Spark; under storage: S3; storage media: SSD only; elastic Tachyon deployment

What if data size exceeds memory capacity?

3) Tiered Storage: Tachyon manages more than DRAM. [Figure: storage tiers MEM, SSD, HDD; upper tiers are faster, lower tiers have higher capacity]

Configurable Storage Tiers: MEM only; MEM + HDD; SSD only

4) Pluggable Data Management Policy: promote hot data to the upper tier; evict stale data to the lower tier

Pin Data in Memory
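
The tiering, promotion, eviction, and pinning ideas on the preceding slides can be sketched in a few lines. This is a minimal illustration and not Tachyon's implementation: `TieredStore` and its methods are hypothetical names, capacities are counted in blocks rather than bytes, and least-recently-used ordering is assumed as the policy (the slides say only that the policy is pluggable).

```python
from collections import OrderedDict
from typing import List, Optional, Set


class TieredStore:
    """Toy tiered block store: tier 0 is the fastest, the last tier the largest."""

    def __init__(self, names: List[str], capacities: List[int]) -> None:
        self.names = names
        self.capacities = capacities             # capacity per tier, in blocks
        # One map per tier, kept in LRU order (oldest entry first).
        self.tiers: List["OrderedDict[str, bytes]"] = [OrderedDict() for _ in names]
        self.pinned: Set[str] = set()

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        if level >= len(self.tiers):
            raise MemoryError("no tier has room for block " + block_id)
        tier = self.tiers[level]
        if len(tier) >= self.capacities[level]:
            # Evict the least recently used unpinned block to the next tier down.
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:                   # every block here is pinned
                self._insert(level + 1, block_id, data)
                return
            victim_data = tier.pop(victim)
            self._insert(level + 1, victim, victim_data)
        tier[block_id] = data

    def put(self, block_id: str, data: bytes) -> None:
        for tier in self.tiers:                  # drop any older copy first
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id: str) -> Optional[bytes]:
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def pin(self, block_id: str) -> None:
        """Mark a block as never evictable from the tier it currently occupies."""
        self.pinned.add(block_id)

    def location(self, block_id: str) -> Optional[str]:
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = TieredStore(["MEM", "SSD", "HDD"], [2, 2, 4])
store.put("a", b"A")
store.pin("a")
store.put("b", b"B")
store.put("c", b"C")       # MEM is full: unpinned "b" is evicted to SSD
after_put = {k: store.location(k) for k in "abc"}
store.get("b")             # "b" is hot again: promoted to MEM, "c" moves down
after_get = {k: store.location(k) for k in "abc"}
```

In this run the pinned block "a" stays in MEM throughout, while "b" and "c" trade places between MEM and SSD as they are accessed.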

5) Transparent Naming

6) Unified Namespace

More Features: 7) Remote Write Support; 8) Easy deployment with Mesos and YARN; 9) Initial Security Support; 10) One Command Cluster Deployment; 11) Metrics Reporting for Clients, Workers, and Master

12) More Under Storage Support

Reported Tachyon Usage

Memory-Centric Distributed Storage. Welcome to try, contact, and collaborate! JIRA New Contributor Tasks

Team consists of Tachyon creators and top contributors; Series A ($7.5 million) from Andreessen Horowitz; committed to Tachyon open source

Strata NYC 2015: Welcome to visit us at our booth #P18. Check out other Tachyon related talks. "First-ever scalable, distributed deep learning architecture using Spark and Tachyon", Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.), 2:05pm-2:45pm Thursday, 10/01/2015. "Faster time to insight using Spark, Tachyon, and Zeppelin", Nirmal Ranganathan (Rackspace Hosting), 2:05pm-2:45pm Thursday, 10/01/2015

Try Tachyon: http://tachyon-project.org; Develop Tachyon: https://github.com/amplab/tachyon; Meet Friends: http://www.meetup.com/tachyon; Get News: http://goo.gl/mwb2sx; Tachyon Nexus: http://www.tachyonnexus.com; Contact us: haoyuan@tachyonnexus.com