An Open Source Memory-Centric Distributed Storage System
Haoyuan Li, Tachyon Nexus
haoyuan@tachyonnexus.com
September 30, 2015 @ Strata and Hadoop World NYC 2015
Outline
- Open Source
- Introduction to Tachyon
- New Features
- Getting Involved
History
- Started at UC Berkeley AMPLab in summer 2012
- Same lab produced Apache Spark and Apache Mesos
- Open sourced April 2013
- Apache License 2.0
- Latest release: version 0.7.1 (August 2015)
- Deployed at more than 100 companies
Contributors Growth
[Chart: number of contributors by release]
- v0.1 (Dec 2012): 1
- v0.2 (Apr 2013): 3
- v0.3 (Oct 2013): 15
- v0.4 (Feb 2014): 30
- v0.5 (Jul 2014): 46
- v0.6 (Mar 2015): 70
- v0.7 (Jul 2015): 111
Contributors Growth
- More than 150 contributors (a 3x increase since the last Strata NYC)
- More than 50 organizations
Contributors Growth
One of the fastest growing big data open source projects
Thanks to Contributors and Users!
One Tachyon Production Deployment Example
Baidu (dominant search engine in China, ~50 billion USD market cap)
- Framework: SparkSQL
- Under storage: Baidu's File System
- Storage media: MEM + HDD
- 100+ node deployment
- 1 PB+ managed space
- 30x performance improvement
Tachyon is an Open Source Memory-centric Distributed Storage System
Why Tachyon?
Performance Trend: Memory is Fast
- RAM throughput increasing exponentially
- Disk throughput increasing slowly
- Memory locality is key to interactive response times
Price Trend: Memory is Cheaper
Source: jcmit.com
Realized by many
Is the Problem Solved?
Missing a Solution for the Storage Layer
A Use Case Example with Spark
- Fast, in-memory data processing framework
- Keep one in-memory copy inside the JVM
- Track lineage of operations used to derive data
- Upon failure, use lineage to recompute data
[Figure: lineage tracking across map, join, reduce, and filter operations]
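The lineage idea on this slide can be illustrated with a small sketch. This is not Spark's or Tachyon's implementation; the names `Dataset`, `LineageStore`, `get`, and `lose` are hypothetical and exist only for this example. It assumes each derived dataset records its parents and the operation that produced it, so a lost in-memory copy can be rebuilt from lineage instead of being reread from disk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Dataset:
    """A node in the lineage graph (hypothetical type for illustration)."""
    name: str
    parents: List["Dataset"] = field(default_factory=list)
    op: Optional[Callable[..., list]] = None  # derives this data from the parents
    source: Optional[list] = None             # set only for durable input data


class LineageStore:
    """Keeps one in-memory copy per dataset and recomputes lost copies."""

    def __init__(self) -> None:
        self.cache: Dict[str, list] = {}
        self.recomputed: List[str] = []  # log of derivations, for inspection

    def get(self, ds: Dataset) -> list:
        if ds.name in self.cache:
            return self.cache[ds.name]
        if ds.source is not None:
            value = list(ds.source)
        else:
            # Walk the lineage: materialize the parents, then re-apply the op.
            inputs = [self.get(p) for p in ds.parents]
            value = ds.op(*inputs)
            self.recomputed.append(ds.name)
        self.cache[ds.name] = value
        return value

    def lose(self, ds: Dataset) -> None:
        """Simulate a failure that drops the in-memory copy."""
        self.cache.pop(ds.name, None)


raw = Dataset("raw", source=[1, 2, 3, 4, 5, 6])
evens = Dataset("evens", parents=[raw], op=lambda xs: [x for x in xs if x % 2 == 0])
squares = Dataset("squares", parents=[evens], op=lambda xs: [x * x for x in xs])

store = LineageStore()
first = store.get(squares)
store.lose(squares)          # the cached copy disappears
second = store.get(squares)  # rebuilt from the cached parent via lineage
```

Only the lost dataset is recomputed here because its parent is still cached; if the parent were lost as well, the same recursion would rebuild it first.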
Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: Spark Job1 and Spark Job2 each cache blocks 1 and 3 in their own Spark memory block manager and share data through blocks 1-4 in HDFS / Amazon S3; label: storage engine & execution engine same process (slow writes)]
[Figure: a Spark job (Spark memory block manager holding blocks 1 and 3) and a Hadoop MR job on YARN share data through blocks 1-4 in HDFS / Amazon S3; label: storage engine & execution engine same process (slow writes)]
Issue 1 resolved with Tachyon
Memory-speed data sharing among jobs in different frameworks
[Figure: a Spark job and a Hadoop MR job on YARN share blocks 1-4 held in memory by Tachyon, backed by blocks 1-4 on disk in HDFS / Amazon S3; label: execution engine & storage engine same process (fast writes)]
Issue 2
Cache loss when process crashes
[Figure sequence: a Spark task caches blocks 1 and 3 in the Spark memory block manager (execution engine & storage engine same process); the process crashes; the cached blocks are gone and only blocks 1-4 in HDFS / Amazon S3 remain]
Issue 2 resolved with Tachyon
Keep in-memory data safe, even when a job crashes.
[Figure sequence: a Spark task reads blocks 1-4 held in memory by Tachyon, backed by HDFS / Amazon S3 on disk; the Spark process crashes; blocks 1-4 remain in memory in Tachyon]
Issue 3
In-memory data duplication and Java garbage collection
[Figure: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in a Spark memory block manager, above blocks 1-4 in HDFS / Amazon S3; label: execution engine & storage engine same process (duplication & GC)]
Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC
[Figure: Spark Job1 and Spark Job2 share a single in-memory copy of blocks 1-4 in Tachyon, backed by blocks 1-4 on disk in HDFS / Amazon S3; label: execution engine & storage engine same process (no duplication & GC)]
Previously Mentioned
- A memory-centric storage architecture
- Push lineage down to storage layer
Tachyon Memory-Centric Architecture
Lineage in Tachyon
1) Ecosystem: enable new workloads on any storage; work with the framework of your choice.
2) Tachyon runs in production environments, both in the cloud and on premises.
Use Case: Baidu
- Framework: SparkSQL
- Under storage: Baidu's File System
- Storage media: MEM + HDD
- 100+ node deployment
- 1 PB+ managed space
- 30x performance improvement
Use Case: a SaaS Company
- Framework: Impala
- Under storage: S3
- Storage media: MEM + SSD
- 15x performance improvement
Use Case: an Oil Company
- Framework: Spark
- Under storage: GlusterFS
- Storage media: MEM only
- Analyzing data in traditional storage
Use Case: a SaaS Company
- Framework: Spark
- Under storage: S3
- Storage media: SSD only
- Elastic Tachyon deployment
What if data size exceeds memory capacity?
3) Tiered Storage: Tachyon Manages More Than DRAM
[Figure: storage tiers MEM, SSD, HDD; upper tiers are faster, lower tiers have higher capacity]
Configurable Storage Tiers
- MEM only
- MEM + HDD
- SSD only
4) Pluggable Data Management Policy
- Promote hot data to upper tier
- Evict stale data to lower tier
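The promote and evict behavior named above can be sketched as follows. This is an illustration only, not Tachyon's code or configuration: the classes `Tier` and `TieredStore`, the block-count capacities, and the least-recently-used policy are all assumptions chosen to keep the example short. A read promotes the block to the top tier, and a tier that overflows pushes its least recently used block down one level.

```python
from collections import OrderedDict
from typing import List, Optional


class Tier:
    """One storage tier (hypothetical); capacity is counted in blocks."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> data, least recently used first


class TieredStore:
    """Tiers ordered fastest first, e.g. MEM, SSD, HDD."""

    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        tier = self.tiers[level]
        tier.blocks[block_id] = data
        tier.blocks.move_to_end(block_id)  # mark as most recently used
        while len(tier.blocks) > tier.capacity:
            # Evict the stalest block to the next lower tier, if there is one.
            victim_id, victim = tier.blocks.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim_id, victim)
            # Otherwise the block simply leaves this sketch's bottom tier.

    def write(self, block_id: str, data: bytes) -> None:
        for tier in self.tiers:
            tier.blocks.pop(block_id, None)  # drop any older copy
        self._insert(0, block_id, data)

    def read(self, block_id: str) -> Optional[bytes]:
        for tier in self.tiers:
            if block_id in tier.blocks:
                data = tier.blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def location(self, block_id: str) -> Optional[str]:
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name
        return None


store = TieredStore([Tier("MEM", 2), Tier("SSD", 2), Tier("HDD", 4)])
for block_id in ["a", "b", "c"]:
    store.write(block_id, b"data-" + block_id.encode())

after_writes = store.location("a")  # "a" was evicted when "c" arrived
value = store.read("a")             # reading "a" promotes it again
after_read = store.location("a")
demoted = store.location("b")       # "b" made room for "a"
```

Because the policy is described as pluggable, the least-recently-used choice here stands in for whatever policy a deployment selects.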
Pin Data in Memory
5) Transparent Naming
6) Unified Namespace
More Features
7) Remote Write Support
8) Easy deployment with Mesos and YARN
9) Initial Security Support
10) One Command Cluster Deployment
11) Metrics Reporting for Clients, Workers, and Master
12) Support for More Under Storage Systems
Reported Tachyon Usage
Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
JIRA New Contributor Tasks
- Team consists of Tachyon creators and top contributors
- Series A ($7.5 million) from Andreessen Horowitz
- Committed to Tachyon Open Source
Strata NYC 2015
Welcome to visit us at our booth #P18. Check out other Tachyon-related talks:
- First-ever scalable, distributed deep learning architecture using Spark and Tachyon. Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.). 2:05pm-2:45pm, Thursday, 10/01/2015
- Faster time to insight using Spark, Tachyon, and Zeppelin. Nirmal Ranganathan (Rackspace Hosting). 2:05pm-2:45pm, Thursday, 10/01/2015
- Try Tachyon: http://tachyon-project.org
- Develop Tachyon: https://github.com/amplab/tachyon
- Meet Friends: http://www.meetup.com/tachyon
- Get News: http://goo.gl/mwb2sx
- Tachyon Nexus: http://www.tachyonnexus.com
- Contact us: haoyuan@tachyonnexus.com