An Open Source Memory-Centric Distributed Storage System

1 An Open Source Memory-Centric Distributed Storage System
Haoyuan Li, Tachyon Nexus
September 30, Strata and Hadoop World NYC 2015

2 Outline
- Open Source
- Introduction to Tachyon
- New Features
- Getting Involved

4 History
- Started at UC Berkeley AMPLab in summer 2012
- The same lab produced Apache Spark and Apache Mesos
- Open sourced in April 2013 under Apache License 2.0
- Latest release: August 2015
- Deployed at more than 100 companies

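5 Contributors Growth
[Chart: contributor growth by release. Chart label: 111 contributors. Releases: v0.1 (Dec 2012), v0.2 (Apr 2013), v0.3 (Oct 2013), v0.4 (Feb 2014), v0.5 (Jul 2014), v0.6 (Mar 2015), v0.7 (Jul 2015)]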

6 Contributors Growth
- More than 150 contributors (a 3x increase since the previous Strata NYC)
- More than 50 organizations

7 Contributors Growth
One of the fastest growing big data open source projects

8 Thanks to Contributors and Users!

9 One Tachyon Production Deployment Example
Baidu (dominant search engine in China, ~50 billion USD market cap)
- Framework: SparkSQL
- Under storage: Baidu's File System
- Storage media: MEM + HDD
- 100+ node deployment
- 1 PB+ managed space
- 30x performance improvement

11 Tachyon is an Open Source Memory-centric Distributed Storage System

12 Why Tachyon?

13 Performance Trend: Memory is Fast
- RAM throughput is increasing exponentially
- Disk throughput is increasing slowly
- Memory locality is key to interactive response times

14 Price Trend: Memory is Cheaper
(Source: jcmit.com)

15 Realized by many

16 Is the Problem Solved?

17 Missing a Solution for the Storage Layer

18 A Use Case Example with Spark
- Fast, in-memory data processing framework
- Keeps one in-memory copy inside the JVM
- Tracks the lineage of operations used to derive data
- Upon failure, uses lineage to recompute data
[Diagram: lineage tracking across map, join, reduce, and filter operations]
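The lineage idea on this slide (track the operations used to derive data, then recompute after a failure) can be illustrated with a minimal Python sketch. This is not Spark's implementation; the `Dataset` class and its methods are hypothetical names introduced only for illustration.

```python
class Dataset:
    """Hypothetical dataset that remembers how it was derived (its lineage)."""

    def __init__(self, name, compute, parents=()):
        self.name = name
        self.compute = compute        # function applied to the parents' data
        self.parents = list(parents)  # lineage: the datasets this one derives from
        self.cache = None             # the single in-memory copy

    def get(self):
        # Recompute from lineage only when the cached copy is missing.
        if self.cache is None:
            inputs = [parent.get() for parent in self.parents]
            self.cache = self.compute(*inputs)
        return self.cache

    def lose_cache(self):
        # Simulate a failure that wipes the in-memory copy.
        self.cache = None


source = Dataset("source", lambda: [1, 2, 3, 4, 5, 6])
mapped = Dataset("mapped", lambda xs: [x * 10 for x in xs], [source])
filtered = Dataset("filtered", lambda xs: [x for x in xs if x > 20], [mapped])

first = filtered.get()

# Simulate a crash that loses the derived data, then recover through lineage.
filtered.lose_cache()
mapped.lose_cache()
recomputed = filtered.get()

print(first == recomputed)
```

The sketch recomputes everything downstream of the surviving `source` dataset; a real system would also have to decide which intermediate results are worth keeping.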

19 Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk.
[Diagram: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory block manager and exchange data through blocks 1 to 4 in HDFS / Amazon S3. Label: "storage engine & execution engine same process (slow writes)"]

20 Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk.
[Diagram: a Spark job (blocks 1 and 3 in the Spark memory block manager) and a Hadoop MR job on YARN exchange data through blocks 1 to 4 in HDFS / Amazon S3. Label: "storage engine & execution engine same process (slow writes)"]

21 Issue 1 resolved with Tachyon
Memory-speed data sharing among jobs in different frameworks.
[Diagram: a Spark job and a Hadoop MR job on YARN share blocks 1 to 4 held in memory by Tachyon; HDFS / Amazon S3 holds blocks 1 to 4 on disk. Label: "execution engine & storage engine same process (fast writes)"]

22 Issue 2
Cache loss when the process crashes.
[Diagram: a Spark task holds blocks 1 and 3 in the Spark memory block manager; blocks 1 to 4 are in HDFS / Amazon S3. Label: "execution engine & storage engine same process"]

23 Issue 2
Cache loss when the process crashes.
[Diagram: the Spark task crashes while blocks 1 and 3 are in the Spark memory block manager; blocks 1 to 4 are in HDFS / Amazon S3. Label: "execution engine & storage engine same process"]

24 Issue 2
Cache loss when the process crashes.
[Diagram: after the crash, only blocks 1 to 4 in HDFS / Amazon S3 are shown. Label: "execution engine & storage engine same process"]

25 Issue 2 resolved with Tachyon
Keep in-memory data safe, even when a job crashes.
[Diagram: a Spark task with its Spark memory block manager; blocks 1 to 4 are held in memory by Tachyon, on top of HDFS / Amazon S3. Label: "execution engine & storage engine same process"]

26 Issue 2 resolved with Tachyon
Keep in-memory data safe, even when a job crashes.
[Diagram: the Spark task crashes; blocks 1 to 4 remain in memory in Tachyon, and blocks 1 to 4 remain on disk in HDFS / Amazon S3. Label: "execution engine & storage engine same process"]

27 Issue 3
In-memory data duplication and Java garbage collection.
[Diagram: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in a Spark memory block manager; blocks 1 to 4 are in HDFS / Amazon S3. Label: "execution engine & storage engine same process (duplication & GC)"]

28 Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC.
[Diagram: Spark Job1 and Spark Job2 share blocks 1 to 4 held in memory by Tachyon; HDFS / Amazon S3 holds blocks 1 to 4 on disk. Label: "execution engine & storage engine same process (no duplication & GC)"]

29 Previously Mentioned
- A memory-centric storage architecture
- Push lineage down to the storage layer

30 Tachyon Memory-Centric Architecture

31 Tachyon Memory-Centric Architecture

32 Lineage in Tachyon

34 1) Ecosystem: enable new workloads on any storage; work with the framework of your choice.

35 2) Tachyon runs in production environments, both in the cloud and on premises.

36 Use Case: Baidu
- Framework: SparkSQL
- Under storage: Baidu's File System
- Storage media: MEM + HDD
- 100+ node deployment
- 1 PB+ managed space
- 30x performance improvement

37 Use Case: a SaaS Company
- Framework: Impala
- Under storage: S3
- Storage media: MEM + SSD
- 15x performance improvement

38 Use Case: an Oil Company
- Framework: Spark
- Under storage: GlusterFS
- Storage media: MEM only
- Analyzing data in traditional storage

39 Use Case: a SaaS Company
- Framework: Spark
- Under storage: S3
- Storage media: SSD only
- Elastic Tachyon deployment

40 What if data size exceeds memory capacity?

41 3) Tiered Storage: Tachyon Manages More Than DRAM
[Diagram: storage tiers MEM, SSD, and HDD, ordered from faster (MEM) to higher capacity (HDD)]

42 Configurable Storage Tiers
- MEM only
- MEM + HDD
- SSD only

43 4) Pluggable Data Management Policy
- Promote hot data to upper tier
- Evict stale data to lower tier

44 Pin Data in Memory
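The tiering behavior described on the preceding slides (promote hot data, evict stale data to a lower tier, pin data in memory) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tachyon's code: the `TieredStore` class, its methods, the block-count capacities, and the least-recently-used eviction policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical tiered block store; tiers are listed fastest first."""

    def __init__(self, capacities):
        # capacities: list of (tier name, max number of blocks), e.g. ("MEM", 2)
        self.names = [name for name, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        # Each tier keeps blocks in least-recently-used order (oldest first).
        self.tiers = [OrderedDict() for _ in capacities]
        self.pinned = set()

    def _level_of(self, block_id):
        for level, blocks in enumerate(self.tiers):
            if block_id in blocks:
                return level
        return None

    def _make_room(self, level):
        blocks = self.tiers[level]
        while len(blocks) >= self.limits[level]:
            # Evict the least recently used block that is not pinned.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                raise RuntimeError("tier is full of pinned blocks")
            data = blocks.pop(victim)
            if level + 1 < len(self.tiers):
                self._place(victim, data, level + 1)
            # Otherwise the block is dropped from the lowest tier.

    def _place(self, block_id, data, level):
        self._make_room(level)
        self.tiers[level][block_id] = data

    def write(self, block_id, data):
        level = self._level_of(block_id)
        if level is not None:
            del self.tiers[level][block_id]
        self._place(block_id, data, 0)

    def read(self, block_id):
        level = self._level_of(block_id)
        if level is None:
            raise KeyError(block_id)
        data = self.tiers[level].pop(block_id)
        self._place(block_id, data, 0)  # promote hot data to the top tier
        return data

    def pin(self, block_id):
        # Pinned blocks are never chosen as eviction victims.
        self.pinned.add(block_id)

    def location(self, block_id):
        level = self._level_of(block_id)
        return None if level is None else self.names[level]


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for block_id in ("a", "b", "c"):
    store.write(block_id, f"data-{block_id}")
before = store.location("a")  # "a" was the oldest block in MEM when "c" arrived

store.pin("b")
store.read("a")               # reading "a" promotes it and evicts an unpinned block
after = {b: store.location(b) for b in ("a", "b", "c")}
print(before, after)
```

A pluggable policy would replace the hard-coded victim selection in `_make_room` with a strategy object; the sketch keeps it inline for brevity.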

45 5) Transparent Naming

46 6) Unified Namespace
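The slides do not describe how the unified namespace is implemented. One common way to present several under-storage systems as a single directory tree is a mount table with longest-prefix matching, sketched below. The `MountTable` class, the mount points, and the example URIs are hypothetical and are not taken from Tachyon.

```python
class MountTable:
    """Hypothetical mapping from one logical namespace to several under storages."""

    def __init__(self):
        self._mounts = {}  # logical path prefix -> under-storage URI prefix

    def mount(self, namespace_path, under_uri):
        prefix = namespace_path.rstrip("/") or "/"
        self._mounts[prefix] = under_uri.rstrip("/")

    def resolve(self, path):
        # Pick the longest mount prefix that covers the requested path.
        best = None
        for prefix in self._mounts:
            covers = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        relative = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base


table = MountTable()
table.mount("/", "hdfs://namenode:9000/tachyon")   # example URI, an assumption
table.mount("/logs", "s3://example-bucket/logs")   # example URI, an assumption

log_uri = table.resolve("/logs/2015/09/30.log")
table_uri = table.resolve("/tables/users")
print(log_uri)
print(table_uri)
```

Because each logical path maps to a predictable location in the under storage, a scheme like this also keeps file names recognizable outside the memory layer, which is the idea behind transparent naming.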

47 More Features
7) Remote Write Support
8) Easy deployment with Mesos and YARN
9) Initial Security Support
10) One Command Cluster Deployment
11) Metrics Reporting for Clients, Workers, and Master

48 12) More Under Storage Supports

49 Reported Tachyon Usage

51 Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
JIRA New Contributor Tasks

52 Team consists of Tachyon creators and top contributors
Series A ($7.5 million) from Andreessen Horowitz
Committed to Tachyon Open Source

54 Strata NYC 2015
Visit us at booth #P18 and check out the other Tachyon-related talks:
- First-ever scalable, distributed deep learning architecture using Spark and Tachyon. Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.). 2:05pm-2:45pm, Thursday, 10/01/2015
- Faster time to insight using Spark, Tachyon, and Zeppelin. Nirmal Ranganathan (Rackspace Hosting). 2:05pm-2:45pm, Thursday, 10/01/

55 Try Tachyon: Develop Tachyon: Meet Friends: Get News: Tachyon Nexus: Contact us:
