Big Data: A Storage Systems Perspective
Muthukumar Murugan, Ph.D.
HP Storage Division
In this talk
- Big data storage: current trends
- Issues with current storage options
- Evolution of storage to support big data applications
- Hadoop is not a solution to a data problem!
Big Data: The Storage Concerns!
- Volume: petascale/exascale data
- Velocity: frequency of generation
- Variety: largely unstructured or semi-structured
- Value: frequency of analysis
- Computation model: parallel tasks, scale-out architecture
- How much are you worth to Zuckerberg?
A typical big data ecosystem
- Data mining and analytics applications
- High-level language (e.g. Pig Latin, HiveQL)
- Structured databases (e.g. HBase, Hive)
- Storage framework (e.g. HDFS, Cassandra)
- Storage: direct-attached storage (DAS) or networked
Big Data Storage Model 1
- Centralized metadata node
- Datanodes store data on local disks
- Clients talk to the metadata node and then to the datanodes
- e.g. Hadoop
[Diagram labels: Client, Name, Data, Data, Data]
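The two-step access path on this slide (ask the metadata node where the blocks are, then read them from the datanodes) can be sketched in a few lines. This is a minimal in-memory illustration, not Hadoop's implementation: the class names `MetadataNode` and `DataNode`, the helpers `write_file` and `read_file`, and the round-robin block placement are all hypothetical, and replication is left out.

```python
class MetadataNode:
    """Hypothetical central metadata service: file name -> block locations."""

    def __init__(self):
        self._blocks = {}  # file name -> list of (datanode_id, block_id)

    def register(self, name, locations):
        self._blocks[name] = list(locations)

    def lookup(self, name):
        return list(self._blocks[name])


class DataNode:
    """Hypothetical datanode: a dict stands in for the local disk."""

    def __init__(self):
        self._store = {}

    def put(self, block_id, data):
        self._store[block_id] = data

    def get(self, block_id):
        return self._store[block_id]


def write_file(meta, datanodes, name, data, block_size):
    # Split the data into fixed-size blocks and place them round-robin
    # (an assumed placement policy, chosen only to keep the sketch short).
    node_ids = sorted(datanodes)
    locations = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        node_id = node_ids[index % len(node_ids)]
        block_id = f"{name}#{index}"
        datanodes[node_id].put(block_id, data[offset:offset + block_size])
        locations.append((node_id, block_id))
    meta.register(name, locations)


def read_file(meta, datanodes, name):
    # Step 1: ask the metadata node. Step 2: fetch each block from its datanode.
    return b"".join(datanodes[node_id].get(block_id)
                    for node_id, block_id in meta.lookup(name))
```

Every read goes through `meta.lookup` first, which is why the metadata node is the central point of coordination in this model.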
Big Data Storage Model 2
- No centralized metadata node
- Datanodes store data on local disks
- Clients are routed to the appropriate node based on a hash prefix
- e.g. Cassandra
[Diagram labels: Client, hash-prefix-based routing, Data, Data, Data, Data]
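Hash-prefix routing can be illustrated without any metadata service: every client computes the same hash of the key and uses its leading bits to pick a node. The sketch below is a simplification and not Cassandra's actual partitioner or token ring; the function names `hash_prefix` and `route`, the choice of MD5, and the 8-bit prefix are assumptions made for illustration.

```python
import hashlib


def hash_prefix(key: str, prefix_bits: int = 8) -> int:
    """Return the leading prefix_bits of the key's 128-bit MD5 hash as an integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return value >> (128 - prefix_bits)


def route(key: str, nodes: list, prefix_bits: int = 8):
    """Map the hash prefix onto one of the nodes using equal-sized prefix ranges."""
    buckets = 1 << prefix_bits
    prefix = hash_prefix(key, prefix_bits)
    # Each node owns a contiguous range of prefixes.
    return nodes[prefix * len(nodes) // buckets]
```

Because the mapping depends only on the key and the node list, any client can locate the data on its own, for example `route("user:42", ["n0", "n1", "n2", "n3"])`.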
Computation Model
[Diagram: four "Data + Compute" nodes, each running tasks (e.g. Map/Reduce), with shuffle steps between them]
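The map, shuffle, and reduce stages in the diagram can be sketched as a single-process word count. This is only an illustration of the data flow, not the Hadoop API: the function names are hypothetical, each list element stands in for the data held by one node, and CRC32 is an arbitrary choice of partitioning hash.

```python
import zlib
from collections import defaultdict


def map_task(chunk: str):
    """Map: emit (word, 1) for every word in the node-local chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(mapped_outputs, num_reducers: int):
    """Shuffle: group values by key and send each key to exactly one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[reducer][key].append(value)
    return partitions


def reduce_task(partition):
    """Reduce: sum the counts collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(chunks, num_reducers: int = 2):
    mapped = [map_task(chunk) for chunk in chunks]
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_task(partition))
    return result
```

The shuffle is the only stage where data has to leave the node that produced it, which is where network bandwidth enters the picture.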
Big Data Storage Access Patterns
- Typically write-once, read-many workloads
- Metadata lookups, object reads
- Large blocks/objects: 64 MB to 128 MB (e.g. Hadoop MapReduce)
- Small accesses: e.g. HBase, Cassandra
[Diagram labels: Objects, Files, Get(), Put(), Local File System, Local Disks]
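The layering in the diagram (Get()/Put() on objects, objects kept as files, files on local disks) can be sketched as a tiny object store on top of the local file system. The class `LocalObjectStore`, the SHA-1 file naming, and `roundtrip_demo` are hypothetical names used only for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


class LocalObjectStore:
    """Hypothetical object layer: each object is one file on the local file system."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so that any string is a safe file name.
        return self.root / hashlib.sha1(key.encode("utf-8")).hexdigest()

    def put(self, key: str, data: bytes) -> None:
        self._path(key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()


def roundtrip_demo() -> bytes:
    """Write one object and read it back from a temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        store = LocalObjectStore(root)
        store.put("block-0001", b"hello")
        return store.get("block-0001")
```

Each Get() or Put() becomes a whole-file read or write here, which matches the write-once, read-many pattern on the slide; small-record access as in HBase or Cassandra would need an index inside the files, which this sketch omits.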
Issues with Existing Storage Architecture
DAS: Not so smart!
- Distributing data all over the cluster makes data management difficult
- Replicated data wastes storage space
- Tightly coupled computation and storage
- Inflexible infrastructure
Networks vs Disks: The blame game
- Over the last decade, datacenter network speeds have improved dramatically
  - 10 Gb/s Ethernet, optical networks
  - Flat network topologies
- Soon, 40 Gb/s and 100 Gb/s Ethernet will be common
- Disks are barely keeping up
- Takeaway: data locality will no longer be an issue!
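A back-of-the-envelope calculation makes the takeaway concrete. The sketch below compares the time to read one 128 MB block from a single disk with the time to move it over a 10 Gb/s link. The disk rate of 100 MB/s is an assumed round figure for sequential reads and does not come from the talk; units are decimal and protocol overhead is ignored.

```python
ASSUMED_DISK_MB_PER_S = 100.0  # assumption for illustration, not from the talk


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert a link rate in Gb/s to MB/s (decimal units, no protocol overhead)."""
    return gbps * 1000 / 8


def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    return size_mb / rate_mb_per_s


def compare(block_mb: float = 128.0, link_gbps: float = 10.0,
            disk_mb_per_s: float = ASSUMED_DISK_MB_PER_S) -> dict:
    """Time to read one block from disk versus time to move it across the network."""
    return {
        "disk_read_s": transfer_seconds(block_mb, disk_mb_per_s),
        "network_transfer_s": transfer_seconds(block_mb, gbps_to_mb_per_s(link_gbps)),
    }
```

Under these assumed numbers the network transfer takes roughly a tenth of the disk read time, so a remote read is bounded by the disk rather than by the link.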
Changing times, changing values
- Value of data is constantly changing
- Not all data is equally popular
  - Recent analysis of large-scale datacenters [1]: only 10-30% of data is most popular
- Differentiated storage for big data
  - Impossible with DAS
  - Needs sophisticated storage
[Figure: data ordered from least valuable to most valuable (by frequency of analysis, time of generation, etc.)]
[1] Ananthanarayan et al., HotOS 2011.
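Differentiated storage can be sketched as a placement policy that ranks blocks by how often they are analyzed and keeps only the most popular fraction on a fast tier. The function name `assign_tiers`, the tier names "fast" and "capacity", and the default `hot_fraction` of 0.2 (picked from within the 10-30% range on the slide) are assumptions for illustration.

```python
def assign_tiers(access_counts: dict, hot_fraction: float = 0.2) -> dict:
    """Place the most frequently accessed fraction of blocks on the fast tier."""
    if not access_counts:
        return {}
    # Rank by access count, highest first; ties are broken by block name.
    ranked = sorted(access_counts, key=lambda block: (-access_counts[block], block))
    hot_count = max(1, round(len(ranked) * hot_fraction))
    return {block: ("fast" if rank < hot_count else "capacity")
            for rank, block in enumerate(ranked)}
```

Rerunning the policy as access counts change is what turns a static layout into the kind of differentiated storage the slide calls for; with DAS, moving a block between tiers means moving it between nodes.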
New applications, new requirements
- Traditionally
  - Sequential access, large blocks
  - Task-local data access, batch jobs
- Aging data, replication
  - Remote accesses dominate
- Real-time queries and online jobs
  - Row/record accesses in indexed NoSQL databases, e.g. Accumulo, Hypertable
Revisiting Big Data Storage
Rethinking storage for big data
- Shared-nothing DAS vs shared storage
  - Management vs scalability
  - Storage bandwidth and latency capacities
- Converging multiple storage silos
[Diagram labels: Primary Cluster, Datacenter, Analytics Cluster, Storage Management Layer]
Sharing is a virtue!
- Shared nothing is extreme: inefficient but scalable
- Shared storage resources
  - Spindles, caches, network bandwidth
- Scale-out storage systems
  - Scale-out object/block/file storage systems
[Diagram labels: Shared Nothing, Big Data Storage, Traditional Enterprise Shared Storage]
HA with performance guarantees
- Performance guarantees
  - Latency, bandwidth
- Data reliability and failure resilience guarantees
- Big data archival with relaxed performance numbers
  - Compression/deduplication
[Diagram labels: Archival, Low Perf., Storage Manager]
Storage Federation
- Federated storage management
  - Integrate multiple storage islands into an archipelago
  - Varying performance/cost characteristics
- Seamless data migration
  - Dynamic workload characteristics
  - Cost/value model
[Diagram label: Storage Manager Software]
Heterogeneous storage clients
- Primary workloads
- Offline batch processing analytics jobs
- Real-time online analytics queries
[Diagram labels: Primary workloads, Real-time Analytics, Offline Analytics, Converged Storage System]
Data Management options
- Storage-aware big data infrastructure
- Storage managing big data blocks
  - Storage tracks blocks
  - Dynamically migrates blocks
- Big data application aided storage
[Diagram labels: Analytics and computation, Storage System]
Storage technology trends
- Flash: Flashcache, all-flash arrays, etc.
  - Interleaved accesses
- Non-volatile memory
  - Low-latency, persistent tier
- Fast SAN
  - Fibre Channel, 40 Gb/s iSCSI, etc.
Low-level changes
- Revisit block device access semantics
  - Interactions between objects, files, and blocks
- NVM/flash
  - Access protocols, application modifications
- Shared caches, proportional caching
- Better I/O schedulers
Summary and Conclusion
- Needed: a change in big data storage perspective
- Converged storage solutions
- Changing big data application characteristics
- Emerging technologies and performance improvements
- Overhaul traditional disk access semantics and protocols