HPC data becomes Big Data
Peter Braam
peter.braam@braamresearch.com
Me
- 1983-2000: academia, maths & computer science
- Entrepreneur with startups (5x); 4 startups sold
  - Lustre emerged
  - Held executive jobs with acquirers
- 2014: independent; advise, research
  - Advise SKA SDP @ Cambridge
  - Research on automatic parallelization with the Haskell community
  - Help others
Contents
- Introduction: market & key questions
- Some Big Data problems & algorithms
- HPC storage
- Cloud storage
- Conclusions
Key questions & market trends
Two questions
- Given an HPC storage system, how can it be used for Big Data analysis?
- What storage platforms are candidates to meet both HPC and Big Data requirements?
IDC market data

Fact                                                  2011     2013
% of sites using co-processors                        28.2%    76.9%
HPC sites performing big data analysis                         67%
% of compute cycles dedicated to big data                      30%
% of sites using cloud infrastructure for HPC         18.8%    23.5%
Year-over-year growth in high-density servers ($)              25.5%
Year-over-year growth in servers ($)                           -6.2%
Other facts
- Flash and much faster persistent-memory tiers are inevitably coming. Multiple software challenges arise from this:
  - management of tiers
  - much faster storage software, to keep up with the devices
- The gap between disk performance and the rest of the system continues to increase.
- There is embedded processing on servers with attached storage, and client-server processing with clients networked to servers. The pros & cons are somewhat unclear.
Big Data Problems & Algorithms
Big Data problems: samples
- Input generally comes from simulation or sensors.
- Climate modeling: simulate, then
  - find the hottest day each year in Cape Town (see the sketch below)
  - find very low-pressure spots (typhoons) on Earth
- Genomics, astronomy:
  - find patterns (e.g. strings, galaxies) in huge data sets
  - pre-process data at TB/sec rates
- Data management:
  - move all files with data on a particular server
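As a concrete illustration of the first kind of query, here is a minimal sketch, assuming the climate output has already been reduced to (date, temperature) records for one grid point; the record layout and the values are invented for illustration, not taken from any particular climate code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (date, temperature in degrees C) for a single grid point (Cape Town).
records = [
    (date(2011, 1, 14), 34.1),
    (date(2011, 7, 2), 17.9),
    (date(2012, 2, 21), 36.4),
    (date(2012, 12, 30), 31.2),
]

# Reduce: keep the hottest day per year.
hottest = defaultdict(lambda: (None, float("-inf")))
for day, temp in records:
    if temp > hottest[day.year][1]:
        hottest[day.year] = (day, temp)

for year in sorted(hottest):
    day, temp = hottest[year]
    print(year, day.isoformat(), temp)
```

At scale the same reduction would run as a per-year max over partitioned data, but the logic stays this simple.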
Big Data problems: samples 2
- Social networks, advertising & intelligence
  - most of these become graph problems, some very hard
- Non-compliance in stock market transaction logs
- Replacing legacy consumer-information data warehousing with modern analytics
  - replacements of Teradata / Netezza are sometimes difficult
  - modern platforms lack an easy-to-use analytics language
Wide variations
- Some problems (e.g. some graph problems) must be executed in RAM.
  - Graph500 benchmark: 2000x speedup in 2.5 years
- Other problems require many iterations through disk-resident data.
- Netezza analytics systems use FPGAs for accelerated streaming (e.g. filtering, compressing).
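For context, the main Graph500 kernel is a breadth-first search over a large in-memory graph. Below is a minimal level-synchronous BFS sketch on an adjacency-list graph; it is purely illustrative and nothing like a tuned Graph500 implementation, but it shows why the working set must sit in RAM.

```python
from collections import deque

def bfs(adj, source):
    """Level-synchronous BFS; adj maps vertex -> list of neighbours."""
    parent = {source: source}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in parent:        # first visit sets the BFS parent
                parent[v] = u
                frontier.append(v)
    return parent                      # reachable vertices and their BFS parents

# Tiny toy graph; real Graph500 runs traverse synthetic graphs with billions of edges.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs(adj, 0))
```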
Big Data algorithms
Considerable variation:
- machine learning
- Bayesian analysis
- indexing, sorting, DB-like operations
- graph algorithms
- Maximal Information Coefficient, which generalizes regression
- compressed sensing (aka sparse recovery)
- topological data analysis
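To make one item from this list concrete, here is a minimal compressed-sensing sketch: recovering a sparse vector from few linear measurements using orthogonal matching pursuit, one standard sparse-recovery algorithm. The problem sizes and random measurement matrix are toy values chosen for illustration only.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: recover a k-sparse x with y ~= A @ x."""
    residual = y.astype(float).copy()
    support = []
    coeffs = np.array([])
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # column most correlated with residual
        support.append(j)
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x = np.zeros(A.shape[1])
    x[support] = coeffs
    return x

rng = np.random.default_rng(0)
n, m, k = 256, 64, 4                                 # signal length, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true
x_hat = omp(A, y, k)
print("recovery error:", np.linalg.norm(x_hat - x_true))
```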
Ogres
Analogously to the Berkeley Dwarfs, big data problems have been classified: see "Understanding Big Data Applications and Architectures", 1st JTC 1 SGBD Meeting, SDSC, San Diego, March 19, 2014, by Geoffrey Fox, Judy Qiu, and Shantenu Jha (Rutgers).
So
Given these variations, a single architecture is not likely to address all big data problems well.
HPC Storage
HPC data
Traditional model: a cluster file system with either
- Single Shared File (with as many readers/writers as cores), or
- File Per Process (and 1 process per core).
Tightly coupled problems allow little scheduling of tasklets or redistribution of I/O.
Problems (see the sketch below):
- throughput == #server nodes x (speed of the slowest node)
- very sensitive to component variation
- monitoring tools fail to find the root cause
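A minimal sketch of why the "slowest node" term matters, using made-up numbers: when every client stripes synchronously across all servers, each I/O phase finishes only when the slowest server does, so aggregate throughput tracks the slowest node rather than the average.

```python
# Hypothetical cluster: 99 servers rated at 3.0 GB/s, one degraded server at 1.0 GB/s.
rates_gbps = [3.0] * 99 + [1.0]

# If every client stripes over every server and waits for all of them,
# effective aggregate throughput ~ #servers * min(rate).
synchronous = len(rates_gbps) * min(rates_gbps)

# An idealised load-balanced system would instead approach the sum of the rates.
ideal = sum(rates_gbps)

print(f"synchronous striping: {synchronous:.0f} GB/s")   # 100 GB/s
print(f"ideal (sum of rates): {ideal:.0f} GB/s")         # 298 GB/s
```

One slow component drags the whole system down, which is why HPC storage is so sensitive to component variation.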
Results are quite reasonable
Systems like Lustre, GPFS, Panasas:
- use carefully configured and tested hardware and fast networks
- deliver ~80% of the slowest hardware component
- pipelines from clients to disk are uniformly wide
- servers can deliver ~3 GB/sec per controller
Achilles heels:
- metadata
- availability
- data management
A sample of hard cases
- First write, then read: why the gap?
- Opening & creating files is too slow; it should run >2x faster. First seen at ORNL in 2006.
- Metadata performance on Sequoia and on Cove (50 & 5 SSD drives); a rough single-client probe is sketched below:
  - low 1000s to ~15K ops/sec
  - maximum ever seen: ~50K ops/sec
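To get a feel for these numbers, here is a minimal single-client create-rate probe. It is a crude stand-in for tools like mdtest: it measures one node against whatever file system backs the scratch directory and says nothing about server-side scaling or concurrency.

```python
import os
import time
import tempfile

def create_rate(n_files=10_000):
    """Create n_files empty files in a scratch directory and report creates/sec."""
    with tempfile.TemporaryDirectory() as scratch:
        start = time.perf_counter()
        for i in range(n_files):
            fd = os.open(os.path.join(scratch, f"f{i}"), os.O_CREAT | os.O_WRONLY, 0o644)
            os.close(fd)
        elapsed = time.perf_counter() - start
    return n_files / elapsed

if __name__ == "__main__":
    print(f"{create_rate():.0f} creates/sec")
```

Point the scratch directory at a parallel file system mount and the same loop gives a rough per-client metadata rate.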
HPC hard cases, continued
Larger numbers of concurrent metadata clients are not easy.
Conclusions:
1. Problems with systems like Lustre remain.
2. They are sensitive to having uniformly good hardware.
3. Honest data from the users, and understanding of it, exists.
4. They have been used at very large scale.
Acknowledgement: graphs from a variety of presentations given at LADD 2013
Cloud data into an HPC file system
Intel's FastForward project:
- ingest massive ACG graphs through Hadoop
- represent the ACG using an HDF5 adaptation layer (HAL) and in Lustre DAOS objects
- then compute
Acknowledgement: figure from Intel's hpdd.intel.com wiki
Cloud Storage
Hybrid solutions may be best
TACC Wrangler system:
- Big Data companion to Stampede
- DSSD storage is PCI-connected and has a KV interface
- 120-node Dell cluster with DSSD storage, 275M IOPS
Undoubtedly:
- this will solve many big data problems well
- there will be problems that don't fit, or for which flash is too slow
Typical cloud storage
Combines:
- memcached
- key-value stores or DBs: relational, distributed key-value, embedded key-value
  (MySQL, Cassandra / HBase, RocksDB / LevelDB)
- object stores (Swift, CEPH, ...)
Results:
- read-heavy loads from one cluster: 100s of servers serving 10Ms of requests/sec
- only the embedded DBs keep up with flash and NVRAM
  (flash means ~10us per read or write; RAM means <1us)
- flexible schemas for metadata
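To illustrate the embedded key-value pattern (RocksDB/LevelDB-style: the store lives inside the application process, so there is no network round trip per operation), here is a minimal sketch using Python's standard-library dbm module as a stand-in. The real engines add log-structured merge trees, compaction, and far higher throughput; the file name and keys are invented.

```python
import dbm
import time

# Open (or create) an embedded key-value store; no server process, no network hop.
with dbm.open("kvstore.db", "c") as db:
    # put
    db[b"user:42"] = b'{"name": "alice", "plan": "pro"}'

    # get
    print(db[b"user:42"])

    # Rough per-operation cost of the embedded path (dominated by local I/O, not RPC).
    n = 10_000
    start = time.perf_counter()
    for i in range(n):
        db[b"k%d" % i] = b"v"
    elapsed = time.perf_counter() - start
    print(f"~{elapsed / n * 1e6:.1f} us per put")
```

Because every operation stays in-process, latency is set by the storage medium itself, which is why only this class of store keeps up with flash and NVRAM.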
Manageability
- AWS elastic cloud is a masterpiece.
- Open-source solutions do similar: Cassandra, CEPH, OpenStack.
Tiered storage
When is tiered storage important?
- For HPC, dumping RAM requires a flash cache.
- Likely of increasing importance: L1/L2/L3, PCM, flash, disk, tape.
Tiered storage can use a container concept (sketched below):
- cache misses fetch a whole container to faster memory
- high bandwidth transfers the container relatively quickly
- one-time latency, e.g. 1 sec, then the speed of the faster tier
Key point: neither cloud nor HPC has this now.
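A minimal sketch of the container idea: on a miss, the whole container is staged from the slow tier into the fast tier once, and all subsequent reads are served at fast-tier speed. The class, tiers, and key layout here are invented purely for illustration.

```python
class TieredStore:
    """Toy two-tier store: objects are grouped into containers; a miss stages the whole container."""

    def __init__(self, slow_tier, container_of):
        self.slow_tier = slow_tier          # dict: container_id -> {key: value}, e.g. disk/tape
        self.container_of = container_of    # function: key -> container_id
        self.fast_tier = {}                 # staged containers, e.g. flash/RAM

    def read(self, key):
        cid = self.container_of(key)
        if cid not in self.fast_tier:
            # One-time cost: bulk-transfer the whole container into the fast tier.
            self.fast_tier[cid] = dict(self.slow_tier[cid])
        return self.fast_tier[cid][key]     # subsequent reads hit the fast tier

# Example: keys "c0/k3" live in container "c0".
slow = {"c0": {"c0/k%d" % i: i for i in range(4)}}
store = TieredStore(slow, container_of=lambda key: key.split("/")[0])
print(store.read("c0/k3"))   # stages container c0, then serves from the fast tier
```

The design choice is to pay one large, high-bandwidth transfer up front instead of many small slow-tier accesses later.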
Cloud object stores: CEPH
- An object is a file with an id, not with a name.
- CEPH manages:
  - removal and addition of storage
  - failed nodes and racks
  - quite clever load balancing and data placement
- CRUSH data placement is perfect for management.
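The flavour of CRUSH-style placement can be hinted at with a much-simplified sketch: placement is computed from a hash rather than looked up in a central table, so any client can locate data, and adding or removing a device moves only a bounded fraction of objects. The code below uses plain rendezvous (highest-random-weight) hashing as a stand-in; it is not the actual CRUSH algorithm and ignores CRUSH's failure-domain hierarchy and weights.

```python
import hashlib

def score(object_id, device):
    """Deterministic pseudo-random score for an (object, device) pair."""
    digest = hashlib.sha256(f"{object_id}:{device}".encode()).hexdigest()
    return int(digest, 16)

def place(object_id, devices, replicas=3):
    """Pick the top-scoring devices; computable anywhere, with no central lookup table."""
    ranked = sorted(devices, key=lambda d: score(object_id, d), reverse=True)
    return ranked[:replicas]

devices = [f"osd.{i}" for i in range(8)]
print(place("obj-1234", devices))
# If one device is removed, only objects that ranked it among their replicas move.
```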
Cloud objects: still to demonstrate
- HPC bandwidth == #nodes x BW/node: only limited testing at scale, no models
- Not yet clear:
  - how it integrates with tiered storage
  - dealing with mixed workloads
Data layout: placement
How to place many stripes?
- Bottleneck in RAID arrays: rebuilding a drive runs at the bandwidth of 1 drive and takes days.
- Parity de-clustering & distributed spares: rebuild at the bandwidth of N drives (N = 60 / 600 / 6000?).
  - For e.g. 10+2 redundancy, the speedup is roughly 60/10, 600/10, etc. (worked numbers below).
  - The benefit is large: 5x to 100x+.
- The algorithms & math are hard: block mappings.
- Somewhat unproven for HPC loads.
- Cloud objects have a form of parity declustering.
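A back-of-the-envelope sketch of the rebuild speedup, using made-up but plausible numbers (4 TB drives at 150 MB/s, 10+2 stripes) and the slide's own estimate that a declustered rebuild is roughly N / (stripe data width) times faster than a single-drive rebuild.

```python
# Back-of-the-envelope rebuild times; all numbers are illustrative assumptions.
drive_capacity_tb = 4.0
drive_bw_mbps = 150.0                     # sustained MB/s per drive (raw; real rebuilds run
                                          # well below this under foreground load, hence "days")
data_shards = 10                          # the "10" in 10+2 redundancy

def classic_rebuild_hours():
    # Traditional RAID: the rebuild is throttled by one drive's bandwidth.
    seconds = drive_capacity_tb * 1e6 / drive_bw_mbps
    return seconds / 3600

def declustered_speedup(n_drives):
    # Slide's estimate: rebuild runs at the bandwidth of N drives,
    # giving a speedup of roughly N / data_shards over the classic case.
    return n_drives / data_shards

base = classic_rebuild_hours()
print(f"classic rebuild: ~{base:.1f} hours at raw drive bandwidth")
for n in (60, 600):
    s = declustered_speedup(n)
    print(f"{n}-drive declustered pool: ~{base / s:.1f} hours ({s:.0f}x faster)")
```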
Data layout: erasure codes
How to rebuild a single stripe faster:
- generalizes RAID, Reed-Solomon codes, etc.
- benefits stripe-reconstruction I/O by 1-2x
- tons of attention and publications
- if the network is the slowest component this is important: parity de-clustering is hard on the network
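As the simplest possible instance of the idea, here is a single-parity (RAID-5-style XOR) stripe: any one lost shard can be rebuilt from the survivors. Real erasure codes such as Reed-Solomon tolerate multiple losses and trade reconstruction I/O against redundancy; this sketch only shows the reconstruction step, with toy 4-byte shards.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Stripe of 4 data shards plus 1 parity shard (single-parity, RAID-5 style).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Lose shard 2; rebuild it from the surviving shards and the parity.
surviving = [d for i, d in enumerate(data) if i != 2]
rebuilt = xor_blocks(surviving + [parity])
assert rebuilt == data[2]
print(rebuilt)
```

Note that rebuilding one shard required reading every surviving shard, which is exactly the reconstruction I/O that newer erasure codes try to reduce.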
Conclusions
Conclusions
- There are many Big Data algorithms.
- There are many cloud storage solutions.
- Big data on HPC: several vendors.
- New specialized solutions (DSSD).
- More attention is needed for modeling the problems & solutions.
- Inevitably, mileage will vary depending on the problem.
Thank you