Data Centric Computing Revisited
Piyush Chaudhary, Technical Computing Solutions
SPXXL/SCICOMP Summer 2013
It is a Time of Powerful Information

Data volume is on the rise.
[Chart: dimensions of data growth through 2015 across Sensors & Devices, Social Media, VoIP, and Enterprise Data]

Dimensions of data growth:
- Volume: terabytes to exabytes of existing data to process
- Velocity: streaming data, milliseconds to seconds to respond
- Variety: structured, unstructured, text, multimedia
- Veracity: uncertainty from inconsistency, ambiguities, etc.

Bottom line: Big Data and High Performance Computing are driving systems requirements: move the compute to the data!
Maximum Insight Requires Combining Deep and Reactive Analytics

[Chart: data scale (kilo to exa) vs. decision frequency (occasional to real-time, yearly down to sub-second)]
- Deep Analytics: high performance computing on large data sets, creating a world-model context (hypotheses, predictions, history)
- Reactive Analytics: high performance computing on large streams of data, analyzing real time against the world-model context (observations, fast actions)
- Traditional data warehouse and business intelligence occupy the low end of both data scale and decision frequency
- Directly integrating Reactive and Deep Analytics enables feedback-driven insight optimization
2020: The Context-Centric Future

- Trillions of data sources: streaming data, text data, time series, geospatial, video & image, relational, social network, etc.
- Exabytes of context feeding millions of multi-dimensional analytics
- Billions of agents and user applications
- Massive parallelism, storage density, high-bandwidth, low-latency networks, and other data-centric principles must be fundamental to the ultimate solution architecture.
What is Driving the Explosive Growth of Big Data?

- Compute processing is becoming very cheap, allowing us to instrument everything
  - More sensors (more sources of data)
  - Increased resolution in sensor data (bigger data)
  - Cheaper storage (saving more data)
- An increasingly networked world allows us to gather data quickly and cheaply
  - Data can be centralized easily and acted on more effectively
- Mobile computing allows for newer ways to collect data
  - Smartphones are equipped with a variety of sensors and can continuously collect data
- Growth in social media is driving more sharing of data
Big Data Workloads and Their Evolution

- Genomics: the Human Genome Project took over 10 years to complete and cost over $3 billion. Next-generation sequencers can do it in a few days for about $1,000 and generate a terabyte of data, which means that big genomic centers can produce petabytes of data every month.
- Oil and gas: seismic exploration data is growing so fast that it has to be stored primarily on tape; it is migrated to disk-based storage before it can be operated on, and then deleted.
- Smart utilities: many electric utility companies are wiring their customers with smart meters, which generate roughly 100,000 data points per month per customer. Utility companies need to analyze all this data for capacity planning, pricing, and future investment (a back-of-envelope estimate of the resulting data volume follows this list).
- Financial services: algorithmic trading and the requirement to react quickly to changes in the market are driving the need for low-latency access to data.
- Telecommunications: mobile phones generate many CDRs for each call, text, or data usage event. Telecom providers must analyze billions of CDRs a day to improve quality, deliver services, and make investment decisions.
- Real-time traffic management: uses a mixture of real-time sensors and historical data to lower congestion, increase capacity, and reduce emissions.
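As a rough illustration of the smart-meter scale mentioned above, the sketch below estimates raw data volume for a single utility. Only the 100,000 readings per month per customer comes from the slide; the customer count and bytes per reading are hypothetical assumptions.

```python
# Back-of-envelope estimate of smart-meter data volume for one utility.
# 100,000 readings/month/customer is from the slide; the customer count
# and bytes-per-reading below are assumed purely for illustration.
CUSTOMERS = 5_000_000                 # assumed mid-size utility
READINGS_PER_CUSTOMER_MONTH = 100_000
BYTES_PER_READING = 50                # assumed: timestamp + meter id + value

readings_per_month = CUSTOMERS * READINGS_PER_CUSTOMER_MONTH
bytes_per_month = readings_per_month * BYTES_PER_READING

print(f"{readings_per_month:.1e} readings/month")           # ~5.0e11
print(f"{bytes_per_month / 1e12:.1f} TB/month raw")          # ~25 TB/month
print(f"{bytes_per_month * 12 / 1e12:.0f} TB/year raw")      # ~300 TB/year
```

Under these assumptions a single utility lands in the hundreds of terabytes per year before any replication, indexing, or derived analytics data is counted.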
Hardware and Software Challenges of Big Data Workloads

- Big Data storage has typically grown outside of enterprise storage control. This poses a serious management problem for data center managers, who must implement security controls, audit capability, backup and archiving, centralized management of storage, etc.
- The growth of scale-out systems in business has introduced the challenge of managing large numbers of servers and big networks to commercial IT staff.
- Big Data workloads tend not to share infrastructure with other applications, which has caused businesses to duplicate infrastructure for their Big Data applications.
- Adoption of a MapReduce framework forces language and storage choices that may not be ideal for the application.
Explosive Storage Growth Requires New Storage Solutions

"From the dawn of civilization until 2003, humankind generated 5 exabytes of data. Now we produce 5 exabytes every two days and the pace is accelerating." Eric Schmidt, Executive Chairman, Google

[Photo: a 5 MB IBM 305 hard drive being loaded onto an airplane in 1956; the unit weighed 1,000 kg]

- UPS stores more than 16 PB of data, from deliveries to event planning
- Monster, the online careers company, stores 5 PB of data, largely from nearly 40 million resumes
- Zynga stores 3 PB of data on the gaming habits of nearly 300 million monthly online game players
- Facebook adds 7 PB of storage every month onto its exabyte trove
- The Boeing 787 Dreamliner generates 1 TB of data for every round trip, equating to hundreds of TB daily for the entire fleet
- CERN has collected more than 100 PB of data from high-energy physics experiments over the past two decades, but 75 PB of that comes from the Large Hadron Collider in just the past three years*

* K. Davies, Best Practices in Big Data Storage, Tabor Communications, April 2013
Technologies in Big Data Storage Architectures

Businesses recognize the value of their data, but to extract that value they must first tame the data deluge: store it efficiently, organize it, and manage it before they can operate on it to gain meaningful insight.

- A scale-out data architecture can be an efficient and scalable way to add capacity and performance for Big Data solutions
- The astounding growth in data means that tape has become integral to many Big Data storage solutions
- High-speed analytics and real-time applications require low-latency access to data and are incorporating flash-based storage
- The need for both capacity and performance means that tiering of storage, and the movement of data between the tiers, is necessary (a sketch of a simple policy-driven tiering pass follows this list)
- New storage technologies, such as shingled magnetic recording (SMR), can be used to create very dense storage pools without sacrificing performance
- Data is processed by a variety of traditional and emerging workloads that have different access requirements but need to be managed seamlessly
- It is no longer enough to capture the data; it is increasingly important to collect context and annotate the data. This annotated context is used to pre-process the data before analysis, make data management decisions, correlate data with other data sources, etc.
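As a rough illustration of the policy-driven tiering idea above, the sketch below scans a fast tier and demotes files that have not been read recently to a capacity tier. The directory paths and 30-day threshold are hypothetical; a production system (for example, filesystem-native ILM policies) would do this inside the storage layer rather than with user-level file moves.

```python
import os
import shutil
import time

# Hypothetical tier locations and policy threshold (illustration only).
FAST_TIER = "/fast_tier/projects"            # e.g., flash-backed pool
CAPACITY_TIER = "/capacity_tier/projects"    # e.g., dense disk/SMR pool
COLD_AFTER_SECONDS = 30 * 24 * 3600          # demote files unread for ~30 days

def demote_cold_files() -> None:
    """Move files whose last access time is older than the threshold."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(FAST_TIER):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if now - os.stat(src).st_atime < COLD_AFTER_SECONDS:
                continue  # still "hot", leave it on the fast tier
            rel = os.path.relpath(src, FAST_TIER)
            dst = os.path.join(CAPACITY_TIER, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            print(f"demoted {rel}")

if __name__ == "__main__":
    demote_cold_files()
```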
Using HPC to Help Big Data: Enterprise-Class MapReduce Solution

Customer requirement:
- Leverage a shared, distributed set of resources and run a variety of heterogeneous compute- and data-intensive applications without the need to duplicate infrastructure
- The solution should be easy to deploy, guarantee high reliability and availability, be easy to manage, and support multiple lines of business and applications

Solution:
- Deploy a combined Platform Symphony MapReduce + GPFS-FPO solution to realize dramatic performance improvements and financial savings while delivering a more robust and flexible solution

Result:
- IBM Platform Symphony and GPFS-FPO can help accelerate Hadoop workloads while reducing cost and improving workload reliability
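For readers less familiar with the MapReduce programming model that Platform Symphony and GPFS-FPO accelerate, the sketch below is a minimal word-count job in the Hadoop-streaming style: a mapper emits (word, 1) pairs and a reducer sums the counts per word. It illustrates the model only, not the Symphony or GPFS-FPO APIs, and runs map and reduce in one process purely for demonstration.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (input sorted by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # In a real Hadoop-streaming job the map and reduce steps run as
    # separate processes across many nodes; here they run back to back
    # over stdin only to show the data flow.
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```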
Using HPC to Help Big Data: Enterprise-Class MapReduce Solution, Key Benefits

Platform Symphony MapReduce:
- Breakthrough Hadoop performance: deliver faster and more accurate analysis for Big Data applications by doing greater processing with less infrastructure
- Lower costs through reduction in infrastructure and administration overhead
- Enable business agility by supporting multiple groups and diverse workloads on a single shared cluster

GPFS-FPO:
- Allows coexistence of various analytic architectures
- Better overall performance for analytics
- Provides a more robust architecture with no single point of failure
- Provides POSIX compliance and end-to-end data management capability
- Policy-driven failure handling and faster recovery

[Charts: normalized execution time of HDFS vs. GPFS/GPFS-FPO on the CacheTest, Postmark, and Terasort benchmarks]
Using HPC to Help Big Data

- Use the energy-aware scheduling capability developed to support the needs of high-end HPC customers to deliver better energy management functions, integrated into a Big Data solution
- Most Big Data workloads are based on a sockets communication API, which does not provide a low-latency transport. Exploit user-space sockets to leverage RDMA and minimize stack overhead, delivering low-latency messaging without changing the applications (the sketch below illustrates what "without changing the applications" means)
- Use GPFS data management capabilities to provide a flexible storage architecture that meets the needs of different applications in the enterprise, both Big Data and traditional
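To make the "without changing the applications" point concrete, the sketch below is an ordinary Python TCP echo exchange written against the standard sockets API. The idea is that a user-space sockets layer exposing RDMA (for example, one interposed via LD_PRELOAD or a similar mechanism, an assumption here rather than a specific product) could carry this unmodified code over a low-latency transport; the host and port are hypothetical.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 5055   # hypothetical endpoint for the demo
ready = threading.Event()

def server():
    """Plain TCP echo server; it has no knowledge of RDMA."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        ready.set()                        # tell the client we are listening
        conn, _addr = srv.accept()
        with conn:
            conn.sendall(conn.recv(1024))  # echo the payload back

def client():
    """Plain TCP client using only the standard sockets API."""
    ready.wait()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"latency-sensitive message")
        print("echoed:", cli.recv(1024))

if __name__ == "__main__":
    threading.Thread(target=server, daemon=True).start()
    client()
```

Because neither side calls anything beyond the portable sockets API, swapping the transport underneath is a deployment decision rather than an application change, which is the point the slide makes.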