Data Centric Computing Revisited

Data Centric Computing Revisited
Piyush Chaudhary, Technical Computing Solutions
SPXXL/SCICOMP Summer 2013

Bottom line: it is a time of powerful information, and data volume is on the rise.

Dimensions of data growth:
- Volume: terabytes to exabytes of existing data to process
- Velocity: streaming data; milliseconds to seconds to respond
- Variety: structured, unstructured, text, multimedia
- Veracity: uncertainty from inconsistency, ambiguities, etc.

[Chart: projected data growth through 2015, in thousands of terabytes, by source: Enterprise Data, VoIP, Social Media, Sensors & Devices]

Big Data and High Performance Computing are driving systems requirements: move the compute to the data!

Maximum Insight Requires Combining Deep and Reactive Analytics

Directly integrating reactive and deep analytics enables feedback-driven insight optimization:
- Deep analytics: high performance computing on large data sets, creating a world-model context (hypotheses, predictions, history)
- Reactive analytics: high performance computing on large streams of data, analyzing real time against the world-model context (observations, reality, fast actions, feedback)
- Traditional data warehousing and business intelligence sit at the low end of both axes

[Chart: data scale (kilobytes to exabytes) versus decision frequency (yearly to microseconds, i.e. occasional to frequent to real time)]

2020: The Context-Centric Future

Trillions of data sources (streaming, text, multi-dimensional, time series, geospatial, video & image, relational, social network, etc.) feed exabytes of context, consumed by millions of analytics and billions of agents and user applications.

Massive parallelism, storage density, high-bandwidth low-latency networks, and other data-centric principles must be fundamental to the ultimate solution architecture.

What is Driving the Explosive Growth of Big Data?

Compute processing is becoming very cheap, allowing us to instrument everything:
- More sensors (more sources of data)
- Increased resolution in sensor data (bigger data)
- Cheaper storage (saving more data)

An increasingly networked world allows us to gather data quickly and cheaply; data can be centralized easily and acted on more effectively.

Mobile computing allows for newer ways to collect data: smartphones are equipped with a variety of sensors and can continuously collect data, and growth in social media is driving more sharing of data.

Big Data Workloads and Their Evolution

Genomics: The Human Genome Project took over 10 years to complete and cost over $3 billion. Next-generation sequencers can do it in a few days for about $1,000 and generate a terabyte of data, which means that big genomic centers can produce petabytes of data every month.

Oil and gas: Seismic exploration data is growing so fast that it has to be stored primarily on tape; it is migrated to disk-based storage before it can be operated on, and then deleted.

Smart utilities: Many electric utility companies are wiring their customers with smart meters, which generate 100,000 data points per month per customer. Utility companies need to analyze all this data for capacity planning, pricing, and future investment.

Financial services: Algorithmic trading and the requirement to react quickly to changes in the market are driving the need for low-latency access to data.

Telecommunications: Mobile phones generate many call detail records (CDRs) for each call, text, or data usage; telecom providers must analyze billions of CDRs a day to improve quality, deliver services, and make investment decisions.

Real-time traffic management: Uses a mixture of real-time sensors and historical data to lower congestion, increase capacity, and reduce emissions.
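The smart-meter figure above translates into surprising raw volumes. A back-of-the-envelope sketch, where the 100,000 readings per month comes from the slides but the per-reading record size and fleet size are illustrative assumptions:

```python
# Back-of-the-envelope estimate of smart-meter data volume.
# 100,000 readings/month/customer is from the slides; the 50-byte record
# size and 1,000,000-customer fleet are assumed for illustration.

READINGS_PER_CUSTOMER_PER_MONTH = 100_000  # from the slides
BYTES_PER_READING = 50                     # assumed: timestamp + meter id + value
CUSTOMERS = 1_000_000                      # assumed fleet size

def monthly_volume_bytes(customers: int) -> int:
    """Raw bytes of meter readings generated per month."""
    return customers * READINGS_PER_CUSTOMER_PER_MONTH * BYTES_PER_READING

tb = monthly_volume_bytes(CUSTOMERS) / 1e12
print(f"~{tb:.1f} TB of raw readings per month")  # ~5.0 TB
```

Even before indexing or replication, a million-customer utility is in multi-terabyte-per-month territory, which is why the slides frame analysis of this data as a capacity-planning problem in its own right.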

Hardware and Software Challenges of Big Data Workloads

- Big Data storage has typically grown outside of enterprise storage control. This poses a serious management problem for data center managers, who must implement security controls, audit capability, backup and archiving, centralized storage management, etc.
- The growth of scale-out systems in business has introduced commercial IT staff to the challenges of managing large numbers of servers and big networks
- Big Data workloads tend not to share infrastructure with other applications, which has caused businesses to duplicate infrastructure for their Big Data applications
- Adoption of a MapReduce framework forces language and storage choices that may not be ideal for the application

Explosive Storage Growth Requires New Storage Solutions

"From the dawn of civilization until 2003, humankind generated 5 exabytes of data. Now we produce 5 exabytes every two days and the pace is accelerating." (Eric Schmidt, Executive Chairman, Google)

[Photo: a 5 MB IBM 305 hard drive being loaded into an airplane in 1956; the unit weighed 1,000 kg]

- UPS stores more than 16 PB of data, from deliveries to event planning
- Monster, the online careers company, stores 5 PB of data, largely from nearly 40 million resumes
- Zynga stores 3 PB of data on the gaming habits of nearly 300 million monthly online game players
- Facebook adds 7 PB of storage every month onto its exabyte trove
- The Boeing 787 Dreamliner generates 1 TB of data on every round trip, equating to hundreds of TB daily for the entire fleet
- CERN has collected more than 100 PB of data from high-energy physics experiments over the past two decades, 75 PB of it from the Large Hadron Collider in just the past three years*

* K. Davies, "Best Practices in Big Data Storage," Tabor Communications, April 2013

Technologies in Big Data Storage Architectures

Businesses recognize the value of their data, but to extract that value they must first tame the data deluge: store it efficiently, organize it, and manage it before they can operate on it to gain meaningful insight.

- Scale-out data architectures can be an efficient and scalable way to add capacity and performance for Big Data solutions
- The astounding growth in data means that tape has become integral to many Big Data storage solutions
- High-speed analytics and real-time applications require low-latency access to data and are incorporating flash-based storage
- The need for both capacity and performance means that tiering of storage, and movement of data between the tiers, is necessary
- New storage technologies, such as shingled magnetic recording (SMR), can create very dense storage pools without sacrificing performance
- Data is processed by a variety of traditional and emerging workloads that have different access requirements but must be managed seamlessly
- It is no longer enough to capture the data; it is increasingly important to collect context and annotate the data. This annotated context is used to preprocess the data before analysis, make data management decisions, correlate data with other data sources, etc.
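The tiering point above can be sketched as a simple policy engine that routes files to flash, disk, or tape by access recency and size. The tier names and thresholds here are invented for illustration; a production system (e.g. a GPFS ILM policy) would express equivalent rules in its own policy language:

```python
# Illustrative sketch of policy-driven storage tiering: hot small files to
# flash, warm data to disk, cold data to tape. Thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int

def choose_tier(f: FileInfo) -> str:
    """Pick a storage tier for a file (hypothetical rules)."""
    if f.days_since_access <= 1 and f.size_bytes < 1 << 30:
        return "flash"  # hot and small: low-latency analytics tier
    if f.days_since_access <= 90:
        return "disk"   # warm: capacity tier
    return "tape"       # cold: dense archive tier

files = [
    FileInfo("/data/stream/latest.parquet", 200 << 20, 0),
    FileInfo("/data/seismic/survey-2011.sgy", 4 << 40, 400),
]
for f in files:
    print(f.path, "->", choose_tier(f))
```

A real policy engine would also act on the annotated context the slide mentions (ownership, retention class, correlation hints), not just size and age.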

Using HPC to Help Big Data: an Enterprise-Class MapReduce Solution

Customer requirement:
- Leverage a shared, distributed set of resources and run a variety of heterogeneous compute- and data-intensive applications without duplicating infrastructure
- The solution should be easy to deploy and manage, guarantee high reliability and availability, and support multiple lines of business and applications

Solution: deploy a combined Platform Symphony MapReduce + GPFS-FPO solution to realize dramatic performance improvements and financial savings while delivering a more robust and flexible solution.

Result: IBM Platform Symphony and GPFS-FPO can help accelerate Hadoop workloads while reducing cost and improving workload reliability.
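The programming model the Symphony + GPFS-FPO stack accelerates can be sketched in a few lines. This is an in-process toy showing the map, shuffle, and reduce phases only; it says nothing about Symphony's actual scheduling APIs:

```python
# Minimal in-process sketch of the MapReduce model: map emits key/value
# pairs, the shuffle groups them by key, reduce aggregates each group.
from collections import defaultdict
from typing import Iterable, Tuple, List, Dict

def map_phase(line: str) -> Iterable[Tuple[str, int]]:
    for word in line.split():
        yield word.lower(), 1

def word_count(lines: List[str]) -> Dict[str, int]:
    groups: defaultdict = defaultdict(list)
    for line in lines:                  # map + shuffle
        for key, value in map_phase(line):
            groups[key].append(value)
    return {k: sum(v) for k, v in groups.items()}   # reduce

print(word_count(["big data", "Big compute", "data"]))
# {'big': 2, 'data': 2, 'compute': 1}
```

In a real cluster the map tasks run where the data blocks live (the "move the compute to the data" principle from the opening slide), which is exactly what GPFS-FPO's locality awareness enables.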

Using HPC to Help Big Data: Key Benefits

Platform Symphony MapReduce:
- Breakthrough Hadoop performance
- Faster and more accurate analysis for Big Data applications by doing greater processing with less infrastructure
- Lower costs through reduced infrastructure and administration overhead
- Business agility: supports multiple groups and diverse workloads on a single shared cluster

GPFS-FPO:
- Allows coexistence of various analytic architectures
- Better overall performance for analytics
- A more robust architecture with no single point of failure
- POSIX compliance and end-to-end data management capability
- Policy-driven failure handling and faster recovery

[Charts: normalized execution time of the CacheTest, Postmark, and Terasort benchmarks on HDFS versus GPFS and GPFS-FPO]

Using HPC to Help Big Data

- Use the energy-aware scheduling capability, developed to support the needs of high-end HPC customers, to deliver better energy management functions integrated into a Big Data solution
- Most Big Data workloads are based on a sockets communication API, which does not provide a low-latency transport. Exploit user-space sockets to leverage RDMA and minimize stack overhead, delivering low-latency messaging without changing the applications
- Use GPFS data management capabilities to provide a flexible storage architecture that meets the needs of different applications in the enterprise, both Big Data and traditional
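The per-message overhead of the ordinary sockets path, which user-space sockets over RDMA aim to bypass, can be observed with a simple loopback timing sketch. This times an echo over a socketpair in one process; it illustrates where kernel stack overhead shows up, not RDMA itself:

```python
# Rough loopback round-trip timing over the standard sockets API.
# Each iteration pays the kernel networking stack on both directions;
# user-space sockets + RDMA would cut most of that per-message cost.
import socket
import time

def loopback_rtt_us(n: int = 1000) -> float:
    """Average round-trip time, in microseconds, over a local socketpair."""
    a, b = socket.socketpair()
    start = time.perf_counter()
    for _ in range(n):
        a.sendall(b"ping")
        b.sendall(b.recv(4))  # echo the 4-byte message back
        a.recv(4)
    elapsed = time.perf_counter() - start
    a.close()
    b.close()
    return elapsed / n * 1e6

print(f"avg loopback RTT: {loopback_rtt_us():.1f} us")
```

Even on loopback the round trip is measured in microseconds, while RDMA transports target single-digit microsecond latencies over a real network, which is the gap the slide's user-space sockets approach addresses without application changes.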