Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?


HPC2012 Workshop, Cetraro, Italy. Supercomputing and Big Data: Where Are the Real Boundaries and Opportunities for Synergy? Bill Blake, CTO, Cray, Inc.

The Big Data Challenge. Supercomputing minimizes data movement: the focus is loading the mesh in distributed memory, computing the answer with fast interconnects, and visualizing the answer in memory; the high-performance data movement is for loading, checkpointing, or archiving. Data-intensive computing is all about data movement: the focus is scanning, sorting, streaming, and aggregating "all the data all the time" to get the answer or discover new knowledge from unstructured and structured data sources.

Big Data is Data in Motion. Set the stage for the fusion of numerically intensive and data-intensive computing in future Cray systems. Build on Cray's success in delivering the most scalable systems through heterogeneous and specialized nodes: nodes optimized not only for compute but also for storage and network I/O, all connected with the highest level of interconnect performance. Add system capability to the edge of the fabric. We see this effort as increasing key application performance with an "appliance style" approach, using Cray's primary supercomputing products with extensions configured as optimized HW/SW stacks that add value around the edge of the high-performance system network.

Maximum Scalability: System Node Specialization. Key to Cray's MPP scalability is system node OS specialization combined with very high-bandwidth, low-latency interconnects. It is also a very effective approach to appliance design: Netezza and Cray are examples.

[Figure: node-specialization diagram showing service, compute, file I/O (/home, users), network I/O, and system-support/sysadmin node roles, with a price/performance chart ranging from PCs and workstations (10,000,000s of units) through enterprise servers (100,000s of units) to high-performance computing systems (1,000s of units). Courtesy of Dr. Bill Camp, Sandia National Laboratories, circa 2000.]

Back in time, at the beginnings of the Web, there was Dr. Codd and his relational database, grounded in algebraic set theory.

Analyzing Big Data Analytics: A Swim Lane View of Supporting Business and Technical Decisions

Key Function     | Language                   | Data Approach                  | Airline Example
OLTP             | Declarative (SQL)          | Structured (relational)        | ATM transactions; buying a seat on an airplane
OLAP Ad Hoc      | Declarative (SQL+UDF)      | Structured (relational)        | BI: aggregate and analyze bookings for new ad placements
Semantic Ad Hoc  | Declarative (SPARQL)       | Linked, open (graph-based)     | Analyze social graphs and infer who might travel where
OLAP Ad Hoc      | Procedural (MapReduce)     | Unstructured (Hadoop files)    | Application framework for weblog analysis
Optimize Models  | Procedural (solver libs)   | Optimization <-> simulation    | Complex scheduling; estimating empty seats
Simulate Models  | Procedural (Fortran, C++)  | Matrix math (systems of eq's)  | Mathematical modeling and simulation (designing an airplane)

For Perspective (1980s). [Swim lane: OLTP, Declarative (SQL), Structured (relational).] The relational database was invented on a system that merged server, storage, and database; it was called a mainframe. Change the focus to vector processing and memory performance, and the mainframe becomes a supercomputer.

[Figure: mainframe block diagram with CPU, memory, and I/O processors (IOPs), set against the swim-lane axis running from SMP server to cluster/MPP to cloud/grid to web scale.]

OLTP: Processing Tables with Queries. Processing millions of transactions without a hitch means doing the entire transaction or nothing (don't debit the account without dispensing cash). A disciplined approach to data forms the data models/schema tables on disk; tables at first glance look like the rows and columns of a spreadsheet. Transaction processing uses highly repetitive queries for entering and updating data. [Swim lane: OLTP, Declarative (SQL), Structured (relational).]

OLTP: Processing Tables with Queries (continued). A transaction touches only the data it needs to accomplish its purpose. Since the workload involves many, many small data payloads, speedups can be gained through caching and the use of indices that prevent full table scans. The query states what it needs, and the database uses a (cost-based) planner/optimizer to create the program that actually manipulates the tables. The workload is mostly writing to the database. A minimal sketch of such a transaction follows.
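
To make the all-or-nothing property concrete, here is a minimal sketch of an ATM-style debit using Python's built-in sqlite3 module. The table names and amounts are invented for illustration, and a production OLTP system would run on a server-class RDBMS rather than SQLite; the transactional shape, however, is the same.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts  (id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE dispensed (account_id INTEGER, amount INTEGER);
        INSERT INTO accounts VALUES (1, 100);
    """)

    def dispense_cash(conn, account_id, amount):
        # "with conn" opens a transaction that commits on success and
        # rolls back on any exception: the debit and the dispense record
        # become visible together, or not at all.
        with conn:
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ?"
                " WHERE id = ? AND balance >= ?",
                (amount, account_id, amount))  # point query on the
                                               # primary-key index, so no
                                               # full table scan is needed
            if cur.rowcount != 1:
                raise ValueError("insufficient funds")
            conn.execute(
                "INSERT INTO dispensed VALUES (?, ?)", (account_id, amount))

    dispense_cash(conn, 1, 40)    # succeeds: balance 60, one dispense row
    # dispense_cash(conn, 1, 500) would raise and leave both tables untouched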

Then the rules changed. Mainframes attacked by killer micros! Memory grew large, I/O became weak, system costs dropped, and storage moved off to the network.

[Figure: many CPUs sharing a very large memory, with I/O connecting to a storage area network.]

Capacity was added by clustering.

[Figure: multiple SMP nodes, each with CPUs, memory, and I/O, all attached to a shared storage area network. The SAN limits moving data to the processors.]

The Online Analytical Processing Problem. [Swim lane: OLAP Ad Hoc, Declarative (SQL+UDF), Structured (relational).] Business intelligence applications generate reports of aggregations and need to read all the data all the time (telecom, retail, finance, advertising, etc.). BI analytics require ad hoc queries, since you don't know the next question to ask until you understand the answer to the last question. Standard SQL is limited by the algebraic set theory basis of the RDBMS; if you need calculus, you insert user-defined functions into the SQL. Programming models come into conflict as modeling and simulation combine with business intelligence in predictive analytics. A small example of the SQL+UDF pattern follows.
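
As an illustration of extending declarative SQL with a procedural UDF, here is a minimal sketch using Python's sqlite3 module, which lets an application register a Python function and call it from SQL. The bookings table and the log-transformed fare are invented for illustration; enterprise warehouses expose analogous UDF mechanisms.

    import math
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bookings (route TEXT, fare REAL);
        INSERT INTO bookings VALUES
            ('ZRH-JFK', 420.0), ('ZRH-JFK', 610.0), ('GVA-LHR', 150.0);
    """)

    # Register a Python function as a scalar SQL UDF: math beyond the
    # relational algebra that plain SQL provides.
    conn.create_function("log_fare", 1, lambda fare: math.log(fare))

    # An ad hoc aggregation mixing declarative SQL with the procedural UDF.
    for route, n, avg_log in conn.execute(
            "SELECT route, COUNT(*), AVG(log_fare(fare))"
            " FROM bookings GROUP BY route"):
        print(route, n, avg_log)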

Operational vs. Analytical RDBMS.

[Figure: ETL data flow from OLTP databases (processing millions of transactions per second) through extraction/cleansing, transformation, and loading into OLAP databases, aka the data warehouse (processing hundreds of terabytes per hour), which feed reports.]

Data Flow: The SAN Bottleneck.

[Figure: client applications feed SMP hosts 1 through N, whose processing reaches storage over Fibre Channel links; every tier adds cost ($$$), and moving data to local applications takes hours or days.]

The Netezza Idea: Moving Processing to the Data. Active Disk architectures integrated processing power and memory into the disk units, scaling processing power as the dataset grew, and offloaded decision support algorithms to the Active Disks to support key decision support tasks. Active Disk architectures use a stream-based model that is ideal for the software architecture of relational databases. Influenced by the success of the Cisco and NetApp appliances, the approach combined software, processing, networking, and storage, leading to the first data warehouse appliance! (Netezza is an IBM company.)

The HW Architecture Responded to the Needs of the Database Application.

[Figure: Netezza Performance Server architecture. BI applications, clients, and local applications connect via ODBC 3.x, JDBC Type 4, and SQL/92 to a Linux SMP host front end (SQL compiler, query plan optimizer, admin, execution engine, DBOS, fast loader/unloader). Gigabit Ethernet fans out to 1000+ Snippet Processing Units (SPUs), each a processor plus streaming DB logic, forming a high-performance database engine that streams joins, aggregations, sorts, etc. over massively parallel intelligent storage. Move processing to the data (maximum I/O to a single table).]
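
A rough way to see why pushing the scan down matters: let each storage unit stream and filter its own partition, so the interconnect carries results rather than whole tables. The sketch below is only a conceptual model of a snippet-style scan, not Netezza's actual software; all names and data in it are invented.

    # Conceptual sketch: each "SPU" filters its local partition and
    # returns only matching rows to the host.

    def spu_scan(partition, predicate):
        # Runs next to the data: stream every local row once,
        # emit only the rows the query actually needs.
        for row in partition:
            if predicate(row):
                yield row

    def host_query(partitions, predicate):
        # The host merely aggregates the pre-filtered streams.
        results = []
        for partition in partitions:   # in hardware these scans run in parallel
            results.extend(spu_scan(partition, predicate))
        return results

    partitions = [
        [("ZRH-JFK", 420.0), ("GVA-LHR", 150.0)],   # disk 1
        [("ZRH-JFK", 610.0), ("ZRH-SFO", 980.0)],   # disk 2
    ]
    print(host_query(partitions, lambda row: row[0] == "ZRH-JFK"))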

Shifting from Analyst to Programmer. [Swim lane: OLAP Ad Hoc, Procedural (MapReduce), Unstructured (Hadoop files).] Google, Yahoo, and their friends are using a data-intensive application framework to analyze very large datasets (e.g., weblogs) without transactions or structured data. What will this look like in 10 years? MapReduce/Hadoop is a programming model and application framework performing group-by (map), sort, and aggregation (reduce). These are not queries but programs, willing to forgo transactional integrity and the performance of structured data (a 4x-5x disadvantage on equal hardware, but with excellent scaling on cheap hardware). It is an increasingly popular approach with organizations that have the programming talent to use it, especially research organizations, and another frontal assault on $40K-per-socket RDBMS licensing.
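
The map/shuffle/reduce shape is easy to show in a few lines of plain Python; Hadoop runs this same pattern in distributed form across a cluster and a distributed file system. The weblog lines below are invented for illustration.

    from itertools import groupby

    weblog = [
        "10.0.0.1 GET /index.html",
        "10.0.0.2 GET /checkout",
        "10.0.0.1 GET /index.html",
    ]

    # Map: emit a (key, value) pair for every record.
    mapped = [(line.split()[2], 1) for line in weblog]

    # Shuffle/sort: bring equal keys together (Hadoop does this between phases).
    mapped.sort(key=lambda kv: kv[0])

    # Reduce: aggregate the values for each key.
    counts = {key: sum(v for _, v in group)
              for key, group in groupby(mapped, key=lambda kv: kv[0])}
    print(counts)   # {'/checkout': 1, '/index.html': 2}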

Analysis is Getting Complex. Multi-dimensional predictive queries span connection networks, social networks, time, space, and reasoning. For example: find out why the customers who live on 25th Street in Zurich did not switch their phone company between March 23 and August 19 while their friends and family did, and predict the trend going forward. We need to pick weak signals out of the noise. Data size will continue to grow and become more heterogeneous, so finding the needle in the haystack will become more important and will require real-time or shorter turnaround. Differentiation will be based on the speed of analysis of changing data, and INFERENCE may emerge as a legitimate answer.

Advancing from Generating Reports to Inference and Discovery. [Swim lane: Semantic Ad Hoc, Declarative (SPARQL), Linked, open (graph-based).] The international W3C standards body has approved the key standards, RDF and OWL, to support the Semantic Web (aka Web 3.0) with machine-readable open linked data. Future databases will use triples (subject-predicate-object) rather than tables and, with RDF/OWL, will federate heterogeneous data; they will support reasoning, not just reporting. This work started as a combined European defense and DARPA effort. Major RDBMS vendors are admitting that relational and XML models are ill-suited to the needs of the semantic web of the future.
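
To ground the triples-versus-tables point, here is a minimal sketch using rdflib, an open-source Python RDF library (chosen only for illustration, not the system described in this talk), to assert subject-predicate-object triples and query them with SPARQL. The people and the "knows" relation are invented.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    # Each fact is a (subject, predicate, object) triple, not a table row.
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.bob, EX.knows, EX.carol))
    g.add((EX.alice, EX.travelsTo, EX.Zurich))

    # SPARQL pattern matching over the graph: whom do Alice's contacts know?
    query = """
        PREFIX ex: <http://example.org/>
        SELECT ?friend ?friend_of_friend WHERE {
            ex:alice ex:knows ?friend .
            ?friend ex:knows ?friend_of_friend .
        }
    """
    for friend, fof in g.query(query):
        print(friend, fof)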

The Idea: Address Memory and Network Latency. Decision support algorithms are offloaded to multithreaded processors and an in-memory database for new, complex decision support tasks. The system supports SPARQL query processing over an RDF triples database, offering the speedup of XMT processing without a low-level API: shared-memory, multithreaded Cray technology integrated with a semantic database (open-source, standards-compliant), delivering the fastest complex query response on open linked data in the industry. Influenced by the success of data warehouse appliances, our combination of database software, processing, networking, and memory led to the first graph appliance! It delivers easy-to-deploy solutions for knowledge discovery in intelligence, bioinformatics, finance, etc.

Now a new tool for data analysis: performs like a supercomputer, uses open Web 3.0 standards, operates like a data warehouse, and scales like a web engine.

Adapt the system to the application, not the application to the system. Vision: Exascale. Cray's Adaptive Supercomputing combines multiple processing architectures (CPU, GPU, or multithreaded) into a single scalable system. The focus is on the user's application: the adaptive software, the compiler or query processor, knows what types of processors are available on the heterogeneous system and targets code to the most appropriate processor. The next step is to evolve Adaptive Supercomputing to Big Data workloads. A toy sketch of this kind of dispatch follows.
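
As a toy model of the idea (not Cray's software), the sketch below routes each task to whichever processor class best matches its workload profile; the processor classes, task attributes, and classification rule are all invented for illustration.

    # Toy model: route each task to the best-matching processor class.
    EXECUTORS = {
        "scalar":        "CPU",           # branchy, latency-sensitive code
        "data_parallel": "GPU",           # dense, regular array math
        "irregular":     "multithreaded", # pointer-chasing graph traversals
    }

    def classify(task):
        # Invented heuristic standing in for what a compiler or query
        # processor would infer from the code and its data access pattern.
        if task["parallelism"] > 1000 and task["regular_access"]:
            return "data_parallel"
        if task["parallelism"] > 1000:
            return "irregular"
        return "scalar"

    for task in [
        {"name": "mesh kernel",  "parallelism": 100000, "regular_access": True},
        {"name": "graph query",  "parallelism": 50000,  "regular_access": False},
        {"name": "control loop", "parallelism": 4,      "regular_access": True},
    ]:
        print(task["name"], "->", EXECUTORS[classify(task)])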

Enabling Simulation and Data Science: Adaptive Supercomputing.