Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Similar documents
Zenith: a hybrid P2P/cloud for Big Scientific Data

Big Data Explained. An introduction to Big Data Science.

Oracle Big Data SQL Technical Update

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Spatio-Temporal Networks:

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

How To Scale Out Of A Nosql Database

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Big Data Analytics Best Practices

How To Handle Big Data With A Data Scientist

Data Centric Systems (DCS)

Chapter 7. Using Hadoop Cluster and MapReduce

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Oracle Big Data Strategy Simplified Infrastrcuture

Integrating a Big Data Platform into Government:

Hadoop implementation of MapReduce computational model. Ján Vaňo

HPC technology and future architecture

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data and Transactional Databases Exploding Data Volume is Creating New Stresses on Traditional Transactional Databases

Big Data Challenges in Bioinformatics

Hadoop and Map-Reduce. Swati Gore

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Challenges for Data Driven Systems

Big Data and Your Data Warehouse Philip Russom

Integrating Big Data into the Computing Curricula

Big Data a threat or a chance?

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Big Data Technologies Compared June 2014

The Future of Data Management

Data Refinery with Big Data Aspects

Cloud & Big Data a perfect marriage? Patrick Valduriez

NoSQL for SQL Professionals William McKnight

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Big Data and Analytics in Government

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Data Analytics at NERSC. Joaquin Correa NERSC Data and Analytics Services

Applications for Big Data Analytics

Are You Ready for Big Data?

An Oracle White Paper June Oracle: Big Data for the Enterprise

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Testing 3Vs (Volume, Variety and Velocity) of Big Data

The future: Big Data, IoT, VR, AR. Leif Granholm Tekla / Trimble buildings Senior Vice President / BIM Ambassador

An Oracle White Paper October Oracle: Big Data for the Enterprise

CISC 432/CMPE 432/CISC 832 Advanced Database Systems

Reference Architecture, Requirements, Gaps, Roles

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Large scale processing using Hadoop. Ján Vaňo

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Information Processing, Big Data, and the Cloud

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Big Data: Are You Ready? Kevin Lancaster

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

SAP HANA Vora : Gain Contextual Awareness for a Smarter Digital Enterprise

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

A Big Data Storage Architecture for the Second Wave David Sunny Sundstrom Principle Product Director, Storage Oracle

Hadoop. Sunday, November 25, 12

Evaluating MapReduce and Hadoop for Science

Big Data and Analytics: Challenges and Opportunities

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

MapReduce with Apache Hadoop Analysing Big Data

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

On a Hadoop-based Analytics Service System

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Oracle Database 12c Plug In. Switch On. Get SMART.

The 4 Pillars of Technosoft s Big Data Practice

Sanjeev Kumar. contribute

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

COMP9321 Web Application Engineering

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

BIG DATA AND ANALYTICS

Open source large scale distributed data management with Google s MapReduce and Bigtable

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

This Symposium brought to you by

Advanced In-Database Analytics

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Testing Big data is one of the biggest

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Cray: Enabling Real-Time Discovery in Big Data

Cloud Scale Distributed Data Storage. Jürmo Mehine

Transcription:

Data-intensive HPC: opportunities and challenges Patrick Valduriez

Big Data Landscape Multi-$billion market! Big data = Hadoop = MapReduce? No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop Keeps evolving 2

Big Data Killer App.: Analytics Objective: find useful information and discover knowledge in data Typical uses: forecasting, decision making, research, science, Techniques: data analysis, data mining, machine learning, Why is this hard? External data from various sources Hard to verify and assess, hard to integrate Low information density (unlike in corporate data) Like searching for needles in a haystack Different structures Unstructured text, semi-structured document, key/value, table, array, graph, stream, time series, etc. Hard to integrate Simple machine learning models don't work Reality and context matter See "When big data goes bad" stories 3

Data Management: basic principle Data independence Hides implementation details Provision for advanced services and sharing Schema and metadata Queries (SQL) Optimization (algebra) Automatic parallelization Indexes Transactions Consistency Access control Application Generic techniques => Data Logical View User productivity and Economy of scale Application Data 4

Parallel data processing Inter-query (task parallelism) Different queries on the same data set (or data stream) For lots of concurrent queries e.g. Google search Inter-operation (task parallelism) Different operations of the same query on different data sets For complex queries e.g. RDBMS Intra-operation (data parallelism) The same operation on different data partitions of the same data set For ops on big data sets e.g. MapReduce Q 1 Q n R Op 3 Op 1 Op 2 R S Op Op R 1 R n 5

(Structured) Database Systems Volume Parallel database systems E.g. Oracle Exadata database machine Velocity Data stream management systems E.g. Streambase Variety Data integration systems E.g. DB2 Information Integrator 6

HPC: from simulation to big data analytics (BDA) Simulation: traditional HPC with big data e.g. weather forecasting, climate modeling HPC applications need to deal with more and more data More input data, e.g. from more powerful scientific instruments or sensor networks More output data, e.g. with smarter mathematical models and algorithms Analytics: a newer, complementary market for HPC Analytics methods applied to established HPC domains in industry, government, academia High-end commercial analytics pushing up into HPC Shorter journey from science to business, e.g. from cancer genomics to personalized medicine 7

Some BDA Killer Apps Real-time processing and analysis of raw data from highthroughput scientific instruments Uncertainty quantification in data, models, and experiments E.g. to measure the reliability of simulations involving complex numerical models Fraud detection across massive databases Applicable in many domains (e-commerce, banking, telephony, etc.) National security Signal intelligence (SIGINT), anomaly detection, cyber analytics Anti-terrorism (including evacuation planning), anti-crime Health care/medical science Drug design, personalized medicine Epidemiology Systems biology Social network analysis Modeling, simulation, visualization of large-scale networks 8

Cyber Analytics* Goal: identify malicious activity in high-throughput streaming data More than 10 billion transactions/day Tens of millions of unique IP addresses observed each month Adjacency matrix may contain over a quadrillion elements but is sparse, with billions of values Tens of TBs to PBs of raw data Patterns can span seconds, months Current data analysis tools operate on thousands to hundreds of thousands of records * J.R. Johnson. High Performance Computing for Data Intensive Science. 2014. 9

Risers Fatigue Analysis (RFA)* Context: joint project between Zenith, UFRJ and LNCC, Rio de Janeiro (and Petrobras) Pumping oil from ultra-deep water from thousand meters up to the surface through risers Problems RFA requires a complex workflow of data-intensive activities which may take like 40 hours (SGI Altix ICE 8200 with 64 CPUs Quad Core cluster) Input: 1,000 files (about 1 GB) Riser information, such as finite element meshes, winds, waves and sea currents, and case studies Output: 6,000 files (about 100 GB) 10 activities (dynamic analysis of risers movements, tension analysis, curvature analysis, merging data from previous activities, etc.) An algebraic approach (inspired by relational algebra) and a parallel execution model 4-fold improvement versus hard-coded parallel workflow * E. Ogasarawa, J. Dias, D. Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-Centric Scientific Workflows. VLDB 2011. 10

BDA vs. HPC Computing model Data storage Parallel file management Programming model BDA Data-centric: move tasks to data and Reduce Uniform storage (sharding) on disks Designed for few big files, e.g. HDFS Algebraic operators, e.g. MapReduce, Spark Languages Java, Python, C++ C, C++ HPC Compute-centric: move data to tasks and Accelerate Hierarchical storage (disks, tapes, etc.) Designed for many small files, e.g. Lustre MPI versus OpenMP Opportunities HPC: smarter file systems, with metadata and indexes; highlevel programming frameworks, e.g. MapReduce-MPI BDA: compute-centric nodes to run heavy tasks (machine learning) 11

Challenges for Data-intensive HPC Data-intensive HPC = HPC + BDA Collaboration from two big research communities Data management and HPC Is it a billion$ market? Some hard problems Architecture: how does it fit with IT, data centers and cloud? No one-size-fits-all solution Data storage: how to combine uniform and hierarchical storage models? Data movement between applications and tasks HPC data-oriented programming frameworks Tools for data cleaning, data integration, data analysis 12

Thanks