NERSC REQUIREMENTS FOR ENABLING CS RESEARCH IN EXASCALE DATA ANALYTICS

NERSC REQUIREMENTS FOR ENABLING CS RESEARCH IN EXASCALE DATA ANALYTICS. Nagiza F. Samatova, samatovan@ornl.gov, Oak Ridge National Laboratory and North Carolina State University. January 5, 2011

ACKNOWLEDGEMENTS
- DOE ASCR for funding CS research in exascale data analytics
- Arie Shoshani, Alok Choudhary, Rob Ross, and the other PIs and co-PIs on the projects
- LCF Facilities at ORNL
- Application scientists: C.S. Chang (ORNL), Stephane (PPPL), Fred Semazzi (NCSU), and many others

HARDWARE FOR DATA ANALYTICS IS A FOSTER CHILD
The HW configuration for data analytics (DA) is an afterthought:
- It has traditionally been optimized for running simulations
- Whatever is left over is what the data analyst has to live with
- DA-driven HW must become a first-class citizen on the agenda if we are serious about exascale
The infrastructure depends on the DA modality:
- In situ?
- Distributed or streamline fashion?
- Local or global context analysis?
- Shared among a group of collaborators?
- Linked to experimental and/or other data archives?

In contrast to simulations, Data Analytics requires a different mix of memory, disk storage, & communication trade-offs.

Twice-a-month production runs. Archive data (~2TB, grows exponentially):
- 100MB files
- 100s-1000s of scans in each file
SEQUEST: 14-24 hours per file, one scan accessed at a time. Results (~2TB)

DOMAIN-SPECIFIC REALIZATION OF THE SW STACK: Plug-ins

END-TO-END DATA ANALYTICS SOFTWARE STACK IS COMPLEX: GENERIC (ALL APPLICATIONS) PERSPECTIVE
- Domain Application Layer: Biology, Climate, Fusion (exemplar: CDAT)
- Interface Layer: Dashboard, Web Service, Workflow (exemplar: Kepler)
- Middleware Layer: Automatic Parallelization, Scheduling, Plug-in (exemplar: pR) -- the focus of my talk
- Analytics Core Library Layer: Parallel, Distributed, Streamline (exemplar: RScaLAPACK)
- Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing (exemplars: FastBit, pnetcdf)

THE LESSON LEARNED FROM LINEAR ALGEBRA
- Dwarfs: Sparse, Dense, etc.
- Libraries: ScaLAPACK, PETSc, etc.
- Preconditioner + solver: a preconditioner conditions a given problem into a form that is more suitable for numerical solution by a highly optimized computational kernel.
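
For reference, the standard linear-algebra formulation of this idea (my summary, not text from the slide): instead of handing A x = b directly to the solver, one applies a preconditioner M that approximates A but is cheap to invert, so the kernel works on a better-conditioned system,

    M^{-1} A x = M^{-1} b,   with   \kappa(M^{-1} A) \ll \kappa(A),

where \kappa denotes the condition number; the highly optimized kernel (e.g., a Krylov solver) then needs far fewer iterations on the transformed problem.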

SOFTWARE FOR DATA ANALYTICS IS MORE AD HOC
Should we adopt this approach from linear algebra for data analytics at extreme scale? If so, then:
- What are the dwarfs for data analytics?
- What about the preconditioners?
- What are the computational kernels?
- Do we (should we) have a ScaLAPACK-like library for exascale data analytics?
- What should/could NERSC offer to enable this activity?

COMPUTATIONAL KERNELS CONCEPT IS PROMISING
The frequency of kernel operations in illustrative data mining algorithms and applications:

Application    | Kernel 1 (%)    | Kernel 2 (%)   | Kernel 3 (%)    | Top-3 Sum (%)
K-means        | Distance (68)   | Center (21)    | mindist (10)    | 99
Fuzzy K-means  | Center (58)     | Distance (39)  | fuzzysum (1)    | 98
BIRCH          | Distance (54)   | Variance (22)  | redist. (10)    | 86
HOP            | Density (39)    | Search (30)    | Gather (23)     | 92
Naïve Bayesian | probcal (49)    | Variance (38)  | dataread (10)   | 97
ScalParC       | Classify (37)   | ginicalc (36)  | Compare (24)    | 97
Apriori        | Subset (58)     | dataread (14)  | Increment (8)   | 80
Eclat          | Intersect (39)  | addclass (23)  | invertc (10)    | 72
SVMlight       | quotmatrix (57) | quadgrad (38)  | quotupdate (2)  | 97

Source: Alok Choudhary, NWU, NU-MineBench
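
To make the table concrete, here is a minimal sketch in R (my own illustration, not code from NU-MineBench) of the Distance kernel that dominates K-means; for n points, k centers, and d dimensions it costs O(nkd) per iteration, which is why it accounts for roughly 68% of the runtime above.

    # Hypothetical illustration: the distance kernel that dominates K-means.
    # X: n x d matrix of points; C: k x d matrix of current cluster centers.
    # Returns an n x k matrix of squared Euclidean distances.
    distance_kernel <- function(X, C) {
      # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed with dense matrix products
      xx <- rowSums(X^2)                       # length-n vector
      cc <- rowSums(C^2)                       # length-k vector
      outer(xx, cc, "+") - 2 * (X %*% t(C))    # n x k matrix
    }

    # Toy usage: 1000 points, 8 centers, 16 dimensions
    X <- matrix(rnorm(1000 * 16), 1000, 16)
    C <- X[sample(nrow(X), 8), ]
    D <- distance_kernel(X, C)
    assignment <- max.col(-D)                  # nearest-center step (the "mindist" kernel)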

WHAT ABOUT PRECONDITIONERS FOR DATA ANALYTICS?
How do we define a preconditioner for data analytics? We can solve a problem P_hard:
- Directly, or
- Indirectly (via a preconditioner): reduce the hard problem P_hard to a better problem P_better.
"Better" can mean: increased throughput, faster time-to-solution, a more accurate solution, a higher data compression rate, or an approximate but real-time solution.

IF WE ARE LUCKY and Jack Dongarra did most of the work for us:
- Some data analysis routines call linear algebra functions
- In R, these are built on top of the LAPACK library
- RScaLAPACK is an R wrapper library around the ScaLAPACK library

A = matrix(rnorm(256), 16, 16)
b = as.vector(rnorm(16))

Using R:
solve(A, b)
La.svd(A)
prcomp(A)

Using RScaLAPACK:
library(RScaLAPACK)
sla.solve(A, b)
sla.svd(A)
sla.prcomp(A)

WHAT IF WE ARE NOT THAT LUCKY?

IN SITU PRECONDITIONERS FOR SCIENTIFIC DATA COMPRESSION
Myth: scientific data is almost incompressible.
GTS fusion simulation data (Stephane, PPPL):
- C&R data: ~2TB per checkpoint/restart, every 1 hour; two copies, keep the last copy
- Analysis data: ~2TB per run (now), every 10th time step; cannot afford to store it all because of analysis routines and I/O reads; Matlab analysis routines
- V&V data: small, every 2nd time step
Expected: 10-fold increase by 2012-2014

FROM C.S. CHANG'S TALK AT NERSC

S-PRECONDITIONER FOR ANALYSIS DATA COMPRESSION
Analysis data is stored every N-th time step (lossy data reduction):
- The data is almost random, hence hard or impossible to compress (<10% lossless)
- N is defined ad hoc (N=10 for GTS, N=100 for Supernova)
[Plot: Pearson correlation vs. compression ratio for GTS analysis data, S-preconditioned vs. wavelet compression, compression ratios 87.0%-87.5%.] As the compression ratio grows from 87.06% to 87.44%, the Pearson correlation drops from 0.994 to 0.937.
Stephane, PPPL: "With this data quality and data reduction rate, I can test many more hypotheses using my analysis tools."
PhD students: Ye Jin and Sriram Lakshminarasimhan
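
A minimal sketch (my own illustration with synthetic data and a stand-in quantizer, not the S-preconditioner) of how the quality metric in the plot can be computed: reconstruct the lossily reduced field and report its Pearson correlation with the original.

    # Hypothetical illustration of the quality metric used above:
    # Pearson correlation between the original field and its lossy reconstruction.
    set.seed(1)
    original <- rnorm(1e5)                    # stand-in for one GTS analysis variable

    # Stand-in "lossy reduction": quantize to 8 bits over the data range
    # (NOT the S-preconditioner; just a placeholder reconstruction)
    lo <- min(original); hi <- max(original)
    codes <- round((original - lo) / (hi - lo) * 255)
    reconstructed <- lo + codes / 255 * (hi - lo)

    quality <- cor(original, reconstructed, method = "pearson")
    cat(sprintf("Pearson correlation after lossy reduction: %.4f\n", quality))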

BFA-PRECONDITIONER FOR C&R DATA COMPRESSION
C&R data compression must be lossless and must be fast.
Impact of the BFA preconditioner:
- 8x throughput increase for bzip
- 4x throughput increase for gzip
- 1.41 compression ratio (CR) for zpaq with the BFA preconditioner
- 1.33 vs. 1.17 CR for bzip2 with vs. without the BFA preconditioner
- 1.32 vs. 1.19 CR for zlib with vs. without the BFA preconditioner
PhD student: Eric Schendel; undergraduate: Nel Shah
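
A hedged sketch of the measurement itself (the BFA preconditioner is not reproduced here): compression ratio and throughput for standard lossless codecs, using R's built-in memCompress.

    # Hypothetical measurement harness for lossless compression ratio and throughput.
    # CR is reported as uncompressed bytes / compressed bytes; the BFA preconditioner
    # would be applied to `payload` before this loop.
    payload <- serialize(matrix(rnorm(1e6), ncol = 1000), connection = NULL)  # raw bytes

    for (codec in c("gzip", "bzip2", "xz")) {
      elapsed <- system.time(compressed <- memCompress(payload, type = codec))["elapsed"]
      cr <- length(payload) / length(compressed)
      mbps <- (length(payload) / 2^20) / elapsed
      cat(sprintf("%-6s  CR = %.2f   throughput = %.1f MB/s\n", codec, cr, mbps))
    }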

BC-PRECONDITIONER FOR UNDERDETERMINED CLASSIFICATION PROBLEMS (BENCH)
Bayesian Belief Network (BBN) based preconditioning:
- Accuracy increases by 13%-16% across different classifiers
- On data with <100 samples and d = 4,000-7,000 dimensions (underdetermined problems)
- When applied to seasonal hurricane prediction (d > 35k), correlation with observations improved from 0.64 to 0.92-0.96
PhD students: Ann Pansombut, William Hendrix, Zhengzhang Chen, Doel Gonzalez
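
To illustrate the underdetermined setting (a toy sketch with synthetic data; this is not the BC preconditioner or the hurricane data), the example below evaluates a simple nearest-centroid baseline with 5-fold cross-validation on n = 80 samples in d = 5,000 dimensions.

    # Toy underdetermined problem: n = 80 samples, d = 5000 dimensions (n << d).
    set.seed(42)
    n <- 80; d <- 5000
    y <- rep(c(0, 1), each = n / 2)
    X <- matrix(rnorm(n * d), n, d)
    X[y == 1, 1:20] <- X[y == 1, 1:20] + 0.5      # weak signal in 20 of 5000 dimensions

    # Baseline classifier: assign each test point to the nearer class centroid
    nearest_centroid <- function(Xtr, ytr, Xte) {
      c0 <- colMeans(Xtr[ytr == 0, , drop = FALSE])
      c1 <- colMeans(Xtr[ytr == 1, , drop = FALSE])
      d0 <- rowSums(sweep(Xte, 2, c0)^2)
      d1 <- rowSums(sweep(Xte, 2, c1)^2)
      as.integer(d1 < d0)
    }

    folds <- sample(rep(1:5, length.out = n))      # 5-fold cross-validation
    acc <- sapply(1:5, function(k) {
      test <- folds == k
      pred <- nearest_centroid(X[!test, ], y[!test], X[test, , drop = FALSE])
      mean(pred == y[test])
    })
    cat(sprintf("5-fold CV accuracy: %.2f (+/- %.2f)\n", mean(acc), sd(acc)))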

SE-PRECONDITIONER FOR CONTRASTING FREQUENT SUBGRAPH MINING (NIBBS-SEARCH)
Exact algorithm versus NIBBS-Search (98 genome-scale metabolic networks: 49 positive, 49 negative):
- The runtime of the exact algorithm grows exponentially (unable to complete the run)
- The NIBBS-Search algorithm completes in a matter of seconds
- Empirical tests show that the NIBBS-Search subgraphs are significantly close approximations of maximally-biased subgraphs
PhD students: Matt Schmidt, Andrea Rocha

DARK FERMENTATIVE BIO-HYDROGEN PRODUCTION PATHWAYS ARE IDENTIFIED WITH NIBBS-SEARCH
PhD students: Andrea Rocha and Kanchana; undergraduate: Katie Shpanskaya

CS DATA ANALYSIS RESEARCH WORKFLOW: AN ITERATIVE PROCESS WITH SIGNIFICANT RESOURCE NEEDS
- A user (e.g., Stephane, PPPL) brings a problem and sample data (GBs)
- Generate idea(s)
- Prototype and evaluate: R, pR, Matlab, Python; performance tools, cross-validation (see the sketch below)
- HPC implementation: MPI, CUDA-GPU, etc.; many data sets (from different applications), SUPER-production runs (~200 TBs), many parameter-sensitivity analysis runs
- Publish; put into production
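
A minimal sketch of the prototyping step (my own illustration): a parameter-sensitivity sweep written once as a plain lapply over a GB-scale sample and then run in parallel with R's built-in parallel package; the full ~200 TB production sweep would be a pR/MPI re-implementation of the same analysis function.

    # Hypothetical parameter-sensitivity sweep, prototyped serially on a sample
    # and then parallelized; the production-scale sweep would reuse analyze()
    # inside an MPI/pR driver rather than mclapply.
    library(parallel)

    analyze <- function(threshold, data) {
      # stand-in analysis kernel: fraction of values above the threshold
      mean(abs(data) > threshold)
    }

    sample_data <- rnorm(1e6)                      # GB-scale sample stands in here
    thresholds <- seq(0.5, 3.0, by = 0.1)

    serial_res   <- lapply(thresholds, analyze, data = sample_data)
    parallel_res <- mclapply(thresholds, analyze, data = sample_data,
                             mc.cores = 2)         # mclapply forks; on Windows use parLapply

    stopifnot(identical(serial_res, parallel_res)) # same answers, prototyped once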