Data Analytics at NERSC. Joaquin Correa JoaquinCorrea@lbl.gov NERSC Data and Analytics Services



Similar documents
NERSC Data Efforts Update Prabhat Data and Analytics Group Lead February 23, 2015

NERSC File Systems and How to Use Them

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Data Requirements from NERSC Requirements Reviews

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Kriterien für ein PetaFlop System

How To Create A Data Visualization With Apache Spark And Zeppelin

Scaling Out With Apache Spark. DTL Meeting Slides based on

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

Customer Case Study. Automatic Labs

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Shark Installation Guide Week 3 Report. Ankush Arora

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Ali Ghodsi Head of PM and Engineering Databricks

Building a Top500-class Supercomputing Cluster at LNS-BUAP

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

Introduction Installation Comparison. Department of Computer Science, Yazd University. SageMath. A.Rahiminasab. October9, / 17

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Databricks. A Primer

Big Data Research in the AMPLab: BDAS and Beyond

Scientific Programming, Analysis, and Visualization with Python. Mteor 227 Fall 2015

Spark and the Big Data Library

Bringing Big Data Modelling into the Hands of Domain Experts

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Databricks. A Primer

Big Data Challenges in Bioinformatics

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Overview of HPC Resources at Vanderbilt

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Data-Intensive Applications on HPC Using Hadoop, Spark and RADICAL-Cybertools

How Companies are! Using Spark

Quick Reference Selling Guide for Intel Lustre Solutions Overview

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic


Case Study : 3 different hadoop cluster deployments

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Big Data and Analytics: Challenges and Opportunities

TUT NoSQL Seminar (Oracle) Big Data

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

COMP9321 Web Application Engineering

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

Mendel at NERSC: Multiple Workloads on a Single Linux Cluster

Hadoop Ecosystem B Y R A H I M A.

The Internet of Things and Big Data: Intro

Dr. Raju Namburu Computational Sciences Campaign U.S. Army Research Laboratory. The Nation s Premier Laboratory for Land Forces UNCLASSIFIED

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

Enabling High performance Big Data platform with RDMA

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Unlocking the True Value of Hadoop with Open Data Science

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Overview of HPC systems and software available within

Bayesian networks - Time-series models - Apache Spark & Scala

Unified Big Data Analytics Pipeline. 连 城

New Storage System Solutions

Microsoft Research Windows Azure for Research Training

How To Scale Out Of A Nosql Database

Depth and Excluded Courses

Hadoop on the Gordon Data Intensive Cluster

Big Data Analysis: Apache Storm Perspective

Data Intensive Science and Computing

Brave New World: Hadoop vs. Spark

Parallel Programming Survey

Microsoft Research Microsoft Azure for Research Training

SGI High Performance Computing

Data Centric Systems (DCS)

Workflow Tools at NERSC. Debbie Bard NERSC Data and Analytics Services

Mississippi State University High Performance Computing Collaboratory Brief Overview. Trey Breckenridge Director, HPC

REAL-TIME STREAMING ANALYTICS DATA IN, ACTION OUT

How To Write A Trusted Analytics Platform (Tap)

Big Data Explained. An introduction to Big Data Science.

Architectures for massive data management

Making big data simple with Databricks

Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers

Clusters: Mainstream Technology for CAE

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

What s next for the Berkeley Data Analytics Stack?

Unified Big Data Processing with Apache Spark. Matei

Large scale processing using Hadoop. Ján Vaňo

HPC Wales Skills Academy Course Catalogue 2015

HPCHadoop: MapReduce on Cray X-series

Introduction to ACENET Accelerating Discovery with Computational Research May, 2015

HPC technology and future architecture

ANALYTICS CENTER LEARNING PROGRAM

An Introduction to High Performance Computing in the Department

Architectures for Big Data Analytics A database perspective

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Transcription:

Data Analytics at NERSC Joaquin Correa JoaquinCorrea@lbl.gov NERSC Data and Analytics Services NERSC User Meeting August, 2015

Data analytics at NERSC Science Applications Climate, Cosmology, Kbase, Materials, BioImaging, Your science! Analytics Capabilities Statistics, Machine Learning Image Processing Graph Analytics Database Operations Tools + Libraries R, python, MLBase MATLAB OMERO, Fiji GraphX SQL Runtime Framework MPI Spark SciDB Resource Management Filesystems (Lustre), Batch/Queue Systems Hardware SandyBridge/KNL chipset, Burst Buffers, Aries Interconnect

Data analytics at NERSC Analytics Capabilities Statistics, Machine Learning Image Processing Graph Analytics Database Operations Tools + Libraries R, python, MLBase MATLAB OMERO, Fiji GraphX SQL Runtime Framework MPI Spark SciDB

Talk Overview Data analytics tools Data insight Scale your analysis -4-

Talk Overview Data analytics tools Data insight Scale your analysis -5-

R is a language and environment for statistical computing and graphics. It provides a wide variety of statistical tools, such as linear and nonlinear modelling, classical statistical tests, timeseries analysis, classification, clustering, graphics, and it is highly extensible. -6-

MATLAB is a technical computing language that integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Toolboxes: MATLAB 16 Image Processing 2 Neural networks 1 Optimization 2 Parallel computing 2 Signal processing 1 Statistics 2 Compiler 1-7-

Mathematica Mathematica is a fully integrated environment for technical computing. It performs symbolic manipulation of equations, integrals, differential equations, and most other mathematical expressions. Numeric results can be evaluated as well. -8-

Fiji Is Just ImageJ - Fiji is an image processing package. It can be described as a "batteries-included" distribution of ImageJ (and ImageJ2), bundling Java, Java3D and a lot of plugins organized into a coherent structure. -9-

- 10 -

Scientific Computing Tools for Python NumPy The SciPy library Matplotlib pandas SymPy IPython nose Cython Scikits h5py mpi4py - 11 -

Deep learning at NERSC neon is an easy to use, python-based scalable Deep Learning library. Deep Learning has recently achieved state-of-the-art performance in a wide range of domains including images, speech, and text. It is seeing adoption in the HPC community as a tool for large-scale data processing. - 12 -

Talk Overview Data analytics tools Data insight Scale your analysis - 13 -

R and RStudio Service(Beta) r.nersc.gov

ipythonnotebook Service(Beta) ipython.nersc.gov

Talk Overview Data analytics tools Data insight Scale your analysis - 16 -

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. - 17 -

Easy to Prototype Interactive shells make it easy to explore your data Interactively debug your analyses Works with ipython notebooks Easy to Run High-Level API using map-reduce paradigm Implement all your analyses in a few lines of Python Scala Java - 18 -

Computation Type Spark Implementation Machine Learning MLlib, Spark ML Graph Computations GraphX Database Operations Spark SQL Streaming Analysis Spark Streaming Your Own Custom Analysis Using Spark s Built In Functions Computation types can be combined seamlessly all in the same piece of code! - 19 -

SciDB For High Usability Big Data Analytic Why? It s painful to manage and analyze terabytes of data. Need a unified solution that s easy to use. What? SciDB is a parallel database for array-structured data, great for Terabytes of: Time series, spectrums, imaging, etc The greatest benefit of SciDB is: Usability: Use HPC hardware without learning parallel programming and parallel I/O. - 20 - SciDB Distribute a big array on many nodes

NERSC Data Analytic Services Production Data Services Big and Diverse Computing Facility 6000+ Users, 700+ Projects 3+ PetaFlops (20+pf more coming) 50+ PB Storage - 21 - Science Engagement

Thank you. - 22 -

Cori: Unified architecture for HPC and Big Data 64 Cabinets of Cray XC System 50 cabinets Knights Landing manycore compute nodes 10 cabinets Haswell compute nodes for data partition ~4 cabinets of Burst Buffer 14 external login nodes Aries Interconnect (same as on Edison) Lustre File system 28 PB capacity, 432 GB/sec peak performance NVRAM Burst Buffer for I/O acceleration Significant Intel and Cray application transition support Delivery in mid-2016; installation in new LBNL CRT - 23 -

NERSC Systems Hopper: 1.3PF, 212 TB RAM 2.2 PB Local Scratch 70 GB/s Cray XE6, 150K Cores Edison: 2.5PF, 357 TB RAM 48 GB/s 6.4 PB Local Scratch 140 GB/s 80 GB/s 80 GB/s /global/scr atch 4 PB 50 GB/s /project 5 PB 5 GB/s /home 12 GB/s HPSS Cray XC30, 130K Cores Sponsored Compute Systems Carver, PDSF, JGI, KBASE, HEP 8 x FDR IB Ethernet & IB Fabric Science Friendly Security Production Monitoring Vis & Analytics, Data Transfer Nodes, Adv. Arch., Science Gateways 2 x 10 Gb 1 x 100 Gb Power Efficiency WAN - 24 - Science Data Network 250 TB 45 PB stored, 240 PB capacity, 40 years of community data

5 V s of Scientific Big Data Science Domain Variety Volume Velocity Veracity Astronomy Multiple Telescopes, multi-band/spectra O(100) TB 100 GB/night 10 TB/night Noisy, acquisition artefacts Light Sources Multiple imaging modalities O(100) GB 1 Gb/s-1 Tb/s Noisy, sample preparation/acquisition artefacts Genomics Sequencers, Massspec, proteomics O(1-10) TB TB/day Missing data, errors HEP: LHC, Daya Bay Multiple detectors O(100) TB O(10) PB 1-10 PB/s reduced to GB/s Noisy, artefacts, spatiotemporal Climate Simulations Multi-variate, spatio-temporal O(10) TB 30-70 GB/s Clean, need to account for multiple sources of uncertainty - 25 -

DOE Facilities are Facing a Data Deluge Genomics Climate Astronomy Physics Light Sources - 26 -