NASA's Strategy and Activities in Server Side Analytics

Similar documents
NASA s Big Data Challenges in Climate Science

High Performance Science Cloud! Meeting the Big Data Challenges of Climate Science!

ABoVE Science Cloud: An orientation for the ABoVE Science Team

Performance Analysis of a Numerical Weather Prediction Application in Microsoft Azure

SR-IOV In High Performance Computing

Data Semantics Aware Cloud for High Performance Analytics

Big Data and Cloud Computing for GHRSST

The ORIENTGATE data platform

The Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT): A Vision for Large-Scale Climate Data

Microsoft Research Windows Azure for Research Training

Microsoft Research Microsoft Azure for Research Training

Ganzheitliches Datenmanagement

Applying Apache Hadoop to NASA s Big Climate Data!

Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise

Migrating NASA Archives to Disk: Challenges and Opportunities. NASA Langley Research Center Chris Harris June 2, 2015

Cluster, Grid, Cloud Concepts

Cloud Computing Architecture with OpenNebula HPC Cloud Use Cases

JASMIN Cloud ESGF and UV- CDAT Conference December 2014 STFC / Stephen Kill

Course 20533: Implementing Microsoft Azure Infrastructure Solutions

XSEDE Data Analytics Use Cases

A standards-based open source processing chain for ocean modeling in the GEOSS Architecture Implementation Pilot Phase 8 (AIP-8)

Big Data and the Earth Observation and Climate Modelling Communities: JASMIN and CEMS

Getting Started Hacking on OpenNebula

Enabling Science in the Cloud: A Remote Sensing Data Processing Service for Environmental Science Analysis

Implementing Microsoft Azure Infrastructure Solutions

Simplifying Big Data Deployments in Cloud Environments with Mellanox Interconnects and QualiSystems Orchestration Solutions

Manjrasoft Market Oriented Cloud Computing Platform

ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P.

Intro to Virtualization

Managing Traditional Workloads Together with Cloud Computing Workloads

Agenda. Big Data & Hadoop ViPR HDFS Pivotal Big Data Suite & ViPR HDFS ViON Customer Feedback #EMCVIPR

How To Use Arcgis For Free On A Gdb (For A Gis Server) For A Small Business

NCDC Strategic Vision

SCIENCE DATA ANALYSIS ON THE CLOUD

On Demand Satellite Image Processing

Parallel Analysis and Visualization on Cray Compute Node Linux

Exploration of adaptive network transfer for 100 Gbps networks Climate100: Scaling the Earth System Grid to 100Gbps Network

OpenNebula An Innovative Open Source Toolkit for Building Cloud Solutions

The Arctic Observing Network and its Data Management Challenges Florence Fetterer (NSIDC/CIRES/CU), James A. Moore (NCAR/EOL), and the CADIS team

The Impact of PaaS on Business Transformation

Cloud JPL Science Data Systems

Manjrasoft Market Oriented Cloud Computing Platform

PART 1. Representations of atmospheric phenomena

w w w. u l t i m u m t e c h n o l o g i e s. c o m Infrastructure-as-a-Service on the OpenStack platform

OpenNebula Cloud Platform for Data Center Virtualization

IBM Cloud Security Draft for Discussion September 12, IBM Corporation

HYCOM Meeting. Tallahassee, FL

Parabon Frontier. Enterprise Computing Platform. Featuring Frontier CODE. Collaborative Online Development Environment

Oracle Big Data SQL Technical Update

Implementing Microsoft Azure Infrastructure Solutions 20533B; 5 Days, Instructor-led

André Karpištšenko, Co-Founder & Chief Scientist, Marinexplore Strata,

Course 20533B: Implementing Microsoft Azure Infrastructure Solutions

Supercomputing on Windows. Microsoft (Thailand) Limited

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

CMIP6 Data Management at DKRZ

Powered by VCL - Using Virtual Computing Laboratory (VCL) Technology to Power Cloud Computing

StormSurgeViz: A Visualization and Analysis Application for Distributed ADCIRC-based Coastal Storm Surge, Inundation, and Wave Modeling

Virtual Climate Data Server - A Case Study

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

Vision. South Pacific GIS/RS Conference /17/2015. Applying Geography Everywhere. Applying Geography Everywhere

Appendix to; Assessing Systemic Risk to Cloud Computing Technology as Complex Interconnected Systems of Systems

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

CLOUD TECH SOLUTION AT INTEL INFORMATION TECHNOLOGY ICApp Platform as a Service

1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India

The Greenplum Analytics Workbench

SURFsara HPC Cloud Workshop

Big Data Trends and HDFS Evolution

San Jose State University

How To Write An Nccwsc/Csc Data Management Plan

Taking the cloud to your datacenter

Software. Enabling Technologies for the 3D Clouds. Paolo Maggi R&D Manager

REACCH PNA Data Management Plan

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System

The path to the cloud training

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Grid vs. Cloud Computing

OpenNebula Open Souce Solution for DC Virtualization

THE EUCALYPTUS OPEN-SOURCE PRIVATE CLOUD

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Matsu Workflow: Web Tiles and Analytics over MapReduce for Multispectral and Hyperspectral Images

OpenStack Introduction. November 4, 2015

2) Xen Hypervisor 3) UEC

Norwegian Satellite Earth Observation Database for Marine and Polar Research USE CASES

Share and aggregate GPUs in your cluster. F. Silla Technical University of Valencia Spain

HPC-related R&D in 863 Program

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Stanislav Ulrych CTO RED HAT ENTERPRISE LINUX OPENSTACK PLATFORM

Big Data and Advanced Analytics Technologies for the Smart Grid

Cloud Computing through Virtualization and HPC technologies

HPC Cloud Computing Guide PENGUIN ( ) HPC

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

CLOUD COMPUTING. When It's smarter to rent than to buy

The distribution of marine OpenData via distributed data networks and Web APIs. The example of ERDDAP, the message broker and data mediator from NOAA

OpenNebula Open Souce Solution for DC Virtualization

Silviu Panica, Marian Neagul, Daniela Zaharie and Dana Petcu (Romania)

IDL. Get the answers you need from your data. IDL

SURFsara HPC Cloud Workshop

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments

Relational Databases in the Cloud

Transcription:

NASA's Strategy and Activities in Server Side Analytics Tsengdar Lee, Ph.D. High-end Computing Program Manager NASA Headquarters Presented at the ESGF/UVCDAT Conference Lawrence Livermore National Laboratory December 9, 2014 1

The Challenges ESGF is Facing SCIENTIFIC! Documenting the status and behavior of the Earth system and its multiple, interacting components! Documenting the evolution of the Earth system and providing understanding of the sources of that evolution! Supporting the projection of the future evolution of the Earth system! Making Earth system science data easily available to users for both scientific and societal purposes GROWNING DATA VOLUME!! Satellite Data: consider a global imager with 250 m resolution measuring once per day at 30 wavelengths for a year - ~ 10 14 pixels/year!! Model Output: consider a chemistry/climate model, with 1 o x1 o resolution and 50 layers, writing out 30 parameters at hourly intervals for a year - ~ 10 12 results written/year COMMUNITY! Research Community: scientific researchers looking to answer fundamental questions about the Earth! Assessment Community: researchers of all types looking to document information about prior and future evolution of the Earth system to inform long-term policy and decision making! Forecasting Community: operational scientists and others looking to provide forecasts to the general public! Applications Community: research, corporate, and nongovernmental organizations looking to inform nearer term decisions for management and planning 2 2

Typical ESGF Data Analysis and Data Processing Work Loads! A scientist or engineer queries a metadata server for the data and orders the data from a data center.! The data center fulfills the order by preparing (subsetting, resampling, averaging etc.) the data and puts the result on a FTP server.! After receiving a notification from the data center, the investigator goes to the FTP server and fetches the data.! Data is transmitted to the investigator s institution and stored on a local storage.! The investigator processes and analyzes the data locally using local computing resources.! Some of the processed data will have to transmitted back to the data center. 3

7-km GEOS-5 Nature Run Global Tropical Cyclones! Nature run is used in Observation System Simulation Experiment. This GEOS-5 Nature Run successfully reproduces typical tropical cyclone activity in all basins including a large number of weak tropical storms as well as major hurricanes and typhoons.! This 2 years simulation was done with 7200 cores for 75 days and generated 2 PB of simulation output. Tropical Storm winds 39-73 mph Hurricane winds 74-111 mph Major Hurricane winds 112+ mph In this short period from September 7-12, 2006 during the GEOS-5 7-km Nature Run, two hurricanes spin through the east Pacific basin while a major Atlantic hurricane develops in the Gulf of Mexico making landfall along the US Gulf coast as a category 3 hurricane with winds (color shading) in excess of 110 mph. 4

Use cases for Scientific Information System "!Scientists and engineers often use computing services to perform data analysis, theory verification, and predictions. "!Often move large volume of data to and from data centers and to and from compute centers. "!Need to communicate, collaborate, and share data with external (e.g. university) investigators. "!Require high speed connections and high speed computing platforms beyond business administration requirements for transferring large files. "!Require local disk storage and visualization HW and SW. 5

Analogy and Challenges Analogy: Challenges:! Stewardship! Curation! Indexing! Cataloging! Searching! Ordering! Subsetting! Provenance! Lineage! Data Mining! Dissemination 6

Functional Architecture 7

Challenges in Server Side Challenges:! Remote and local data visualization! Server side processing capacity! Data Mining! Machine Learning! Distributed data analysis! Data on-boarding! ETL! High speed network! Data management! Data storage Analytics 8

Notional Architecture Collaborative Workbench to Accelerate Science Algorithm Development CUPID: An IDE for Model Development and Modeler Training A Community-Driven Workflow Recommendations and Reuse Infrastructure Parallel Web-Service Climate Model Diagnostic Analyzer Integrating Parallel and Distributed Data Mining Algorithms into NASA Earth Exchange (NEX) Next Generation Cyberinfrastructure to Support Comparison of Satellite Observations with Climate Models ESG MERRA Analytic Services irod OpenDA P Hadoop Cloud Enabled Scientific Collaborative Research Environment (CESCRE) Accelerating Earthquake Simulations on General- Purpose Graphics Processors Data Storage 9

ABoVE Science Cloud" 10

ABoVE Science Cloud" "!Make use of the NCCS High Performance Science Cloud (HPSC)"!Unified Data Analysis Platform that provides a colocation of data, compute, data management, and data services"!low barrier to entry for scientists; customized run time environments; agile environment" "!Joint activity of the CCE, CISTO, and the NCCS"!High performance data and compute for ABoVE scientists, algorithms, models, observations, analytics" "!Data storage surrounded by a compute cloud"!large amount of data storage, high performance compute capabilities, very high speed interconnects" High Performance Compute Cloud Data Cloud Agile Compute Processing Persistent Data Services 11 11

User Software Stack" Software Package Comments Operating System - Linux Operating System - Windows Open Source Proprietary NCCS standard support May require licenses for long term support IDL and Matlab Proprietary Will require licenses; how many and what packages are required? ArcGIS Desktop, Server, Portal R, Python, HDF (4 and 5), GDAL, Geotiff, NetCDF Data Management - Ramadda Data Management irods Proprietary Open Source Open Source Open Source NASA wide license NCCS standard support based on scientists needs Could use the science cloud as a platform to perform an analysis of alternatives for data management NCCS supported platform ABoVE Science Cloud 12 12

Example Software Stacks and Responsibilities" Application Custom Science App Application Persistent Data Service Commercial IDL, Matlab Commercial IDL, Matlab Libraries/Tools e.g. R, Python, HDF, GDAL, Geotiff, NetCDF Libraries/Tools e.g. R, Python, HDF, GDAL, Geotiff, NetCDF File System RedHat Gluster File System RedHat Gluster Interconnect OpenIB and TCP/IP Interconnect OpenIB and TCP/IP Guest VM Linux (Debian, Centos, Windows) Guest VM Linux (Debian, Centos, Windows) File System RedHat Gluster Responsibilities Hypervisor KVM Scientists Internal Interconnect Open Fabrics Infiniband (OpenIB) NCCS Operating System CentOS (Linux) ABoVE Science Cloud 13 13

Future Directions and Challenges "! Scale with Big Data produced by higher resolution models, satellites, and instruments "! Expand server-side functionality "! Server-side processing through WPS (climate indexes, custom algorithms); GIS mapping services (for climate change impact studies at regional and local scale); Facilitate model to observations inter-comparison "! Due to the distributed data centers, new algorithms will be necessary when there is only partial sample at each of the data centers "! Expand direct client access capabilities "! Increased support for remote data access; Track provenance of complex processing workflows for reproducibility and repeatability "! Package VMs for Cloud deployment "! Instantiate data grid nodes on demand for short lifetime projects; Environment with elastic allocation of back-end storage and computing resources 14

Open Source Strategy 15

Thought about ESGF Governance "!If it is a federation, it needs to be governed by the members of the federation. "!It needs to allow autonomous processes independent of funding agencies. "!Current governance model is good in the VERY near term. "!The ESGF community and the funding agencies need to think about how this may be hand off to the community. "!We should probably take a look of some hybrid models (e.g. ESIP). 16

Final Thoughts "!ESGF needs to co-exist with other systems. "!WGCM needs to recognize there are other communities out there. "!We need to figure out how to leverage public cloud computing architecture. 17

Thank You! Tsengdar Lee, Ph.D. High-end Computing Program Manager Weather Focus Area Program Scientist NASA Headquarters tsengdar.lee@nasa.gov 18