Big Data Research at DKRZ


Michael Lautenschlager and colleagues from DKRZ and the Scientific Computing Research Group
Symposium "Big Data in Science", Karlsruhe, October 7th, 2014

Big Data in Climate Research
- Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. (Wikipedia)
- Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. (Snijders, C., Matzat, U., & Reips, U.-D. (2012). "Big Data": Big gaps of knowledge in the field of Internet. International Journal of Internet Science, 7, 1-5.)
- Gartner's "3Vs" model for describing big data: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). (Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety". Gartner. Retrieved 6 February 2001.)
- Big Data in climate research:
  - Satellite data from Earth observation (2 V)
  - Climate model data (1.5 V)
  - Data from observational networks (e.g. tsunamis and earthquakes) (3 V)

Outline
- The German Climate Computing Center (DKRZ)
- Infrastructure at DKRZ: present and future
- Climate model data characteristics
- DKRZ's HPC infrastructure developments
- Climate model data production workflow
- Future developments

The German Climate Computing Center (DKRZ)
- Founded in 1987 as a national institution
- Operated as a non-profit limited company with four shareholders:
  - Max Planck Society for Research (55%)
  - The City of Hamburg, represented by the University of Hamburg (27%)
  - Alfred Wegener Research Institute in Bremerhaven (9%)
  - Helmholtz Center for Research in Geesthacht (9%)

Mission
DKRZ: Partner for Climate Research
Maximum Compute Performance. Sophisticated Data Management. Competent Service.

Climate Modelling Support at DKRZ
Basic Workflows:
- Climate Model Development
- Climate Model Data Production

HLRE-2: Architecture
- Current machine: IBM Power6
- 8,500 cores, 158 TFLOPS, 20 TB main memory
- 6 PB disk space, 100 PB tape archive space
- In 2015: replacement by HLRE-3

HLRE-2 to HLRE-3
Increases every ten years:
- Processor speed: 500x
- Disk capacity: 100x
- Disk speed: 11x (17x with SSDs)
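The ten-year factors above can be turned into yearly growth rates, which makes the widening gap between compute and disk bandwidth easier to see. The following is a minimal sketch that assumes constant compound growth over the decade (an assumption for illustration, not a statement from the slides); the function and variable names are hypothetical.

```python
def annual_factor(decade_factor: float, years: float = 10.0) -> float:
    """Return the constant yearly factor that compounds to decade_factor over `years`."""
    return decade_factor ** (1.0 / years)


# Ten-year increase factors as stated on the slide.
DECADE_FACTORS = {
    "processor speed": 500.0,
    "disk capacity": 100.0,
    "disk speed": 11.0,
    "disk speed (with SSDs)": 17.0,
}

# Assumption: growth is spread evenly (compounded) over the ten years.
ANNUAL_FACTORS = {name: annual_factor(f) for name, f in DECADE_FACTORS.items()}

# How much faster compute grows than disk bandwidth within one decade.
COMPUTE_TO_DISK_SPEED_GAP = (
    DECADE_FACTORS["processor speed"] / DECADE_FACTORS["disk speed"]
)

if __name__ == "__main__":
    for name, factor in ANNUAL_FACTORS.items():
        print(f"{name}: x{factor:.2f} per year")
    print(f"compute vs. disk speed gap per decade: x{COMPUTE_TO_DISK_SPEED_GAP:.1f}")
```

Under this reading, disk speed is the slowest-growing resource, which is consistent with the later slides on compression and parallel I/O.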

HLRE-2: Data Archive
Focus: climate model data and related observations
Content (August 2014):
- HPSS: 36 PB
- WDCC (World Data Centre Climate): 4 PB (fully documented)
Annual growth:
- HPSS: 8 PB/year (HLRE-3: 75 PB/year)
- WDCC: 0.5-1.5 PB/year (HLRE-3: 8 PB/year)
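To get a feeling for these rates, the sketch below estimates how long a tape archive of a given size lasts at a constant fill rate. It assumes linear growth and uses the 100 PB tape archive space quoted for HLRE-2. Applying the HLRE-3 rate to the HLRE-2 capacity is purely illustrative, since the slides do not state the HLRE-3 archive capacity. All names are hypothetical.

```python
def years_until_full(capacity_pb: float, used_pb: float, growth_pb_per_year: float) -> float:
    """Years until the archive is full, assuming a constant (linear) fill rate."""
    if growth_pb_per_year <= 0:
        raise ValueError("growth_pb_per_year must be positive")
    return max(capacity_pb - used_pb, 0.0) / growth_pb_per_year


TAPE_CAPACITY_PB = 100.0   # HLRE-2 tape archive space (from the architecture slide)
HPSS_USED_PB = 36.0        # HPSS content in August 2014
HLRE2_RATE_PB_PER_YEAR = 8.0
HLRE3_RATE_PB_PER_YEAR = 75.0

YEARS_AT_HLRE2_RATE = years_until_full(TAPE_CAPACITY_PB, HPSS_USED_PB, HLRE2_RATE_PB_PER_YEAR)
# Illustrative only: HLRE-3 fill rate against the HLRE-2 capacity.
YEARS_AT_HLRE3_RATE = years_until_full(TAPE_CAPACITY_PB, HPSS_USED_PB, HLRE3_RATE_PB_PER_YEAR)

if __name__ == "__main__":
    print(f"At  8 PB/year: {YEARS_AT_HLRE2_RATE:.1f} years of headroom")
    print(f"At 75 PB/year: {YEARS_AT_HLRE3_RATE:.2f} years of headroom")
```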

Climate Modelling Example
(Slide material from: CMIP5 Release WS, MPI-M/DKRZ, Feb. 2012, and CAS2K13; Lautenschlager, DKRZ)

Data Amounts CMIP3/CMIP5
CMIP3 / IPCC-AR4 (Report 2007)
- Participation: 17 modelling centres with 25 models
- In total 36 TB of model data stored centrally at PCMDI and ca. ½ TB in the IPCC DDC at WDCC/DKRZ as reference data
CMIP5 / IPCC-AR5 (Report 2013/2014)
- Participation: 29 modelling groups with 61 models
- Produced data volume: ca. 10 PB, with 640 TB from MPI-ESM
- CMIP5 requested data volume: ca. 2 PB (in the CMIP5 data federation)
- Part of WDCC/DKRZ: data volume for IPCC DDC: 1.6 PB (complete quality assurance process), with 60 TB from MPI-ESM
- Status of the CMIP5 federated data archive (August 2014): 2.3 PB for 69,000 data sets stored in 4.3 million files on 23 data nodes
CMIP5 data is more than 50 times CMIP3
Extrapolation for the CMIP6 data federation:
- Volume: 150 PB
- Number of files: 280 million
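The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slides do not specify, and uses hypothetical variable names. It computes the mean file size in the CMIP5 federation and in the CMIP6 extrapolation, and the CMIP5-to-CMIP3 volume ratio.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def mean_file_size(volume_bytes: float, n_files: float) -> float:
    """Average file size in bytes."""
    return volume_bytes / n_files


# Values as stated on the slide.
cmip3_volume = 36 * TB
cmip5_federation_volume = 2.3 * PB
cmip5_files = 4.3e6
cmip6_volume = 150 * PB
cmip6_files = 280e6

cmip5_mean = mean_file_size(cmip5_federation_volume, cmip5_files)
cmip6_mean = mean_file_size(cmip6_volume, cmip6_files)
cmip5_over_cmip3 = cmip5_federation_volume / cmip3_volume
cmip6_over_cmip5 = cmip6_volume / cmip5_federation_volume

if __name__ == "__main__":
    print(f"CMIP5 mean file size: {cmip5_mean / 1e6:.0f} MB")
    print(f"CMIP6 mean file size: {cmip6_mean / 1e6:.0f} MB")
    print(f"CMIP5 / CMIP3 volume: x{cmip5_over_cmip3:.0f}")
    print(f"CMIP6 / CMIP5 volume: x{cmip6_over_cmip5:.0f}")
```

With these inputs the mean file size comes out nearly identical for CMIP5 and the CMIP6 extrapolation, so the extrapolation appears to scale volume and file count by about the same factor.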

Research: More efficient disk usage
- File system: Lustre-ZFS
- lz4 compression increases energy consumption by max. 1%
- But provides compression of 30% on average
- Potentially applicable to HLRE-3
MDS: metadata server, OSS: object storage server, MDT/OST: metadata / object storage target
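As a rough illustration of what such a compression rate means for capacity, the sketch below converts a space saving into effective capacity. It assumes that "compression of 30%" means stored data shrinks by 30% on average (compressed size = 0.7 x original size); the slide does not define the figure more precisely, and the function name is hypothetical.

```python
def effective_capacity_pb(raw_capacity_pb: float, space_saving: float) -> float:
    """Logical data volume that fits on raw_capacity_pb of disk.

    space_saving is the average fraction by which data shrinks,
    e.g. 0.30 if compressed data needs 70% of the original space.
    """
    if not 0.0 <= space_saving < 1.0:
        raise ValueError("space_saving must be in [0, 1)")
    return raw_capacity_pb / (1.0 - space_saving)


RAW_DISK_PB = 6.0      # HLRE-2 disk space, used here only as an example input
LZ4_SPACE_SAVING = 0.30  # assumption: 30% average reduction in stored size

EFFECTIVE_PB = effective_capacity_pb(RAW_DISK_PB, LZ4_SPACE_SAVING)

if __name__ == "__main__":
    print(f"{RAW_DISK_PB} PB raw holds about {EFFECTIVE_PB:.2f} PB of uncompressed data")
```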

Parallelization on HPC: Scalability
Objective of ICON
- able to conduct production simulations on target grids as large as 10^10 grid elements (10^4 x 10^4 x 400)
- grid spacing finer than 400 m (100 m being the target)
- capable of efficiently using a diversity of advanced high performance computing resources
- scaling of the model to many tens, possibly hundreds, of thousands of cores
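The target grid size translates directly into memory and per-core workload. The sketch below works through that arithmetic for the 10^4 x 10^4 x 400 grid, whose product is 4 x 10^10 elements (the same order of magnitude as the 10^10 quoted). Storing one double-precision value (8 bytes) per grid element is an assumption for illustration, as is the example core count.

```python
# Grid dimensions as stated on the slide.
NX, NY, NZ = 10**4, 10**4, 400
TOTAL_CELLS = NX * NY * NZ

# Assumption: one 8-byte double-precision value per grid element and field.
BYTES_PER_VALUE = 8
FIELD_BYTES = TOTAL_CELLS * BYTES_PER_VALUE


def cells_per_core(cores: int) -> float:
    """Average number of grid elements each core has to handle."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return TOTAL_CELLS / cores


if __name__ == "__main__":
    print(f"grid elements: {TOTAL_CELLS:.1e}")
    print(f"one 3D field:  {FIELD_BYTES / 1e9:.0f} GB")
    # Hypothetical core count in the range named on the slide.
    print(f"elements per core at 100,000 cores: {cells_per_core(100_000):,.0f}")
```

A single three-dimensional field at this resolution is already far larger than the memory of one node, which is one reason the model has to scale across many thousands of cores.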

Parallelization on HPC: R2B07
- Global grid, 20 km, no I/O
- Efficiency: 83%
- Parallel I/O under development
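The slide does not say how the efficiency figure was measured. A common definition for strong scaling compares the achieved speedup with the increase in core count, and the sketch below implements that definition. The timings and core counts in the example are hypothetical values chosen only to show the calculation; they are not measurements from ICON.

```python
def parallel_efficiency(base_cores: int, base_time: float, cores: int, time: float) -> float:
    """Strong-scaling efficiency relative to a baseline run.

    efficiency = (base_time / time) / (cores / base_cores)
    A value of 1.0 means perfect scaling.
    """
    if min(base_cores, cores) <= 0 or min(base_time, time) <= 0:
        raise ValueError("core counts and run times must be positive")
    speedup = base_time / time
    ideal_speedup = cores / base_cores
    return speedup / ideal_speedup


# Hypothetical timings, NOT taken from the presentation.
EXAMPLE_EFFICIENCY = parallel_efficiency(
    base_cores=1000, base_time=830.0, cores=10000, time=100.0
)

if __name__ == "__main__":
    print(f"efficiency: {EXAMPLE_EFFICIENCY:.0%}")
```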

DKRZ offers Services for End-to-End Workflows in Climate Modelling
Basic Workflows:
- Climate Model Development
- Climate Model Data Production
Poster: ISC'14, Leipzig, June 2014
URL: https://www.dkrz.de/pdfs/poster/dataservices_eng.pdf

Climate Model Data Production Workflow
Separation:
- Scientific data analysis
- Long-term archiving

Climate Model Data Production Workflow
1) Data Management Plan: The data timeline as well as volumes, structures, access patterns and storage locations have to be defined as accurately as possible.
2) DKRZ Storage: Each DKRZ HPC project has to specify and apply for compute and storage resources on an annual basis.
3) ESGF Standardization: Climate data integration into ESGF (Earth System Grid Federation) requires standardization in order to make data intercomparable in the federation.
4) ESGF Services: DKRZ offers a number of services to manage, discover and access climate data in the international Earth System Grid Federation (ESGF).
5) LTA DOKU: LTA DOKU stands for in-house long-term archiving in the DOKU(mentation) section of DKRZ's tape archive. The focus here is on internal data access by data providers.
6) LTA WDCC: LTA WDCC stands for long-term archiving in the World Data Center Climate (WDCC). This service is open for data from DKRZ HPC projects and data from ESGF, but also for data from outside DKRZ. These data are fully integrated in the database system of the WDCC. The full set of metadata is provided in order to allow for data interpretation even after ten years or more without contacting the data author. The focus here is on interdisciplinary data access.
7) DataCite Data Publication: DataCite is an international organization which focuses on the publication of scientific data in context with, and to be used in, the scientific literature. After passing the DataCite data publication services, these data entities are irrevocable, have been assigned a DOI (Digital Object Identifier), and key metadata including the citation reference are registered in the DataCite repository. The DOI allows for transparent data access, and the citation reference allows for integration into the reference list of a scientific publication.
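Step 7 relies on the DOI as a stable access path. As a minimal sketch of how a DOI becomes a resolvable link, the function below prepends the public resolver `https://doi.org/` after a basic syntax check. The function name is hypothetical, the example DOI is a placeholder rather than a real WDCC data set, and the check only covers the basic prefix/suffix form.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL for a DOI of the form '10.<registrant>/<suffix>'."""
    doi = doi.strip()
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in URLs, but keep the slash.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(doi_to_url("10.1234/example-dataset"))
```

Because the resolver redirects to wherever the data currently lives, a citation that uses the DOI stays valid even if the archive's internal layout changes.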

Climate Model Data Federation (ESGF)
Service 4
ESGF P2P Architecture: Data Node, Index Node, Compute Node, IdP

ESGF Node
Service 4

LTA: Data ingestion
Services 5 + 6

LTA: Data access
Services 5 + 6
[Diagram labels: Midtier; Appl. Server; TDS (or the like); LobServer; Archive: files; Container: Lobs; Storage@DKRZ; HPSS; DB Layer; CERA; What, Where, Who, When, How]

Future Big Data Developments
- Seamless end-to-end workflow: model data production, analysis, long-term archiving
- Model: code optimization, many-core systems, GPUs
- File systems: compression, parallel I/O, co-design of storage I/O APIs, object store
- Object storage: OpenStack Swift
- Data access: unique identifiers (DOI, Handle System) may replace directory structure
- Network: bandwidth, SLAs, managed networks
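The idea that unique identifiers may replace the directory structure can be sketched as a small lookup layer: users hold on to an identifier, and a catalogue maps it to the current storage location. The class below is a hypothetical in-memory illustration, not a description of DKRZ's or the Handle System's implementation; the identifier and the `swift://` and `hpss://` locations are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PidCatalogue:
    """Maps persistent identifiers to the current storage location of the data."""

    _locations: Dict[str, str] = field(default_factory=dict)

    def register(self, pid: str, location: str) -> None:
        """Assign a new identifier; identifiers are never reused."""
        if pid in self._locations:
            raise KeyError(f"identifier already registered: {pid}")
        self._locations[pid] = location

    def move(self, pid: str, new_location: str) -> None:
        """Data moved (e.g. disk to object store); the identifier stays the same."""
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        self._locations[pid] = new_location

    def resolve(self, pid: str) -> str:
        """Return the current location for an identifier."""
        return self._locations[pid]


if __name__ == "__main__":
    catalogue = PidCatalogue()
    # Placeholder identifier and locations for illustration only.
    catalogue.register("hdl:00.00000/example", "hpss://archive/project/run1/tas.nc")
    catalogue.move("hdl:00.00000/example", "swift://container/object-0001")
    print(catalogue.resolve("hdl:00.00000/example"))
```

The design point is that only the catalogue needs updating when data migrates between file system, tape and object store, while references held by users remain unchanged.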