Development of Earth Science Observational Data Infrastructure of Taiwan. Fang-Pang Lin National Center for High-Performance Computing, Taiwan

Similar documents
Development of Big Data Infrastructure in NCHC: From Sensing to Understanding

Big Data Applications in Disaster Management

EOFS Workshop Paris Sept, Lustre at exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc.

Oracle Big Data SQL Technical Update

Wrangler: A New Generation of Data-intensive Supercomputing. Christopher Jordan, Siva Kulasekaran, Niall Gaffney

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Migrating NASA Archives to Disk: Challenges and Opportunities. NASA Langley Research Center Chris Harris June 2, 2015

Big Data Analytics Nokia

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Big Data and Cloud Computing for GHRSST

Apache Hadoop FileSystem and its Usage in Facebook

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

IBM Global Technology Services November Successfully implementing a private storage cloud to help reduce total cost of ownership

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Chapter 7. Using Hadoop Cluster and MapReduce

Flexible Scalable Hardware independent. Solutions for Long Term Archiving

SCALABLE FILE SHARING AND DATA MANAGEMENT FOR INTERNET OF THINGS

Outline. What is Big data and where they come from? How we deal with Big data?

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

PART 1. Representations of atmospheric phenomena

BIG DATA What it is and how to use?

SURFsara Data Services

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Lesson 1: Introduction to Data Management Why Data Management?

巨 量 資 料 分 層 儲 存 解 決 方 案

Protecting Big Data Data Protection Solutions for the Business Data Lake

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

NASA s Big Data Challenges in Climate Science

Big Data in Enterprise challenges & opportunities. Yuanhao Sun 孙 元 浩 yuanhao.sun@intel.com Software and Service Group

Ganzheitliches Datenmanagement

Challenges for Data Driven Systems

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, XLDB Conference at Stanford University, Sept 2012

BIG Big Data Public Private Forum

Bringing Big Data Modelling into the Hands of Domain Experts

How Does the Cloud Fit into Active Archiving. Active Archive Alliance Panel

Building your Big Data Architecture on Amazon Web Services

THE EMC ISILON STORY. Big Data In The Enterprise. Copyright 2012 EMC Corporation. All rights reserved.

Hadoop & its Usage at Facebook

Big + Fast + Safe + Simple = Lowest Technical Risk

irods in complying with Public Research Policy

A standards-based open source processing chain for ocean modeling in the GEOSS Architecture Implementation Pilot Phase 8 (AIP-8)

NetCDF and HDF Data in ArcGIS

Architecting for the Internet of Things & Big Data

Global Scientific Data Infrastructures: The Big Data Challenges. Capri, May, 2011

Diagram 1: Islands of storage across a digital broadcast workflow

Data Centric Computing Revisited

Big Data and Analytics: Challenges and Opportunities

Tier 2 Nearline. As archives grow, Echo grows. Dynamically, cost-effectively and massively. What is nearline? Transfer to Tape

Geospatial Imaging Cloud Storage Capturing the World at Scale with WOS TM. ddn.com. DDN Whitepaper DataDirect Networks. All Rights Reserved.

Luc Declerck AUL, Technology Services Declan Fleming Director, Information Technology Department

Adding Robust Digital Asset Management to Oracle s Storage Archive Manager (SAM)

So What s the Big Deal?

Scala Storage Scale-Out Clustered Storage White Paper

Hadoop & its Usage at Facebook

BIG DATA TRENDS AND TECHNOLOGIES

Bringing Strategy to Life Using an Intelligent Data Platform to Become Data Ready. Informatica Government Summit April 23, 2015

DDN updates object storage platform as it aims to break out of HPC niche

Design and Evolution of the Apache Hadoop File System(HDFS)

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Big Data Storage Challenges for the Industrial Internet of Things

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

August Transforming your Information Infrastructure with IBM s Storage Cloud Solution

The distribution of marine OpenData via distributed data networks and Web APIs. The example of ERDDAP, the message broker and data mediator from NOAA

Hadoop. Sunday, November 25, 12

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

Big data management with IBM General Parallel File System

LONG TERM RETENTION OF BIG DATA

Applying Apache Hadoop to NASA s Big Climate Data!

Increase Agility and Reduce Costs with a Logical Data Warehouse. February 2014

Databases & Web Applications Lab Big Data Project A

XSEDE Data Analytics Use Cases

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Solution Brief: Archiving Avid Interplay Projects using NLT and XenData

bigdata Managing Scale in Ontological Systems

Oracle Big Data Strategy Simplified Infrastrcuture

EMC IRODS RESOURCE DRIVERS

HPC technology and future architecture

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

A Service for Data-Intensive Computations on Virtual Clusters

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Archive Data Retention & Compliance. Solutions Integrated Storage Appliances. Management Optimized Storage & Migration

Analisi di un servizio SRM: StoRM

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Luncheon Webinar Series May 13, 2013

Using Message Brokering and Data Mediation to use Distributed Data Networks of Earth Science Data to Enhance Global Maritime Situational Awareness.

EMC BACKUP MEETS BIG DATA

The Flash-Transformed Financial Data Center. Jean S. Bozman Enterprise Solutions Manager, Enterprise Storage Solutions Corporation August 6, 2014

and the world is built on information

Archiving: Session ID: More Than Just Compliance. Frank Orlando

Cloud Based Application Architectures using Smart Computing

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Transcription:

Development of Earth Science Observational Data Infrastructure of Taiwan Fang-Pang Lin National Center for High-Performance Computing, Taiwan GLIF 13, Singapore, 4 Oct 2013

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: Modeling from hypothesis to discovery Data dominate

Ecogrid: Sense the nature in a new way (10 years on)

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: Modeling from hypothesis to discovery Data dominate

Global Lake Ecological Observational Network (GLEON) Lakebase: harvest quality data from internet and collect more than ~25,000 lakes across the world. Global compute service through CONDOR >10 major real time observational data from selected GLEON sites. Source: Pau Hanson 5

Fish4Knowledge human level query for Marine Biology U. of Edinburgh (UK) Video Data: 10 camera years = 10 cameraframes = 112 Tb massive data storage Descriptive data: 20 Tb Fish: 10^10 detections Summary data: 500 Gb Target: 1 sec query answering Centrum Wiskunde & Informatica (Netherlands) UCATANIA (Italy) NCHC (Taiwan) NCHC in Taiwan: sustainable system for data acquisition, storage, and computing. U. of Catania in Italy: fish detection and tracking. U. of Edinburgh in UK: workflow, fish recognition, and fish behavior. CWI in Netherlands: user interfaces. 6

Infrastructure enables Fish4Knowledge Cloud of Resources Clients Storage platform Service Frontend Video Server Data Source Computing platform Storage Nodes Computing Nodes Marine Biologists Application Developers 2013/10/6 7

Work on Linking Open Data (e.g. FAO)

Video Classification (~350K videos from 2009 to 2013) Algae: 9.2% Blurred: 33.5% Complex Scenes: 4.3% Encoding: 23.9% Highly Blurred: 13.9% Normal: 12.9% Unknown: 2.2% 9

The SEATANK Event & Other SEATANK event It happened again! Require scientific evidences for Taiwan to win the lawsuit. (TORI, NSPO, NCHC) 2013.1 Freighter SEATANK stranded in Penhu water 2006 Cargo ship Tzini stranded in Yilan water and leaks >100 tons fuel Oil Other Oceanic temperature data of TW waters only available in JP, if it is before 1990 K.C. Shao. 10

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: (Data change the game!) Modeling from hypothesis to discovery Data dominate

CC image by Sharyn Morrow on Flickr Natural disaster Facilities infrastructure failure Storage failure Server hardware/software failure Application software failure External dependencies (e.g. PKI failure) Format obsolescence Legal encumbrance Human error Malicious attack by human or automated agents Loss of staffing competencies Loss of institutional commitment Loss of financial stability Changes in user expectations and requirements Source: http://blog.alltop.com.tw/story/archives/3278 Source: DataONE

Info. content Time for Publication Specific Details Data Management General Details Retirement or Career Change Accident Death Time (From Michener et al 1997)

The Treasure of NARLabs: Earth Science Observational Data 14

Big Data Infrastructure Challenge in ESOD Data Features: high complexity, large scale, frequency, real-time and stream Big Data Integration of images & data of F2-3 (NSPO), Marine Environmental Databank (TORI), Atmospheric Research Databank (TTFRI), Engineering Geological Database for TSMIP (NCREE) of NARLabs. Data Size: 155TB/yr. (Actual data size will 2~3x) Currently, 103 TB for Q1. Simulation data 200TB from TTFRI. Real time, High frequency Data: Process > 18,000 records/sec. Currently 6,000 records/sec. 15

ESOD Big Data Service ~ 1PB External data sources Data-intensive computing system (e.g. Hadoop) Parallel data server Parallel compute server Parallel query server Continuous query stream Query Result Continuous query results External client d 1 d 2 d 3 Source dataset Derived datasets Parallel file system (e.g., GFS, HDFS, GPFS) Virtually Centralized Resources Source: David O Hallaron Characteristics: Small queries and results Massive data and computation performed on server Examples: Search Photo scene completion Log processing Science analytics 16

ESOD Hardware Architecture Challenges of use of distributed resources Exmaples of Datasets 1 TTFRI 3 1 3 NCREE NSPO 1 2 1G - 10G Fiber TORI 8G Optical Fiber WindRider: 40 G infinitband (Lustre FS) GPFS Preload/Stage ESOD Storage 1 2 1 2 3 NCHC 10G Bridge Gateway GPFS Disk 918TB Tape 433TB 17

ESOD Data Archiving and System Infrastructure HAProxy https ftps NCHC firewall Failover between 3 sites. Automatic load balancing. Transparent file systems: direct read/write of files across different sites. Automatic storage backup. Load balance server Slave Load balance server Master High IOPS WindRider Formosa 2/3/5 Paralllel DBMS Scale out > 1 PB 2 X 10GE 2 X 10GE 2 X 10GE HSM General Parallel File System Tape library HS M Tape library General Parallel File System HSM General Parallel File System Tape library 18 18

ESOD: Data discovery Users Data server Data rule engine(polices) Metadata Catalog DATA DATA DATA DATA Hardware layer (disk, tape, etc) User access to single united data warehouse User request for data Request goes to data server Server looks up information in catalog Catalog tells where the data physically located Server applied rules and serve data 19

Data discovery: Metadata Catalog Use relational database (mysql) with multi-dimensional schema design to speed up searching (migrating to NOSQL solutions now) System Metadata User name space - Address / e-mail / telephone number - Role (administrator, curator, user) - File name space Example: geonetwork - Creation date / size / location / checksum - Owner / access controls Storage resource name space - Capacity / quotas / Type (archive, disk, fast cache) Domain Metadata User-given metadata - Key-Value-Unit Triplets, Annotation - Relational / XML Metadata - Domain-specific Schema Adopt OGC standards 20

ESOD: Smart query and answer Develop a set of control vocabulary based on int l standards, e.g. HDF, NetCDF, OGC etc. Derive RDF triple dataset from common query tasks of ESOD. Combination of visual data plus metadata to support a specific high-level information seeking user task. Design of an interface for graph & visual comparison search. Selected one specific task, that of comparing sets of objects, and designed a prototype interface on top of linked data sets used by the experts to support this task explicitly Develop methods for data provenance. 21

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: (Data change the game!) Modeling from hypothesis to discovery Data dominate Future Key issue: Move Big Data

100G TWAREN/TANET/AS 23