Development of Big Data Infrastructure in NCHC: From Sensing to Understanding



Similar documents
End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

The Enterprise Data Hub and The Modern Information Architecture

Optimized for the Industrial Internet: GE s Industrial Data Lake Platform

Building a Datacenter Infrastructure to Support Your Big Data Plans

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data and Analytics: Challenges and Opportunities

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

EO Data by using SAP HANA Spatial Hinnerk Gildhoff, Head of HANA Spatial, SAP Satellite Masters Conference 21 th October 2015 Public

Enabling Manufacturing Transformation in a Connected World. John Shewchuk Technical Fellow DX

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

The 4 Pillars of Technosoft s Big Data Practice

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

SCALABLE FILE SHARING AND DATA MANAGEMENT FOR INTERNET OF THINGS

COMP9321 Web Application Engineering

Big Data Analytics Nokia

How To Make Data Streaming A Real Time Intelligence

BIG DATA-AS-A-SERVICE

Big Data, Physics, and the Industrial Internet! How Modeling & Analytics are Making the World Work Better."

Big Data Analytics in Space Exploration and Entrepreneurship

Data Refinery with Big Data Aspects

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Architecting for the Internet of Things & Big Data

Industrial Dr. Stefan Bungart

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Oracle Big Data Strategy Simplified Infrastrcuture

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Big Data Executive Survey

Microsoft Big Data Solutions. Anar Taghiyev P-TSP

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

A New Era Of Analytic

Big Data Challenges and Success Factors. Deloitte Analytics Your data, inside out

BIG DATA + ANALYTICS

How To Use Hp Vertica Ondemand

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Turn your information into a competitive advantage

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Enterprise Application Enablement for the Internet of Things

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

The Future of Data Management

BIG DATA What it is and how to use?

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

HDP Hadoop From concept to deployment.

Luncheon Webinar Series May 13, 2013

Oracle Big Data SQL Technical Update

The IoT Inc Business Meetup Silicon Valley

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data and Cloud Computing for GHRSST

The Internet of Things

The distribution of marine OpenData via distributed data networks and Web APIs. The example of ERDDAP, the message broker and data mediator from NOAA

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

The future of Big Data A United Hitachi View

VIEWPOINT. High Performance Analytics. Industry Context and Trends

2015 Analyst and Advisor Summit. Advanced Data Analytics Dr. Rod Fontecilla Vice President, Application Services, Chief Data Scientist

Ganzheitliches Datenmanagement

Unlocking the Intelligence in. Big Data. Ron Kasabian General Manager Big Data Solutions Intel Corporation

Are You Ready for Big Data?

EOFS Workshop Paris Sept, Lustre at exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc.

Transforming the Telecoms Business using Big Data and Analytics

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Informix The Intelligent Database for IoT

Chapter 7. Using Hadoop Cluster and MapReduce

ANALYTICS IN BIG DATA ERA

Hybrid Cloud Architectures for Operational Performance Management

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Investor Presentation. Second Quarter 2015

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

INVENTING THE FUTURE HITACHI DATA SYSTEMS BIG DATA ROADMAP MICHAEL HAY

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Smart Cities Solution Overview Innovation Center Network, Research & Innovation. SAP SE Reiner Bildmayer

Big Data Are You Ready? Thomas Kyte

Harnessing the Data Flood: Oracle s Visionary Platform from Device to Data Center. Chris Baker Senior Vice President Worldwide ISV/OEM Java Sales

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Build Your Competitive Edge in Big Data with Cisco. Rick Speyer Senior Global Marketing Manager Big Data Cisco Systems 6/25/2015

BIG. Big Data Analysis John Domingue (STI International and The Open University) Big Data Public Private Forum

Johan Hallberg Research Manager / Industry Analyst IDC Nordic Services & Sourcing Digital Transformation Global CIO Agenda

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

NIST Big Data PWG & RDA Big Data Infrastructure WG: Implementation Strategy: Best Practice Guideline for Big Data Application Development

Big Data Comes of Age: Shifting to a Real-time Data Platform

Transcription:

Development of Big Data Infrastructure in NCHC: From Sensing to Understanding Fang-Pang Lin National Center for High-Performance Computing National Applied Research Laboratories, Taiwan ASC 2015

BIG DATA ECOSYSTEM : FROM DATA TO DECISIONS Source: Craig Stires DATA CREATION DATA ACQUISITION INFO PROCESSING BUSINESS PROCESS PRODUCERS ARCHITECTS / ENGINEERS ANALYSTS / SCIENTISTS END USERS Machine and Sensors Shared Nothing Scale-out Storage + SSD Data Exploration Access-Anywhere Analytics Services ON-DEMAND Transaction and Usage Logs Geolocation Mobile Apps Data Email and Messaging VOLUME VELOCITY VARIETY MPP + In-Memory Compute Non-relational DWH Converged Infrastructure Hi-Speed / - Resiliency Networking Hadoop DEEP INSIGHTS REAL-TIME EVENTS Contextualized Data Modeling / Scenarios Forecasting Stream Processing PUSH Context-Aware Business Applications Location-Based Services Alert and Respond Workflow and Interaction Automation VALUE Relationships and Social Influence SYSTEMS INTEGRATION Cloud OBJECTIVES Event Management EMBEDDED DELIVERY MODELS Smart devices and systems 2 2 Copyright IDC (2013)

BIG DATA ECOSYSTEM : FROM DATA TO DECISIONS Source: Craig Stires DATA CREATION DATA ACQUISITION INFO PROCESSING BUSINESS PROCESS PRODUCERS ARCHITECTS / ENGINEERS ANALYSTS / SCIENTISTS END USERS User-gen Text 68% (Data collected) Machine and Sensors Transaction and Usage Logs Geolocation Transactional 67% (Data collected) Mobile Apps Data Email and Messaging Machine or device 41% (Data collected) Relationships and Social Influence VOLUME VELOCITY VARIETY SYSTEMS INTEGRATION Shared Nothing Scale-out Storage + SSD MPP + In-Memory Compute Non-relational DWH Cloud Managing >5TB data 36% Converged Infrastructure Deployed/ing Hadoop 21% Hi-Speed / - Resiliency Networking Keep/discard which data Hadoop 35% (#1 BD Challenge) DEEP INSIGHTS REAL-TIME EVENTS OBJECTIVES Deployed/ing Data Exploration Text Analytics 26% Contextualized Data Modeling / Scenarios Managing Data Quality (#1 IT Forecasting Challenge) Stream Processing Deployed/ing Event Processing 22% Event Management ON-DEMAND PUSH EMBEDDED DELIVERY MODELS Customer Engagement Access-Anywhere Analytics Services [microsegment, net promoter, personalization] (2013 BD Hotspots) Context-Aware Business Applications Geofencing [retail offers, asset Location-Based tracking] Services (GPS Innovations) Alert and Respond Smart Cities Workflow and Interaction Automation [power, water, traffic, safety] (M2M Innovations) Smart devices and systems VALUE 3 3 Copyright IDC (2013)

APeJ BDA Maturity by Country Market Size (USD) Source: Craig Stires Source: IDC APeJ Big Data Maturity Assessment and Benchmark 2013 (n=802) Source: IDC APeJ Big Data Market Analysis & Forecast 2013-2017, Oct 2013 IDC Visit us at IDC.com and follow us on Twitter: @IDC 4

Trends among Big Data leaders Hyper-competition, regulation, analytics sophistication Source: Craig Stires HK - IaaS investments, Telco asset utilization, FSI risk SGP - FSI microsegment, Telco LBS & data monetization, ICA AUS - FSI personalization, Telco psychographic, Data privacy NZ - Emergency svc, SME SaaS, Fleet mgmt, Farm asset mgmt IDC Visit us at IDC.com and follow us on Twitter: @IDC 5

Trends among Big Data midstream Mass urbanization, first-measured, neighborly pressure Source: Craig Stires KOR - M2M automation, Manu QC, Govt NLP audio analytics PRC - Web 2.0 [ABT] xacts, City CCTV, FSI microfinance TWN - Smart Tourism, Data Privacy, Manu process control IND - Citizen registration, banking the unbanked, Big Data insource IDC Visit us at IDC.com and follow us on Twitter: @IDC 6

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: Modeling from hypothesis to discovery Phenomena, Behavior etc

Ecogrid: Sense the nature in a new way (10 years on)

The Path from Infrastructure to Data Sensing for Understanding Sensing: Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: Modeling from hypothesis to discovery Phenomena, Behavior.. etc.

Fish4Knowledge human level query for Marine Biology U. of Edinburgh (UK) Video Data: 10 camera years = 10 cameraframes = 112 Tb massive data storage Descriptive data: 20 Tb Fish: 10^10 detections Summary data: 500 Gb Target: 1 sec query answering Centrum Wiskunde & Informatica (Netherlands) UCATANIA (Italy) NCHC NCHC: sustainable system for data acquisition, storage, and computing. U. of Catania in Italy: fish detection and tracking. U. of Edinburgh in UK: workflow, fish recognition, and fish behavior. CWI in Netherlands: user interfaces. 10

Global Lake Ecological Observational Network (GLEON) Lakebase: harvest quality data from internet and collect more than ~25,000 lakes across the world. Global compute service through CONDOR >10 major real time observational data from selected GLEON sites. Source: Pau Hanson 11

Infrastructure enables Fish4Knowledge Cloud of Resources Clients Storage platform Service Frontend Video Server Data Source Computing platform Storage Nodes Computing Nodes Marine Biologists Application Developers 2015/5/25 12

2015/5/25 Work On Linking Open Data (e.g. FAO) 13

Video Classification (~350K videos from 2009 to 2013) Algae: 9.2% Blurred: 33.5% Complex Scenes: 4.3% Encoding: 23.9% Highly Blurred: 13.9% Normal: 12.9% Unknown: 2.2% 14

The SEATANK Event & Other SEATANK event It happened again! Require scientific evidences for Chinese Taipei to win the lawsuit. (TORI, NSPO, NCHC) 2013.1 Freighter SEATANK stranded in Penhu water 2006 Cargo ship Tzini stranded in Yilan water and leaks >100 tons fuel Oil 15

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: (Data change the game!) Modeling from hypothesis to discovery Phenomena, Behavior etc

CC image by Sharyn Morrow on Flickr Natural disaster Facilities infrastructure failure Storage failure Server hardware/software failure Application software failure External dependencies (e.g. PKI failure) Format obsolescence Legal encumbrance Human error Malicious attack by human or automated agents Loss of staffing competencies Loss of institutional commitment Loss of financial stability Changes in user expectations and requirements Source: http://blog.alltop.com.tw/story/archives/3278 Source: DataONE

Info. content Time for Publication Specific Details Data Management General Details Retirement or Career Change Accident Death Time (From Michener et al 1997)

The Treasure of NARLabs: Earth Science Observational Data 19

Big Data Infrastructure Challenge in ESOD Data Features: high complexity, large scale, frequency, real-time and stream Big Data Data Size: 155TB/yr. (Actual data size will 2~3x) Currently, 103 TB for Q1. Simulation data 200TB from TTFRI. Real time, High frequency Data: Process > 18,000 records/sec. Currently 6,000 records/sec. 20

ESOD Big Data Service ~ 1PB External data sources Data-intensive computing system (e.g. Hadoop) Parallel data server Parallel compute server Parallel query server Continuous query stream Query Result Continuous query results External client d 1 d 2 d 3 Source dataset Derived datasets Parallel file system (e.g., GFS, HDFS, GPFS) Virtually Centralized Resources Source: David O Hallaron Characteristics: Small queries and results Massive data and computation performed on server Examples: Search Photo scene completion Log processing Science analytics 21

ESOD Hardware Architecture Challenges of use of distributed resources Exmaples of Datasets 1 TTFRI 3 1 3 NCREE NSPO 1 2 1G - 10G Fiber TORI 8G Optical Fiber WindRider: 40 G infinitband (Lustre FS) GPFS Preload/Stage ESOD Storage 1 2 1 2 3 NCHC 10G Bridge Gateway GPFS Disk 918TB Tape 433TB 22

ESOD Data Archiving and System Infrastructure HAProxy https ftps NCHC firewall Failover between 3 sites. Automatic load balancing. Transparent file systems: direct read/write of files across different sites. Automatic storage backup. Load balance server Slave Load balance server Master High IOPS WindRider Formosa 2/3/5 Paralllel DBMS Scale out > 1 PB 2 X 10GE 2 X 10GE 2 X 10GE HSM General Parallel File System Tape library HS M Tape library General Parallel File System HSM General Parallel File System Tape library 23 23

Data discovery: Metadata Catalog Use relational database (mysql) with multi-dimensional schema design to speed up searching (migrating to NOSQL solutions now) System Metadata User name space - Address / e-mail / telephone number - Role (administrator, curator, user) - File name space Example: geonetwork - Creation date / size / location / checksum - Owner / access controls Storage resource name space - Capacity / quotas / Type (archive, disk, fast cache) Domain Metadata User-given metadata - Key-Value-Unit Triplets, Annotation - Relational / XML Metadata - Domain-specific Schema Adopt OGC standards 24

Smart query and answer Develop a set of control vocabulary based on int l standards, e.g. HDF, NetCDF, OGC etc. Derive RDF triple dataset from common query tasks of ESOD. Combination of visual data plus metadata to support a specific high-level information seeking user task. Design of an interface for graph & visual comparison search. Selected one specific task, that of comparing sets of objects, and designed a prototype interface on top of linked data sets used by the experts to support this task explicitly Develop methods for data provenance. 25

The Path from Infrastructure to Data Sensing for Understanding Sensing: (Infrastructure-Centric ) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: (Data-Centric) Modeling from hypothesis to discovery Phenomena, Behavior etc in a sophisticated manner.

Government Big Data Government Clouds & Open data Big Data 10 government clouds: Health, Food, e-invoice, Transportation, Environment, Finance, Education, Culture, Disaster Prevention, Geospatial Information, Agriculture and e-government, focusing on societal impact via Research & innovation. 3 kinds of data: Infrastructural Data, Public data and Personal data, open but required protection. Open, but Protected Licensing models: Open Government License (OGL), Non-Commercial Government License, Charged License. Access Platform Facility and Systems to enable Data store, Management and Analytics. Access Control with the licensing models and use scenarios.

Government Big Data NCHC/NARLabs provides Big Data Platform, Bridging Excellent Research Societal Impact Culture: Animation Finance: e-invoice Health: Electronic Medical Record Network: Openflow Food Security: Food Traceability Transportation: etag system