Development of Big Data Infrastructure in NCHC: From Sensing to Understanding Fang-Pang Lin National Center for High-Performance Computing National Applied Research Laboratories, Taiwan ASC 2015
BIG DATA ECOSYSTEM : FROM DATA TO DECISIONS Source: Craig Stires DATA CREATION DATA ACQUISITION INFO PROCESSING BUSINESS PROCESS PRODUCERS ARCHITECTS / ENGINEERS ANALYSTS / SCIENTISTS END USERS Machine and Sensors Shared Nothing Scale-out Storage + SSD Data Exploration Access-Anywhere Analytics Services ON-DEMAND Transaction and Usage Logs Geolocation Mobile Apps Data Email and Messaging VOLUME VELOCITY VARIETY MPP + In-Memory Compute Non-relational DWH Converged Infrastructure Hi-Speed / - Resiliency Networking Hadoop DEEP INSIGHTS REAL-TIME EVENTS Contextualized Data Modeling / Scenarios Forecasting Stream Processing PUSH Context-Aware Business Applications Location-Based Services Alert and Respond Workflow and Interaction Automation VALUE Relationships and Social Influence SYSTEMS INTEGRATION Cloud OBJECTIVES Event Management EMBEDDED DELIVERY MODELS Smart devices and systems 2 2 Copyright IDC (2013)
BIG DATA ECOSYSTEM : FROM DATA TO DECISIONS Source: Craig Stires DATA CREATION DATA ACQUISITION INFO PROCESSING BUSINESS PROCESS PRODUCERS ARCHITECTS / ENGINEERS ANALYSTS / SCIENTISTS END USERS User-gen Text 68% (Data collected) Machine and Sensors Transaction and Usage Logs Geolocation Transactional 67% (Data collected) Mobile Apps Data Email and Messaging Machine or device 41% (Data collected) Relationships and Social Influence VOLUME VELOCITY VARIETY SYSTEMS INTEGRATION Shared Nothing Scale-out Storage + SSD MPP + In-Memory Compute Non-relational DWH Cloud Managing >5TB data 36% Converged Infrastructure Deployed/ing Hadoop 21% Hi-Speed / - Resiliency Networking Keep/discard which data Hadoop 35% (#1 BD Challenge) DEEP INSIGHTS REAL-TIME EVENTS OBJECTIVES Deployed/ing Data Exploration Text Analytics 26% Contextualized Data Modeling / Scenarios Managing Data Quality (#1 IT Forecasting Challenge) Stream Processing Deployed/ing Event Processing 22% Event Management ON-DEMAND PUSH EMBEDDED DELIVERY MODELS Customer Engagement Access-Anywhere Analytics Services [microsegment, net promoter, personalization] (2013 BD Hotspots) Context-Aware Business Applications Geofencing [retail offers, asset Location-Based tracking] Services (GPS Innovations) Alert and Respond Smart Cities Workflow and Interaction Automation [power, water, traffic, safety] (M2M Innovations) Smart devices and systems VALUE 3 3 Copyright IDC (2013)
APeJ BDA Maturity by Country Market Size (USD) Source: Craig Stires Source: IDC APeJ Big Data Maturity Assessment and Benchmark 2013 (n=802) Source: IDC APeJ Big Data Market Analysis & Forecast 2013-2017, Oct 2013 IDC Visit us at IDC.com and follow us on Twitter: @IDC 4
Trends among Big Data leaders Hyper-competition, regulation, analytics sophistication Source: Craig Stires HK - IaaS investments, Telco asset utilization, FSI risk SGP - FSI microsegment, Telco LBS & data monetization, ICA AUS - FSI personalization, Telco psychographic, Data privacy NZ - Emergency svc, SME SaaS, Fleet mgmt, Farm asset mgmt IDC Visit us at IDC.com and follow us on Twitter: @IDC 5
Trends among Big Data midstream Mass urbanization, first-measured, neighborly pressure Source: Craig Stires KOR - M2M automation, Manu QC, Govt NLP audio analytics PRC - Web 2.0 [ABT] xacts, City CCTV, FSI microfinance TWN - Smart Tourism, Data Privacy, Manu process control IND - Citizen registration, banking the unbanked, Big Data insource IDC Visit us at IDC.com and follow us on Twitter: @IDC 6
The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: Modeling from hypothesis to discovery Phenomena, Behavior etc
Ecogrid: Sense the nature in a new way (10 years on)
The Path from Infrastructure to Data Sensing for Understanding Sensing: Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: Modeling from hypothesis to discovery Phenomena, Behavior.. etc.
Fish4Knowledge human level query for Marine Biology U. of Edinburgh (UK) Video Data: 10 camera years = 10 cameraframes = 112 Tb massive data storage Descriptive data: 20 Tb Fish: 10^10 detections Summary data: 500 Gb Target: 1 sec query answering Centrum Wiskunde & Informatica (Netherlands) UCATANIA (Italy) NCHC NCHC: sustainable system for data acquisition, storage, and computing. U. of Catania in Italy: fish detection and tracking. U. of Edinburgh in UK: workflow, fish recognition, and fish behavior. CWI in Netherlands: user interfaces. 10
Global Lake Ecological Observational Network (GLEON) Lakebase: harvest quality data from internet and collect more than ~25,000 lakes across the world. Global compute service through CONDOR >10 major real time observational data from selected GLEON sites. Source: Pau Hanson 11
Infrastructure enables Fish4Knowledge Cloud of Resources Clients Storage platform Service Frontend Video Server Data Source Computing platform Storage Nodes Computing Nodes Marine Biologists Application Developers 2015/5/25 12
2015/5/25 Work On Linking Open Data (e.g. FAO) 13
Video Classification (~350K videos from 2009 to 2013) Algae: 9.2% Blurred: 33.5% Complex Scenes: 4.3% Encoding: 23.9% Highly Blurred: 13.9% Normal: 12.9% Unknown: 2.2% 14
The SEATANK Event & Other SEATANK event It happened again! Require scientific evidences for Chinese Taipei to win the lawsuit. (TORI, NSPO, NCHC) 2013.1 Freighter SEATANK stranded in Penhu water 2006 Cargo ship Tzini stranded in Yilan water and leaks >100 tons fuel Oil 15
The Path from Infrastructure to Data Sensing for Understanding Sensing: (Networks change the game!) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: (Data change the game!) Modeling from hypothesis to discovery Phenomena, Behavior etc
CC image by Sharyn Morrow on Flickr Natural disaster Facilities infrastructure failure Storage failure Server hardware/software failure Application software failure External dependencies (e.g. PKI failure) Format obsolescence Legal encumbrance Human error Malicious attack by human or automated agents Loss of staffing competencies Loss of institutional commitment Loss of financial stability Changes in user expectations and requirements Source: http://blog.alltop.com.tw/story/archives/3278 Source: DataONE
Info. content Time for Publication Specific Details Data Management General Details Retirement or Career Change Accident Death Time (From Michener et al 1997)
The Treasure of NARLabs: Earth Science Observational Data 19
Big Data Infrastructure Challenge in ESOD Data Features: high complexity, large scale, frequency, real-time and stream Big Data Data Size: 155TB/yr. (Actual data size will 2~3x) Currently, 103 TB for Q1. Simulation data 200TB from TTFRI. Real time, High frequency Data: Process > 18,000 records/sec. Currently 6,000 records/sec. 20
ESOD Big Data Service ~ 1PB External data sources Data-intensive computing system (e.g. Hadoop) Parallel data server Parallel compute server Parallel query server Continuous query stream Query Result Continuous query results External client d 1 d 2 d 3 Source dataset Derived datasets Parallel file system (e.g., GFS, HDFS, GPFS) Virtually Centralized Resources Source: David O Hallaron Characteristics: Small queries and results Massive data and computation performed on server Examples: Search Photo scene completion Log processing Science analytics 21
ESOD Hardware Architecture Challenges of use of distributed resources Exmaples of Datasets 1 TTFRI 3 1 3 NCREE NSPO 1 2 1G - 10G Fiber TORI 8G Optical Fiber WindRider: 40 G infinitband (Lustre FS) GPFS Preload/Stage ESOD Storage 1 2 1 2 3 NCHC 10G Bridge Gateway GPFS Disk 918TB Tape 433TB 22
ESOD Data Archiving and System Infrastructure HAProxy https ftps NCHC firewall Failover between 3 sites. Automatic load balancing. Transparent file systems: direct read/write of files across different sites. Automatic storage backup. Load balance server Slave Load balance server Master High IOPS WindRider Formosa 2/3/5 Paralllel DBMS Scale out > 1 PB 2 X 10GE 2 X 10GE 2 X 10GE HSM General Parallel File System Tape library HS M Tape library General Parallel File System HSM General Parallel File System Tape library 23 23
Data discovery: Metadata Catalog Use relational database (mysql) with multi-dimensional schema design to speed up searching (migrating to NOSQL solutions now) System Metadata User name space - Address / e-mail / telephone number - Role (administrator, curator, user) - File name space Example: geonetwork - Creation date / size / location / checksum - Owner / access controls Storage resource name space - Capacity / quotas / Type (archive, disk, fast cache) Domain Metadata User-given metadata - Key-Value-Unit Triplets, Annotation - Relational / XML Metadata - Domain-specific Schema Adopt OGC standards 24
Smart query and answer Develop a set of control vocabulary based on int l standards, e.g. HDF, NetCDF, OGC etc. Derive RDF triple dataset from common query tasks of ESOD. Combination of visual data plus metadata to support a specific high-level information seeking user task. Design of an interface for graph & visual comparison search. Selected one specific task, that of comparing sets of objects, and designed a prototype interface on top of linked data sets used by the experts to support this task explicitly Develop methods for data provenance. 25
The Path from Infrastructure to Data Sensing for Understanding Sensing: (Infrastructure-Centric ) Evolve since 10 years ago: Ecogrid, SARS Grid, etc Institutional missions based on special vehicles: Satellites, Research Ships & Aircrafts, Met Stations etc. It is growing even larger and broader, e.g. IOT, social network. Understanding: (Data-Centric) Modeling from hypothesis to discovery Phenomena, Behavior etc in a sophisticated manner.
Government Big Data Government Clouds & Open data Big Data 10 government clouds: Health, Food, e-invoice, Transportation, Environment, Finance, Education, Culture, Disaster Prevention, Geospatial Information, Agriculture and e-government, focusing on societal impact via Research & innovation. 3 kinds of data: Infrastructural Data, Public data and Personal data, open but required protection. Open, but Protected Licensing models: Open Government License (OGL), Non-Commercial Government License, Charged License. Access Platform Facility and Systems to enable Data store, Management and Analytics. Access Control with the licensing models and use scenarios.
Government Big Data NCHC/NARLabs provides Big Data Platform, Bridging Excellent Research Societal Impact Culture: Animation Finance: e-invoice Health: Electronic Medical Record Network: Openflow Food Security: Food Traceability Transportation: etag system