Chunming Rong
Chair (IEEE CloudCom), Head (CIPSI), Professor (UiS)
chunming.rong@uis.no

Tomasz Wiktor Wlodarczyk
Big Data Chair (IEEE CloudCom), Administrative Head (CIPSI), Associate Professor (UiS)
chunming.rong@uis.no
Index
- Characterization of Cloud Computing
- Characterization of Big Data
- Comparison with HPC
- Example Applications
- Activities and Events to Follow
What is different in cloud?
- Resource Virtualization
- Data-driven
- Shared
Virtualized Computing Resource
[Figure: international wireless standards by coverage area, with IEEE and ETSI counterparts: IEEE 802.20 (mobile BWA, WAN, 3G); IEEE 802.16 (BWA) and ETSI HiperAccess; IEEE 802.16a (WMAN) and ETSI HiperMAN; IEEE 802.11 (WLAN) and ETSI HiperLAN; IEEE 802.15 (Bluetooth, PAN) and ETSI HiperPAN. Network diagram labels: ERP, Ethernet LAN, backbone network (fibre, radio link, satellite), operator workstation, MAN (WiMax), PAN, wireless sensor network, floater.]
What Changes?
[Figure: stack layers per deployment model, shaded by who controls each layer: you, shared, or vendor]
- On Premise: Application, Server, Storage, Network
- IaaS: Application, Virtual Machine, Server, Storage, Network
- PaaS: Application, Virtual Machine, Server, Storage, Network
- SaaS: Application, Virtual Machine, Server, Storage, Network
Source: Mather, Kumaraswamy and Latif, Cloud Security and Privacy, O'Reilly, 2009
Center for IP-based Service Innovation
Openness: Shareability and Freedom
- Open software
- Open services
- Open data
2020
- 1 billion new Internet users
- 20+ million apps
- 30+ billion devices
- 1+ trillion sensors
- 50+ million petabytes of data
Everyone, everything interconnected
Requirements by Today's Users
- Accessibility: access from anywhere and from multiple devices
- Shareability: make sharing as easy as creating and saving
- Freedom: users don't want their data held hostage
- Simplicity: easy to learn, easy to use
- Security: trust that data will not be lost or seen by unwanted parties
Oceans of Data, Skinny Pipes
1 Terabyte: easy to store, hard to move

Disks               MB/s      Time
Seagate Barracuda   115       2.3 hours
Seagate Cheetah     125       2.2 hours

Networks            MB/s      Time
Home Internet       < 0.625   > 18.5 days
Gigabit Ethernet    < 125     > 2.2 hours
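The times above follow from dividing the data size by the sustained throughput. The sketch below reproduces that arithmetic, assuming decimal units (1 TB = 10^6 MB) and a constant transfer rate; the function name `transfer_hours` is illustrative only. Under these assumptions the 115 MB/s case comes out at about 2.4 hours, slightly above the 2.3 hours quoted on the slide.

```python
# Time to move 1 TB at a given sustained throughput (decimal units assumed).
TERABYTE_MB = 1_000_000  # 1 TB = 10^6 MB


def transfer_hours(throughput_mb_s: float, size_mb: float = TERABYTE_MB) -> float:
    """Return the hours needed to move size_mb at throughput_mb_s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_mb / throughput_mb_s / 3600


# Throughput figures taken from the slide.
links = {
    "Seagate Barracuda": 115,
    "Seagate Cheetah": 125,
    "Home Internet": 0.625,
    "Gigabit Ethernet": 125,
}

for name, mb_s in links.items():
    hours = transfer_hours(mb_s)
    print(f"{name:18s} {mb_s:>8} MB/s  {hours:8.1f} h  ({hours / 24:.1f} days)")
```

The network rows are upper bounds on throughput, so the computed times are lower bounds, which is why the slide marks them with "<" and ">".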
Scalable Computing Resource
Data-Intensive Computing Challenge
For computation that accesses 1 TB in 5 minutes:
- Data distributed over 1000+ disks, assuming uniform data partitioning
- Compute using 1000+ processors
- Connected by 10 gigabit Ethernet (or equivalent)
System Requirements
- Lots of disks
- Lots of processors
- Virtualized architecture in different locations
- Huge load on network
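The challenge can be put in numbers with a short calculation. This is a rough sketch assuming decimal units, exactly 1000 disks, and perfectly uniform partitioning; the function name `required_bandwidth` is hypothetical.

```python
# Bandwidth needed to scan 1 TB in 5 minutes, spread over 1000 disks.
DATA_BYTES = 10**12        # 1 TB, decimal units (assumption)
WINDOW_S = 5 * 60          # 5 minutes
NUM_DISKS = 1000           # uniform partitioning, as stated on the slide


def required_bandwidth(data_bytes: int, window_s: int, num_disks: int) -> tuple[float, float]:
    """Return (aggregate MB/s, per-disk MB/s) for a full scan in the time window."""
    aggregate_mb_s = data_bytes / window_s / 10**6
    return aggregate_mb_s, aggregate_mb_s / num_disks


aggregate, per_disk = required_bandwidth(DATA_BYTES, WINDOW_S, NUM_DISKS)
print(f"aggregate: {aggregate:,.0f} MB/s ({aggregate * 8 / 1000:.1f} Gbit/s)")
print(f"per disk:  {per_disk:.2f} MB/s")
```

Each disk only has to deliver a few MB/s, but the aggregate rate is roughly 27 Gbit/s, more than a single 10 gigabit link can carry, so the traffic has to be spread over many links.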
Desiderata for Data Intensive Systems
- Focus on Data: petabytes, not peta-flops
- Problem-Centric Programming: platform-independent expression of data parallelism
- Interactive Access: from simple queries to massive computations
- Robust Fault Tolerance: component failures are handled as routine events
Contrast to existing High Performance Computing (HPC) systems
Simplistic Comparison
Database Management System
- Data stored according to schema
- Declarative query language
- Many sophisticated optimizations
- Support small & large queries
- Limited scaling
Map/Reduce
- Data stored as unstructured files
- User-defined map & reduce functions
- Runtime system fairly straightforward
- Batch processing of data only
- Designed to operate on massive scale
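To make "user-defined map & reduce functions" concrete, here is a toy word count. It is a single-process sketch of the programming model only, not Hadoop's or Google's API; `map_words`, `reduce_counts`, and `run_job` are names made up for this illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """User-defined map: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """User-defined reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy runtime: apply the mapper, group values by key, apply the reducer."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Unstructured input: plain lines of text, no schema.
lines = ["big data in the cloud", "the cloud stores big data"]
result = run_job(lines, map_words, reduce_counts)
print(result)
```

The user supplies only the two small functions; a real runtime would distribute the map calls, the grouping step, and the reduce calls across many machines.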
Cloud Data Management
Database Management Systems
- Relational Database Management Systems (RDBMS)
- Object-Oriented Database Management Systems (OODBMS)
- Non-Relational, Distributed DB Mgmt Systems (NRDBMS)
- Not only Structured Query Language (NoSQL)
Online Transaction Processing (OLTP)
Real-time Data Warehousing
Online Analytical Processing (OLAP)
Operational Data Stores (ODS)
Enterprise Data Warehouse (EDW)
Big-Data
Aggregated data from the following sources:
- Traditional
- Sensory
- Social
Aggregators, predominantly NRDBMS:
- Key-Value Stores: App Engine DataStore (Google), DynamoDB & SimpleDB (AWS)
- Column Family Stores: Cassandra (Facebook), BigTable (Google), HBase (Apache)
- Document Databases: CouchDB (Apache), MongoDB
- Graph Databases: Neo4J (Neo Technology)
ERAC project (supported by NFR)
Efficient and Robust Architecture for the Big Data Cloud
2012-2016
Big-Data Processing
Serial Processing
- Hadoop
- Hadoop Distributed File System (HDFS)
- Hive Data Warehouse
- Pig Querying Language
Parallel Processing
- HadoopDB (Yale)
Other Analytics Processing
- Google MapReduce
- Splunk for Security Information / Event Management [SIEM]
Hadoop
Open-source Java MapReduce for reliable, scalable, distributed computing.
Moving Computation is Cheaper than Moving Data
HadoopDB: Open Source Parallel Database
- A hybrid of DBMS and MapReduce technologies that targets analytical workloads
- Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
- An attempt to fill the gap in the market for a free and open source parallel DBMS
- Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems
- As scalable as Hadoop, while achieving superior performance on structured data analysis workloads
- Database layer: a cluster of multiple single-node DBMS servers
- Communication layer: Hadoop/MapReduce, which coordinates the multiple nodes, each running e.g. PostgreSQL or MySQL
- Translation layer: Hive
- A shared-nothing parallel database that business analysts can interact with using a SQL-like language
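The layering can be illustrated with a small simulation: several independent single-node databases each hold one partition, the same SQL aggregate is pushed to every node, and a coordinator merges the partial results. This is only a sketch of the idea and not HadoopDB code. It uses in-memory SQLite databases as stand-ins for the PostgreSQL or MySQL nodes, and the `sales` table and its rows are invented for the example.

```python
import sqlite3

# Hypothetical sales rows, split round-robin across three independent "nodes".
ROWS = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 8), ("east", 4)]
NUM_NODES = 3


def make_node(rows):
    """Create one single-node database holding only its own partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# Shared-nothing layout: each node sees only the rows assigned to it.
partitions = [ROWS[i::NUM_NODES] for i in range(NUM_NODES)]
nodes = [make_node(part) for part in partitions]

# "Map" phase: push the same SQL aggregate down to every node.
LOCAL_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"
partials = [row for conn in nodes for row in conn.execute(LOCAL_SQL)]

# "Reduce" phase: the coordinator merges the partial sums per region.
totals = {}
for region, subtotal in partials:
    totals[region] = totals.get(region, 0) + subtotal

print(totals)
```

Pushing the aggregate into each node's database keeps most of the work next to the data, so only small partial results cross the network.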
CIPSI Infrastructure: Data cluster
- 39 nodes
- 1 TB RAM
Time Series Data
- To store, index and serve metrics collected from large scale systems
- To make this data easily accessible and graphable
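As a rough illustration of "store, index and serve", the sketch below keeps each series as a time-sorted list and answers range queries by binary search. It is a minimal in-memory toy and does not represent the API or storage layout of any real time series database; the class name `TinyTimeSeriesStore` and the metric and tag names are assumptions made for the example.

```python
import bisect
from collections import defaultdict


class TinyTimeSeriesStore:
    """In-memory store: one time-sorted list of (timestamp, value) per series."""

    def __init__(self):
        self._series = defaultdict(list)

    @staticmethod
    def _key(metric, tags):
        # A series is identified by its metric name plus its sorted tag pairs.
        return metric, tuple(sorted(tags.items()))

    def put(self, metric, timestamp, value, **tags):
        """Insert a data point, keeping the series ordered by timestamp."""
        bisect.insort(self._series[self._key(metric, tags)], (timestamp, value))

    def query(self, metric, start, end, **tags):
        """Return points with start <= timestamp < end, located by binary search."""
        points = self._series.get(self._key(metric, tags), [])
        lo = bisect.bisect_left(points, (start,))
        hi = bisect.bisect_left(points, (end,))
        return points[lo:hi]


store = TinyTimeSeriesStore()
store.put("cpu.load", 30, 0.9, host="node01")
store.put("cpu.load", 10, 0.4, host="node01")
store.put("cpu.load", 20, 0.7, host="node01")
store.put("cpu.load", 15, 0.2, host="node02")

recent = store.query("cpu.load", 10, 30, host="node01")
print(recent)
```

Keeping points sorted by timestamp is what makes range queries cheap, and the returned list can be handed directly to a plotting tool.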
Fine-grained Real-time Monitoring
- Get real-time state information about infrastructure and services
- Understand outages or how complex systems interact together
- Measure SLAs (availability, latency, etc.)
- Tune applications and databases for maximum performance
- Do capacity planning
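Two of the SLA measures named above, availability and latency, can be derived from raw monitoring samples as follows. This is a sketch under simple assumptions: availability is the fraction of successful health checks, and latency is summarized with a nearest-rank percentile. The sample values are invented.

```python
import math


def availability(checks: list[bool]) -> float:
    """Fraction of health checks that succeeded."""
    return sum(checks) / len(checks)


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical monitoring samples for one service.
checks = [True] * 98 + [False] * 2
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 90]

print(f"availability: {availability(checks):.1%}")
print(f"p50 latency:  {percentile(latencies_ms, 50)} ms")
print(f"p90 latency:  {percentile(latencies_ms, 90)} ms")
```

Percentiles are often more informative than the mean here, because a few slow requests (such as the 240 ms sample) dominate an average while leaving the median unchanged.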
OpenTSDB
Smart Grid @ Clouds
Self learning Energy Efficient buildings and open Spaces
http://www.seeds-fp7.eu
EeB-ICT-2011.6.4: ICT for energy-efficient buildings and spaces of public use
UiS Demo Space
Analysis of IoT Data
Aging in Place: Safer@Home
Analysis of IoT Data
e-health @ Clouds
[Figure: Integrated Service Platform. Diagram labels: Anywhere Access; Home Automation; Notification Server; Data Analytic; Workflow Engine; SLA, Billing, Provisioning; Monitoring; Planning; Care; Sensors; Subscriber and Service Management; Internet Security; IP-based Networks; Social interaction/Gaming; Smart Device; Access Gateway; Home Gateway; Firewall; Internet Services; Health & Care Service Portal; Services; Stakeholders.]
Secure Architecture
- Data Collector
- Configurable Secure Lightweight Data Receiver
- Deidentified Data Store
- Analysis of IoT Data
Lighting control
Science @ Cloud
Norway and North America Program
Education is currently one of the biggest challenges in Big Data and Data-intensive Science
- 2 MNOK (2012-2016)
- Staff and student exchanges
- Joint curriculum development, teaching and student supervision
- Further funding collaboration
[Figure: conference logos for 2009, 2010, 2011 and 2012]
CloudCom 2013 Bristol, UK, Dec. 2-5, 2013 http://2013.cloudcom.org
EU-China Workshop on HPC Cloud & Big-Data Stavanger, 20-21 June, 2013 http://euchina2013.cloudcom.org