Zenith: a hybrid P2P/cloud for Big Scientific Data




Zenith: a hybrid P2P/cloud for Big Scientific Data
Esther Pacitti & Patrick Valduriez, INRIA & LIRMM, Montpellier, France

Basis for this Talk
- Zenith: Scientific Data Management on a Large Scale. Esther Pacitti and Patrick Valduriez. ERCIM News 2012(89), 2012.
- Panel Session: Social Networks and Mobility in the Cloud. Amr El Abbadi and Mohamed F. Mokbel (moderators). VLDB, Istanbul, 2012.
- Principles of Distributed Data Management in 2020? P. Valduriez. Int. Conf. on Database and Expert Systems Applications (DEXA), keynote talk, Toulouse, 2011.

Outline of the Talk
- Big data
- Scientific big data
- Application examples
- Our approach
- P2P networks
- Cloud computing
- Hybrid P2P/cloud
- Examples of techniques

Big Data: what is it?
- A buzzword, with different meanings depending on your perspective. E.g. 100 terabytes is big for an OLTP system, but small for a web search engine.
- A simple definition (Wikipedia): data sets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics and visualization.
- How big is big? A moving target: petabyte (10^15 bytes), exabyte (10^18), zettabyte (10^21), ...
- For example, climate modeling data are growing so fast that collections of hundreds of exabytes are expected by 2020.
- But size is only one dimension of the problem.

Why Big Data Today?
- Overwhelming amounts of data generated by all kinds of devices, networks and programs, e.g. sensors, mobile devices, the internet, social networks, computer simulations, satellites, radiotelescopes, the LHC, etc.
- Increasing storage capacity: capacity has doubled every 3 years since 1980, with prices steadily going down. 295 exabytes: an estimate of the data stored by humankind in 2007 (probably more than 1 zettabyte today). [M. Hilbert, P. Lopez. The World's Technological Capacity to Store, Communicate, and Compute Information. Science, online Feb. 2011]
- Very useful in a digital world! Massive data can produce high-value information and knowledge, critical for data analysis, decision support, forecasting, business intelligence, research, (data-intensive) science, etc.

Big Data Dimensions
- Scale: massive amounts of data, hard to store and manage, but also hard to analyze (big analytics).
- Velocity: continuous data streams are being captured (e.g. from sensors or mobile devices) and produced, which makes online processing hard.
- Heterogeneity: each organization tends to produce and manage its own data, in specific formats, with its own processes, which makes data integration very hard.
- Complexity: uncertain data (because of data capture), multiscale data (with lots of dimensions), graph-based data, etc., which makes analysis hard.

AppEx1: Dark Energy Survey
- Context: CNPq-INRIA project with LNCC, Rio de Janeiro.
- DES: an international astronomical project to explain the acceleration of the universe and the nature of dark energy.
- Data production: DECam (a 570-megapixel digital camera) takes 1GB images (more than 1,000 per night); images are analyzed, galaxies and stars are identified and catalogued; catalogs are stored in a database.
- Problem: big data. The database is expected to reach 30 petabytes, with tables of 130 attributes (more than 1,000 in total) and complex queries, each possibly involving many attributes (>10).

AppEx2: Risers Fatigue Analysis (RFA)
- Context: CNPq-INRIA project with UFRJ and LNCC, Rio de Janeiro (and Petrobras).
- Pumping oil in ultra-deepwater, from thousands of meters below up to the surface, through risers.
- Problem: maintaining and repairing risers under deep water is difficult, costly and critical for the environment (e.g. to prevent oil spills). RFA requires a complex workflow of data-intensive activities which may take a very long time to compute.

RFA Workflow Example
- A typical RFA workflow. Input: 1,000 files (about 600MB) containing riser information, such as finite element meshes, winds, waves and sea currents, and case studies. Output: 6,000 files (about 37GB).
- Some activities, e.g. dynamic analysis, are repeated for many different input files; depending on the mesh refinements and other riser information, each single execution may take hours to complete.
- The sequential execution of the workflow, on an SGI Altix ICE 8200 cluster (64 quad-core CPUs), may take as long as 37 hours. Since the repeated analyses are independent, they are natural candidates for data parallelism (a minimal sketch appears after the next slide).

Scientific Data: common features
- Massive scale, complexity and heterogeneity.
- Manipulated through complex, distributed workflows.
- Important metadata about experiments and their provenance.
- Mostly append-only (with rare updates). Good news: no need for transactions.
- Collaboration among scientists is very important, e.g. biologists, soil scientists and geologists working on an environmental project.
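Back to the RFA workflow above: since each dynamic-analysis run is independent of the others, the 37-hour sequential execution is a natural target for data parallelism. Here is a minimal sketch (not from the talk; `run_dynamic_analysis` is a hypothetical stand-in for the actual solver invocation):

```python
# Minimal sketch: fan the repeated dynamic-analysis activity out over
# independent input files with a process pool. `run_dynamic_analysis`
# is a hypothetical placeholder for the real finite-element solver.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_dynamic_analysis(case_file: Path) -> Path:
    """Placeholder: run one dynamic-analysis case, return its output file."""
    out_file = case_file.with_suffix(".out")
    # ... invoke the solver on case_file, write results to out_file ...
    return out_file

def analyze_all(input_dir: str, workers: int = 64) -> list[Path]:
    cases = sorted(Path(input_dir).glob("*.case"))  # e.g. the 1,000 inputs
    # Cases are mutually independent, so they can run concurrently
    # instead of one after another.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_dynamic_analysis, cases))
```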

Scientific Data: the problem
- Current solutions are typically file-based for complex apps, application-specific (ad hoc), built with low-level code libraries, and deployed in large-scale HPC environments (cluster, grid, cloud).
- Problem: labor-intensive (development, maintenance); cannot scale (hard to optimize disk access); cannot keep pace (the data overload will just make it worse).
- "Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it's going to get worse." (DoE Office of Science Data-Management Challenge)

Why not Relational DBMS?
- RDBMSs all have a distributed and parallel version, with scalability to VLDBs and SQL support for all kinds of data (structured, XML, multimedia, streams, etc.).
- But the "one size fits all" approach has reached its limits: loss of performance, simplicity and flexibility for applications with specific, tight requirements.
- Not Only SQL (NoSQL) solutions (e.g. Google Bigtable, Amazon SimpleDB, MapReduce) trade data independence and consistency for scalability, simplicity and flexibility.
- NB: SQL has nothing to do with the problem.

Our Approach
- Study end-to-end solutions, from data acquisition and integration to final analysis.
- Capitalize on the principles of data management: data independence to hide implementation details (the basis for high-level services that can scale), and data partitioning (the basis for distributed and parallel processing).
- Foster declarative languages (algebra, calculus) to manipulate data and workflows, with user-defined functions.
- Exploit highly distributed environments: peer-to-peer (P2P) and cloud.

P2P Networks
- Decentralized control: no distinction between client and server; each peer can have the same functionality; no need for a centralized server.
- Massive scale: any computer connected to the internet can be a peer.
- Autonomy and volatility: peers can join or leave the network at will, at any time.
- Many successful apps: content sharing (e.g. BitTorrent), networking (e.g. Skype), search (e.g. YaCy), social networks (e.g. Diaspora).

P2P Architecture
- Highly distributed; dynamic self-organization; load balancing; parallel processing; high availability through massive replication.
[Diagram: five interconnected peers]

P2P Benefits
- Ease of installation and control: no need for centralized administration.
- Scalability to very high numbers of nodes.
- Cheap: leverages existing resources.
- Efficient with structured networks (DHTs), as sketched just below.
- Data privacy: one may keep full control over her own data.
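To see why structured overlays are efficient, here is a toy illustration of the DHT idea (an illustrative sketch under simplifying assumptions, not Chord or Kademlia): peers and keys are hashed onto the same identifier ring, and each key is owned by the first peer clockwise from it, so a lookup can be routed directly instead of flooded.

```python
# Toy sketch of the DHT idea: peers and keys share one hashed id space,
# and a key is stored on the first peer whose id follows it on the ring.
# A real DHT routes lookups in O(log n) hops; here we use a global view.
import bisect
import hashlib

def ring_id(name: str, bits: int = 32) -> int:
    """Hash a peer or key name onto the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

class TinyDHT:
    def __init__(self, peers: list[str]):
        self.ring = sorted((ring_id(p), p) for p in peers)

    def responsible_peer(self, key: str) -> str:
        """First peer clockwise from the key's position on the ring."""
        ids = [pid for pid, _ in self.ring]
        i = bisect.bisect_right(ids, ring_id(key)) % len(self.ring)
        return self.ring[i][1]

dht = TinyDHT(["peer1", "peer2", "peer3", "peer4", "peer5"])
print(dht.responsible_peer("climate-dataset-42"))  # deterministic owner
```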

P2P Drawbacks
- Decentralized administration: hard to control.
- Security, in the presence of untrusted nodes.
- Churn and low reliability of user machines.
- Latency in unstructured networks, because of flooding.

Cloud Computing
- The vision: on-demand, reliable services provided over the Internet (the "cloud") with easy access to virtually infinite computing, storage and networking resources.
- Simple and effective! Through simple Web interfaces, users can outsource complex tasks: data management, system administration, application deployment. The complexity of managing the infrastructure is shifted from the users' organization to the cloud provider.
- Capitalizes on previous computing models: Web services, utility computing, cluster computing, virtualization, grid computing.
- A very successful business model: Amazon, Microsoft, Google, IBM, etc.

Cloud Architecture
- Single point of access using Web services; highly centralized; big clusters; replication across sites for high availability; scalability.
- Service level agreements (SLAs), accounting and pricing are essential.
[Diagram: clients issue Web-service calls (create/start/terminate VMs, reserve, store, pay) against clusters of service, compute and storage nodes]

Cloud Benefits
- Reduced cost. Customer side: the IT infrastructure need not be owned and managed, and is billed only based on resource consumption. Provider side: sharing costs across multiple customers reduces the cost of ownership and operation to the minimum.
- Ease of access and use: customers can access IT services anytime, from anywhere with an Internet connection; centralized administration is simpler.
- Quality of service (QoS): operation of the IT infrastructure by a specialized, experienced provider (including with its own infrastructure) increases QoS.
- Elasticity: easy for customers to deal with sudden load increases by simply creating more virtual machines (VMs).

Cloud Drawbacks
- Centralized architecture with a single point of access: hard to move big data in and out (high bandwidth required); single point of failure.
- Loss of control. Your data can get lost! E.g. the Amazon EC2 crash disaster in April 2011 permanently destroyed some customers' data. Your data can get locked! E.g. Megaupload was shut down by the FBI in January 2012.
- Security and privacy: different privacy laws across countries, hard to enforce when storage is subcontracted; business models of social networks like Facebook.

Why P2P/cloud?
- Can combine the best of both P2P and cloud for scientific data management.
- P2P naturally supports the collaborative nature of scientific applications, with autonomy and decentralized control. Peers can be the participants or organizations involved in a collaboration, and may share data and applications while keeping full control over some of their data.
- Cloud: for very large-scale data analysis or very large workflow activities, cloud computing is appropriate, as it can provide virtually infinite computing, storage and networking resources.
- P2P/cloud also enables the clean integration of the users' own computational resources with different clouds.

P2P/cloud Architecture
[Diagram: peers running Zenith P2P data services (data sharing, recommendation, data sources) connected through a Shared-data Overlay Network (SON); Zenith cloud data services (data analysis and data mining) running on multiple clouds]
- Combines the best of P2P and cloud: P2P for data sharing between collaborating participants; cloud for elastic parallel data processing.

AppEx1: Dark Energy Survey
- Problem: how to store big data (tables with thousands of attributes) so as to process complex queries efficiently.
- Our approach: partition the datasets based on access patterns, to co-locate data items that are accessed together while minimizing inter-partition correlation.

Data Partitioning
[Diagram: a key/values table split vertically vs. horizontally]
- Vertical partitioning: the basis for column stores (e.g. Vertica).
- Horizontal partitioning: the basis for key-value stores (e.g. Bigtable).

Dynamic Workload-Based Partitioning
- Main idea: gradually adapt the partitioning to the new data, instead of repartitioning.
- When a new data item arrives, choose the best partition based on the affinity of the data to the partitions. Affinity(d, p): the number of queries that access both d and some data from partition p.
- Challenge: how to choose the best partition? We developed an algorithm called DynPart that computes Affinity(d, p) based on the affinity of queries with partitions; its complexity is O(|Q| * |P|). A minimal sketch follows below.
- M. Liroz-Gistau, R. Akbarinia, E. Pacitti, F. Porto, P. Valduriez. Dynamic Workload-Based Partitioning for Large-Scale Databases. DEXA (2) 2012: 183-190.
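The sketch announced above: an illustrative rendering of affinity-based placement in the spirit of DynPart. The workload representation (each query as the set of item ids it accesses) and the size bound are simplifying assumptions, not the paper's actual data structures.

```python
# Illustrative sketch of affinity-based dynamic partitioning in the
# spirit of DynPart (Liroz-Gistau et al., DEXA 2012). The workload and
# partition representations are simplifying assumptions.

def affinity(item, partition, workload):
    """Number of queries that access `item` together with some data
    already placed in `partition`."""
    return sum(1 for q in workload if item in q and q & partition)

def place(item, partitions, workload, max_size):
    """Assign an arriving item to the partition with the highest
    affinity among those that still have room."""
    candidates = [p for p in partitions if len(p) < max_size]
    best = max(candidates, key=lambda p: affinity(item, p, workload))
    best.add(item)

# Usage: two queries over items a-d, two partitions of capacity 3.
workload = [{"a", "b"}, {"c", "d"}]   # each query = the items it accesses
partitions = [{"a"}, {"d"}]
place("b", partitions, workload, max_size=3)  # co-accessed with "a"
place("c", partitions, workload, max_size=3)  # co-accessed with "d"
print(partitions)  # partitions now hold {a, b} and {c, d}
```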

AppEx2: RFA
- Problem with data-centric scientific workflows: typically complex, manipulating many large datasets, with computationally-intensive and data-intensive activities, thus requiring execution on large-scale parallel computers. However, parallelization of scientific workflows remains low-level, ad hoc and labor-intensive, which makes it hard to exploit optimization opportunities.
- Solution: an algebraic approach (inspired by relational algebra) and a parallel execution model that enable automatic parallelization of scientific workflows.
- E. Ogasawara, J. Dias, D. Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-Centric Scientific Workflows. VLDB 2011.

Algebraic Approach
- Activities consume and produce relations. E.g. dynamic analysis consumes tuples with input parameters and references to input files, and produces tuples with analysis results and references to output files.
- Operators provide semantics to activities: operators that invoke user programs (Map, SplitMap, Reduce, Filter) and relational expressions (SRQuery, join query).
- Algebraic transformation rules enable optimization and parallelization.
- The execution model for this algebra is based on self-contained units of activity activation, inspired by tuple activations for hierarchical parallel systems [Bouganim, Florescu & Valduriez, VLDB 1996]. A rough sketch follows below.
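The sketch announced above: a rough rendering of the algebraic view, in which an activity is an operator from relations to relations and each input tuple forms a self-contained activation. The tuple schema and user program are illustrative assumptions; only the operator names follow the paper.

```python
# Rough sketch of the algebraic view: an activity consumes a relation
# (tuples carrying parameters and file references) and produces one.
# Each input tuple is a self-contained activation, so a scheduler is
# free to run activations in parallel (as in the pool sketch earlier).
from typing import Any, Callable

Relation = list[dict[str, Any]]  # a relation = a list of tuples

def map_op(program: Callable[[dict], dict], r: Relation) -> Relation:
    """Map: invoke a user program once per tuple (one activation each)."""
    return [program(t) for t in r]

def split_map_op(program: Callable[[dict], Relation], r: Relation) -> Relation:
    """SplitMap: a user program that emits several tuples per input."""
    return [out for t in r for out in program(t)]

def filter_op(pred: Callable[[dict], bool], r: Relation) -> Relation:
    """Filter: keep only the tuples satisfying a predicate."""
    return [t for t in r if pred(t)]

def dyn_analysis(t: dict) -> dict:
    """Placeholder user program standing in for the real solver call."""
    return {**t, "result": f"analyzed:{t['file']}"}

# A workflow is then an algebraic expression, open to classic rewrites,
# e.g. applying Filter before Map to prune useless activations:
cases: Relation = [{"file": f"case{i}.dat", "mesh": i} for i in range(4)]
results = map_op(dyn_analysis, filter_op(lambda t: t["mesh"] > 0, cases))
```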

Challenges and Research Directions Scalable query processing with big data Scientific workflow management with big data Data provenance management Decentralized recommendation for big data sharing Decentralized data mining with privacy preservation 29 29 15