
Big Data Management in the Clouds
Alexandru Costan, IRISA / INSA Rennes (KerData team)
Cumulo NumBio 2015, Aussois, June 4, 2015

After this talk
- Realize the potential: Data vs. Big Data
- Understand why we need a different paradigm
- Recognize some of the main terminology
- Know the existing tools

Outline
- Big Data overview: sources of Big Data: YOU!
- Storage: SQL vs. NoSQL
- Processing: Hadoop MapReduce

1. The Big Data Deluge

Context: the Big Data Deluge
- Deliver the capability to mine, search and analyze this data in near real time
- Science itself is evolving
Credits: Microsoft

Data Science: Enabling Discovery. The 4th Paradigm for Scientific Discovery.
- A thousand years ago: description of natural phenomena
- Last few hundred years: Newton's laws, Maxwell's equations, e.g. (ȧ/a)² = 4πGρ/3 - Kc²/a²
- Last few decades: simulation of complex phenomena
- Today and the future: unify theory, experiment and simulation with large multidisciplinary data, using data exploration and data mining (from instruments, sensors, humans, distributed communities)
Credits: Dennis Gannon

What is Big Data?
- "Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (McKinsey Global Institute)
- "Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Wikipedia)

How big is Big Data?
- Eric Schmidt (2011): "Every 2 days we create as much information as we did up to 2003." That is about 5 billion gigabytes (5 exabytes) of data; in 2014, the same amount was created every 7 minutes.
- Total size of the digital universe in 2014: 4.4 zettabytes

Big Data Units

Big picture of Big Data

Common features: the 3 Vs (Volume, Velocity, Variety)
- Unstructured data
- Produced in real time
- Arrive in streams or batches from geographically distributed sources
- Have metadata (location, day, hour, etc.)
- Heterogeneous sources (mobile phones, sensors, tablets, PCs, clusters)
- Arrive out of order and unpredictably

What is needed?
- Computation/storage power. Cloud computing allows users to lease computing and storage resources in a pay-as-you-go manner
- Programming models. MapReduce: a simple yet scalable model for Big Data processing

To cloud or not to cloud my data?
Benefits:
- Control structure
- Illusion of unlimited resources
- No up-front commitment (pay as you go)
- On-demand
- (Very) short-term allocation
- Close to 100% transparency
- Increased platform independence
Core costs:
- Storage ($/MByte/year)
- Computing ($/CPU cycle)
- Networking ($/bit)
Reality is much more mundane!

Geographically distributed datacenters
Credits: Microsoft

Data-intensive processing on clouds: where are we?
- Costs of outsourcing data to the cloud
- Computation-to-data latency is high!
- Scalable concurrent data accesses to shared data
- Cloud storage used as an intermediate for data transfers

2. Storage: SQL vs. NoSQL

Relational databases
- Dominant model for the last 30 years
- Standard, easy-to-use, powerful query language (SQL): declarative: users state what they want, and the database internally assembles an algorithm and extracts the requested results
- Reliability and strong consistency in the presence of failures and concurrent access
- Support for transactions (ACID properties)
- Orthogonal to data representation and storage
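To make the declarative point concrete, here is a minimal JDBC sketch (the table, columns and connection string are hypothetical, for illustration only): the query states what rows are wanted, and the database decides how to retrieve them.

    import java.sql.*;

    public class DeclarativeQuery {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection string and schema, for illustration only.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/demo", "user", "pass");
                 PreparedStatement stmt = conn.prepareStatement(
                     // WHAT we want: names and salaries in a department above a threshold.
                     // HOW to get it (indexes, scan order) is left to the database.
                     "SELECT name, salary FROM employees WHERE dept = ? AND salary > ?")) {
                stmt.setString(1, "R&D");
                stmt.setInt(2, 50000);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next())
                        System.out.println(rs.getString("name") + ": " + rs.getInt("salary"));
                }
            }
        }
    }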

Weaknesses
- Relational databases are not designed to run on multiple nodes (clusters)
- Favor vertical scaling
- Cannot cope with large volumes of data and operations (e.g., Big Data applications)

Weaknesses
- Mapping objects to tables is notoriously difficult (impedance mismatch)

NoSQL
- Practically, anything that deviates from traditional relational database systems (RDBMSs)
- Runs well on clusters
- Needs no schema (schema-free)
- Typically relaxes consistency

Data models
- Key-value: a simple hash table where all access is done via a key (Redis, Riak, Memcached); see the sketch after this list
- Document: the main concept is the document: self-describing, hierarchical data structures (JSON, BSON, XML, etc.); MongoDB, Couchbase, Terrastore, Lotus Notes
- Column family: an ordered collection of rows, each of which is an ordered collection of columns (Cassandra, HBase, SimpleDB)
- Graph: declarative, domain-specific query languages (Neo4j, Infinite Graph, FlockDB)
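To illustrate the key-value access pattern, a minimal sketch using plain Java's HashMap as a local stand-in for a key-value store (a real store such as Redis exposes the same get/put-by-key operations over the network; the keys below are made up):

    import java.util.HashMap;
    import java.util.Map;

    public class KeyValueSketch {
        public static void main(String[] args) {
            // All access goes through the key; the store never interprets the value.
            Map<String, String> store = new HashMap<>();
            store.put("user:42:name", "Ada");        // put(key, value)
            store.put("user:42:city", "Rennes");
            String name = store.get("user:42:name"); // get(key) -> value
            System.out.println(name);                // prints "Ada"
            store.remove("user:42:city");            // delete(key)
        }
    }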

Data models (cartoon on slide: "Stop following me!")

3. Processing: MapReduce and Hadoop

Origins: the problem
- Google faced the problem of analyzing huge sets of data (on the order of petabytes), e.g. PageRank, web access logs, etc.
- The algorithm to process the data can be reasonably simple
- But to finish it in an acceptable amount of time, the task must be split and forwarded to potentially thousands of machines

Origins: the problem
Programmers were forced to develop software that:
- Splits data
- Forwards data and code to participant nodes
- Checks node state to react to errors
- Retrieves and organizes results
Tedious, error-prone, time-consuming... and it had to be repeated for each problem

The solution: MapReduce
- MapReduce is an abstraction to organize parallelizable tasks
- The core idea behind MapReduce is mapping your data set into a collection of <key, value> pairs, and then reducing over all pairs with the same key
- The algorithm has to be adapted to fit MapReduce's two main steps: Map: data processing (collecting/grouping); Reduce: data collection and digesting (aggregate, filter, etc.)
- Procedural: the user has to state how to produce the answer
- The MapReduce framework takes care of data/code transport, node coordination, etc.

MapReduce at a glance

More specifically
Users implement the interface of two primary functions:
- map(k, v) -> <k', v'>*
- reduce(k', <v'>*) -> <k', v''>*
All v' with the same k' are reduced together, and processed in v' order.

Example 1: word count
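The worked example is a figure in the original deck. As a stand-in, a self-contained Java simulation of the three phases (map, shuffle, reduce) on an in-memory input; the phase structure follows the model above, and the input strings are made up:

    import java.util.*;

    public class WordCountSimulation {
        public static void main(String[] args) {
            List<String> documents = Arrays.asList("the quick brown fox", "the lazy dog");

            // Map phase: emit one <word, 1> pair per word.
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String doc : documents)
                for (String word : doc.split("\\s+"))
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

            // Shuffle phase: group values by key (the framework does this in real MapReduce).
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> p : pairs)
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

            // Reduce phase: sum the counts for each word.
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                System.out.println(e.getKey() + "\t" + sum); // e.g. "the  2"
            }
        }
    }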

Example 2: word length count
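Again the worked example is a figure; compared to word count, only the map key changes: emit the word's length instead of the word itself. A minimal sketch under that reading of the slide:

    import java.util.*;

    public class WordLengthCount {
        public static void main(String[] args) {
            List<String> docs = Arrays.asList("the quick brown fox", "the lazy dog");

            // Map + shuffle in one pass: emit <length(word), 1>, grouped by length.
            Map<Integer, List<Integer>> grouped = new TreeMap<>();
            for (String doc : docs)
                for (String word : doc.split("\\s+"))
                    grouped.computeIfAbsent(word.length(), k -> new ArrayList<>()).add(1);

            // Reduce: sum the 1s for each length -> number of words of that length.
            for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                System.out.println("words of length " + e.getKey() + ": " + sum);
            }
        }
    }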

Apache Hadoop

What is Hadoop?
- Hadoop is a top-level Apache project
- Open source implementation of MapReduce, developed in Java
- Platform for data storage and processing: scalable, fault tolerant, distributed
- Handles any type of complex data

Hadoop Ecosystem

HDFS: Distributed Storage System
- Files are split into 128 MB blocks
- Blocks are replicated across several DataNodes (usually 3)
- A single NameNode stores the metadata (file names, block locations, etc.)
- Optimized for large files and sequential reads
- Files are append-only
- Rack-aware
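A hedged sketch of basic HDFS usage with the Java FileSystem API (the NameNode address and paths below are hypothetical); the same operations are available from the hdfs dfs command-line tool:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);

            // Upload a local file; HDFS splits it into blocks and replicates each one.
            Path dst = new Path("/user/alice/genome.fastq");  // made-up path
            fs.copyFromLocalFile(new Path("/tmp/genome.fastq"), dst);

            // Inspect the replication factor recorded by the NameNode (usually 3).
            System.out.println("Replication: " + fs.getFileStatus(dst).getReplication());
            fs.close();
        }
    }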

Hadoop MapReduce
- Parallel processing for large datasets; relies on HDFS
- Master-slave architecture: the JobTracker schedules and manages jobs; TaskTrackers execute individual map() and reduce() tasks on each cluster node
- The JobTracker and NameNode, as well as the TaskTrackers and DataNodes, are colocated on the same machines

MapReduce Programming Model
- Every MapReduce program must specify a Mapper and typically a Reducer
- The Mapper has a map() function that transforms input (key, value) pairs into any number of intermediate (out_key, intermediate_value) pairs:
  void map(K1 key, V1 value, Context context)
- The Reducer has a reduce() function that transforms intermediate (out_key, list(intermediate_value)) aggregates into any number of (out_key, value) pairs:
  void reduce(K2 key, Iterable<V2> values, Context context)

Word Count Example in Hadoop
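The code itself appears as a figure on the slide; as a reference, here is the classic WordCount that ships with Hadoop as its standard example, lightly commented, implementing the Mapper and Reducer interfaces described above:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            // map: for every word in the input split, emit <word, 1>
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            // reduce: sum all counts received for the same word
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Input and output paths are passed on the command line; each reducer writes its share of the final <word, count> pairs to the output directory in HDFS.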

Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
- Automatic division of a job into tasks
- Automatic partitioning and distribution of data
- Automatic placement of computation near data
- Recovery from failures
Hadoop, an open source implementation of MapReduce, is enriched by many useful subprojects.
The user focuses on the application, not on the complexity of distributed computing.

Thank you! Questions? alexandru.costan@inria.fr

Readings
- Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press
- Tony Hey, Stewart Tansley, Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research
- Jeffrey Stanton, Introduction to Data Science, Syracuse University Press
- Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
- Pramod J. Sadalage, Martin Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pearson Education, 2012
- Eric Redmond, Jim Wilson, Seven Databases in Seven Weeks, Pragmatic Bookshelf, 2012

Hands-on lab (TP): http://bit.ly/1mopcxq