Big Data Management in the Clouds
Alexandru Costan, IRISA / INSA Rennes (KerData team)
Cumulo NumBio 2015, Aussois, 4 June 2015
After this talk
- Realize the potential: Data vs. Big Data
- Understand why we need a different paradigm
- Recognize some of the main terminology
- Know the existing tools
Outline
- Big Data overview: sources of Big Data (YOU!)
- Storage: SQL vs. NoSQL
- Processing: Hadoop MapReduce
1. The Big Data Deluge
Context: the Big Data Deluge
- Deliver the capability to mine, search and analyze this data in near real time
- Science itself is evolving
(Credits: Microsoft)
The Data Science: Enable Discovery
The 4th Paradigm for Scientific Discovery:
- Thousand years ago: description of natural phenomena
- Last few hundred years: theoretical models, e.g. Newton's laws, Maxwell's equations, $(\dot{a}/a)^2 = 4\pi G \rho / 3 - K c^2 / a^2$
- Last few decades: simulation of complex phenomena
- Today and the future: unify theory, experiment and simulation with large multidisciplinary data, using data exploration and data mining (from instruments, sensors, humans), across distributed communities
(Credits: Dennis Gannon)
What is Big Data?
- "Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (McKinsey Global Institute)
- "Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Wikipedia)
How big is Big Data?
- Eric Schmidt (2011): "Every 2 days we create as much information as we did up to 2003", i.e. about 5 billion gigabytes (5 exabytes) of data
- In 2014, the same amount of data was created every 7 minutes
- Total size of the digital universe in 2014: 4.4 zettabytes
Big Data Units
Big picture of Big Data
Common features: the 3 Vs (Volume, Velocity, Variety)
- Unstructured data
- Produced in real time
- Arrive in streams or batches from geographically distributed sources
- Have metadata (location, day, hour, etc.)
- Heterogeneous sources (mobile phones, sensors, tablets, PCs, clusters)
- Arrive out of order and unpredictably
What is needed?
- Computation/storage power. Cloud computing: allows users to lease computing and storage resources in a pay-as-you-go manner
- Programming models. MapReduce: a simple yet scalable model for Big Data processing
To cloud or not to cloud my data?
Benefits:
- Control structure
- Illusion of unlimited resources
- No up-front commitment (pay as you go)
- On-demand
- (Very) short-term allocation
- Close to 100% transparency
- Increased platform independence
Core costs:
- Storage ($/MByte/year)
- Computing ($/CPU cycle)
- Networking ($/bit)
Reality is much more mundane!
Geographically distributed datacenters (Credits: Microsoft)
Data-intensive processing on clouds: where are we?
- Costs of outsourcing data to the cloud
- Computation-to-data latency is high!
- Scalable concurrent accesses to shared data
- Cloud storage used as an intermediary for data transfers
2. Storage: SQL vs. NoSQL
Relational databases
- Dominant model for the last 30 years
- Standard, easy-to-use, powerful query language (SQL): declarative, i.e. users state what they want, and the database internally assembles an algorithm and extracts the requested results (see the sketch below)
- Reliability and strong consistency in the presence of failures and concurrent access
- Support for transactions (ACID properties)
- Orthogonal to data representation and storage
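A minimal sketch of this declarative style, using JDBC from Java. The connection URL, credentials and the "patients" table are hypothetical, and a suitable JDBC driver is assumed to be on the classpath; the query only states what is wanted, while the engine decides how to execute it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DeclarativeQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and credentials; any JDBC-compliant RDBMS works.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/biodb", "user", "secret")) {
            // Declarative: we state WHAT we want (patients older than 60, grouped by
            // diagnosis); the database plans HOW to execute it (indexes, join order, etc.).
            String sql = "SELECT diagnosis, COUNT(*) AS n "
                       + "FROM patients WHERE age > ? GROUP BY diagnosis";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setInt(1, 60);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("diagnosis") + ": " + rs.getLong("n"));
                    }
                }
            }
        }
    }
}
```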
Weaknesses
- Relational databases are not designed to run on multiple nodes (clusters)
- Favor vertical scaling
- Cannot cope with large volumes of data and operations (e.g., Big Data applications)
Weaknesses
- Mapping objects to tables is notoriously difficult (impedance mismatch)
NoSQL
Practically, anything that deviates from traditional relational database systems (RDBMSs):
- Runs well on clusters
- Does not need a schema (schema-free)
- Typically relaxes consistency
Data models
- Key-value: a simple hash table where all access is done via a key (Redis, Riak, Memcached); see the sketch below
- Document: the main concept is the document; self-describing, hierarchical data structures (JSON, BSON, XML, etc.) (MongoDB, Couchbase, Terrastore, Lotus Notes)
- Column family: an ordered collection of rows, each of which is an ordered collection of columns (Cassandra, HBase, SimpleDB)
- Graph: declarative, domain-specific query languages (Neo4j, Infinite Graph, FlockDB)
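A minimal sketch of the key-value access pattern, using a plain in-memory Java map as a stand-in for a store like Redis or Riak; the keys and the JSON-style values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
    public static void main(String[] args) {
        // A plain in-memory map stands in for a key-value store:
        // the only way to reach a value is through its key -- no joins, no ad-hoc queries.
        Map<String, String> store = new HashMap<>();

        // Values are opaque to the store; here, JSON strings describing samples.
        store.put("sample:42", "{\"organism\":\"E. coli\",\"reads\":1200000}");
        store.put("sample:43", "{\"organism\":\"S. cerevisiae\",\"reads\":900000}");

        // Lookup is always by key; finding "all E. coli samples" would require
        // either a secondary index or scanning every entry.
        String json = store.get("sample:42");
        System.out.println(json);
    }
}
```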
3. Processing: MapReduce and Hadoop
Origins: the problem
- Google faced the problem of analyzing huge sets of data (on the order of petabytes), e.g. PageRank, web access logs, etc.
- The algorithm to process the data can be reasonably simple
- But to finish in an acceptable amount of time, the task must be split and forwarded to potentially thousands of machines
Origins: the problem
Programmers were forced to develop software that:
- Splits the data
- Forwards data and code to the participating nodes
- Checks node state to react to errors
- Retrieves and organizes results
Tedious, error-prone, time-consuming... and it had to be repeated for each problem
The solution: MapReduce
- MapReduce is an abstraction to organize parallelizable tasks
- The core idea: map your data set into a collection of <key, value> pairs, then reduce over all pairs with the same key
- The algorithm has to be adapted to fit MapReduce's two main steps:
  - Map: data processing (collecting / grouping)
  - Reduce: data collection and digesting (aggregate, filter, etc.)
- Procedural: the user has to state how to produce the answer
- The MapReduce framework takes care of data/code transport, node coordination, etc.
MapReduce at a glance
More specifically
Users implement the interface of two primary functions:
- map(k, v) → <k', v'>*
- reduce(k', <v'>*) → <k', v''>*
All v' with the same k' are reduced together, and processed in v' order.
Example 1: word count
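The example was shown as a figure; below is a minimal, single-machine sketch of the same flow in plain Java, with hypothetical input lines. map emits <word, 1> pairs, the grouping step plays the role of the shuffle done by the framework, and reduce sums the counts per word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Map step: emit a <word, 1> pair for every word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum all the 1s emitted for a given word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog", "the fox");

        // Shuffle step (done by the framework in a real system): group values by key.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```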
Example 2: word length count
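The same pattern with a different key: a compact plain-Java sketch (hypothetical input) where map emits the word's length rather than the word itself, so the result counts how many words there are of each length.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordLengthCountSketch {
    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "jumps over the lazy dog");

        // Same MapReduce shape as word count; only the key changes:
        // map emits <length(word), 1>, reduce sums the 1s per length.
        Map<Integer, Long> lengthCounts = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\W+")))   // "map" over words
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(String::length,        // group by key = length
                         Collectors.counting()));                     // "reduce" = count

        lengthCounts.forEach((len, n) -> System.out.println(len + " -> " + n));
    }
}
```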
Apache Hadoop
What is Hadoop?
- A top-level Apache project
- Open-source implementation of MapReduce, developed in Java
- Platform for data storage and processing: scalable, fault tolerant, distributed
- Handles any type of complex data
Hadoop Ecosystem
HDFS: Distributed Storage System
- Files split into 128 MB blocks
- Blocks replicated across several DataNodes (usually 3)
- A single NameNode stores metadata (file names, block locations, etc.)
- Optimized for large files and sequential reads
- Files are append-only
- Rack-aware
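A minimal sketch of an HDFS client in Java using the org.apache.hadoop.fs API. The NameNode address, the file path and the written content are assumptions, and appending requires the cluster to allow appends; the client only sees a byte stream, while block splitting and replication happen behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Configuration is normally read from core-site.xml / hdfs-site.xml;
        // the NameNode address and the path below are assumptions.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/reads.txt");

        // The client writes a stream of bytes; HDFS splits it into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("ACGTACGT\n");
        }

        // Files are append-only: existing bytes cannot be overwritten,
        // but new data can be added at the end (if appends are enabled).
        try (FSDataOutputStream out = fs.append(path)) {
            out.writeBytes("TTGGCCAA\n");
        }
    }
}
```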
Hadoop MapReduce
- Parallel processing for large datasets
- Relies on HDFS
- Master-slave architecture:
  - the JobTracker schedules and manages jobs
  - TaskTrackers execute individual map() and reduce() tasks on each cluster node
- The JobTracker and the NameNode, as well as the TaskTrackers and the DataNodes, are placed on the same machines
MapReduce Programming Model
- Every MapReduce program must specify a Mapper and typically a Reducer
- The Mapper has a map() function that transforms input (key, value) pairs into any number of intermediate (out_key, intermediate_value) pairs:
  void map(K1 key, V1 value, Context context)
- The Reducer has a reduce() function that transforms intermediate (out_key, list(intermediate_value)) aggregates into any number of (out_key, value) pairs:
  void reduce(K2 key, Iterable<V2> values, Context context)
Word count example in Hadoop
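The slide showed this example as a figure; the sketch below is modeled on the standard Apache Hadoop WordCount example (new mapreduce API). Input and output paths are passed on the command line; the combiner is an optional optimization that pre-aggregates counts on the map side.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every word in the input line, emit <word, 1>.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the 1s emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```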
Takeaways
- By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
  - automatic division of a job into tasks
  - automatic partitioning and distribution of data
  - automatic placement of computation near data
  - recovery from failures
- Hadoop, an open-source implementation of MapReduce, is enriched by many useful subprojects
- The user focuses on the application, not on the complexity of distributed computing
Thank you! Questions? alexandru.costan@inria.fr
Readings
- Anand Rajaraman, Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press.
- Tony Hey, Stewart Tansley, Kristin Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.
- Jeffrey Stanton. Introduction to Data Science. Syracuse University Press.
- Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
- Pramod J. Sadalage, Martin Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, 2012.
- Eric Redmond, Jim Wilson. Seven Databases in Seven Weeks. Pragmatic Bookshelf, 2012.
Hands-on session (TP): http://bit.ly/1mopcxq