MapReduce everywhere Carsten Hufe & Michael Hausenblas
About Carsten Hufe. Big Data Consultant at comsysto (Vodafone, Telefonica/O2, Payback); Hadoop ecosystem, distributed systems; committer on JumboDB. Twitter: @devproof
About Michael Hausenblas. Chief Data Engineer at MapR, responsible for EMEA; background in large-scale data integration; using Hadoop and NoSQL since 2008; Apache Drill contributor; Big Data advocate (lambda-architecture.net, sparkstack.org)
Outline Hadoop & MapReduce introduction Experiences from 'SmartSteps' JumboDB Some examples for MapReduce Future and vision
Big Data processing: conventional data processing (RDBMS-based) is a special case of Big Data processing (think: Newtonian mechanics vs. relativity and quantum mechanics)
General observations. Analytics is becoming a critical component in business environments: base decisions on (a lot of) data. Principle: keep all data around and benefit from all of it, both human-generated (think: Excel sheets, CRM systems, etc.) and machine-generated (think: mobile phones, etc.). Pioneered at Google and Amazon.
First Principles Scaling out (horizontal) over scaling up (vertical) Commodity hardware Open Source software (Apache, etc.) Open, community-defined interfaces Schema on read Data locality
Schema on read vs. schema on write. Schema on write: established (experience exists); strong typing (validation etc. on the DB level); forces a fixed schema up-front; forces one correct view of the world; raw data is dismissed; less agile. Schema on read: flexible interpretation of the data at load time (agility); raw data stays around; allows unstructured, semi-structured and structured data; (typically) weak typing; schema handling on the app level.
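A toy sketch of what schema on read means in practice (field names and records are invented for illustration): the raw events stay exactly as they were ingested, and each query imposes its own interpretation at read time.

```python
import json

# Illustrative sketch: raw events stay untouched on storage; the "schema"
# is applied only when reading. Field names and records are made up.
raw_lines = [
    '{"user": "alice", "ts": "2014-01-20", "visits": 3}',
    '{"user": "bob", "ts": "2014-01-21"}',                   # missing field: ok
    '{"user": "carol", "ts": "2014-01-21", "extra": "x"}',   # new field: ok
]

def visits_per_user(lines):
    # One possible interpretation: a missing "visits" field counts as 0.
    return {rec["user"]: rec.get("visits", 0) for rec in map(json.loads, lines)}

print(visits_per_user(raw_lines))  # {'alice': 3, 'bob': 0, 'carol': 0}
```

A schema-on-write system would have rejected the second and third records at load time; here both interpretations of the world remain possible because the raw data survives.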
Data locality move processing (code) to the data rather than the other way round why?
Disk trends:

                 ~1990      ~2000      ~2010
disk capacity    2.1 GB     200 GB     3,000 GB   (~1400x)
price            $157/GB    $1/GB      $0.05/GB   (~3100x cheaper)
transfer rate    16 MB/s    56 MB/s    210 MB/s   (~13x)
time to read
the whole disk   ~2 min     ~58 min    ~4 h
RDBMS vs Hadoop

            RDBMS/MPP                      Hadoop
schema      on write                       on read, on write
workload    interactive                    batch (default), but interactive
                                           solutions emerging
interface   SQL                            core: MapReduce, but SQL-on-Hadoop
                                           solutions emerging
volume      GB++                           PB++
variety     ETL to tabular                 no restrictions
velocity    limited                        limited in stock Hadoop, but can be
                                           realised with frameworks like
                                           Kafka, Storm, etc.
agility     DBA/schema + ETL is the        very quick roll-outs and results
            main bottleneck
$$$/TB      >>20,000$                      <1,000$
'Simple algorithms and lots of data trump complex models' (Halevy, Norvig, and Pereira, IEEE Intelligent Systems). So combining data delivers better, more accurate results. But: how can I integrate all that data from my legacy applications? How do I keep all that data safe? How can I perform at that level of scale?
Distributed Storage Model: Google File System. Designed to run on massive clusters of cheap machines; tolerates hardware failure; paper published in 2003. Distributed Compute Model: MapReduce. Sends compute to the data on GFS, not vice versa; vastly simplifies distributed programming; paper published in 2004. Both run on commodity hardware, and costs scale linearly.
Distributed File System (HDFS) + MapReduce: runs on commodity hardware
Hadoop 101 Apache Hadoop is an open source software project that provides a major step toward meeting the big data challenge With Hadoop you can have thousands of disks on hundreds of machines with near linear scaling Uses commodity hardware, no need to purchase expensive or specialized hardware Handles Big Data, Petabytes and more
Hadoop History
Architecture MapReduce: Parallel computing Move the computation to the data Storage: Keeping track of data and metadata Data is sharded across the cluster Cluster management tools Applications and tools
Nature of MapReduce-able Problems Complex data Multiple data sources Lots of it Nature of Analysis Batch Processing Parallel Execution Data in distributed file system and computation close to data Analysis Applications Text mining Risk Assessment Pattern Recognition Sentiment Analysis Collaborative Filtering Prediction Models
Hadoop Distributed Filesystem
Hadoop Cluster: failures are expected and managed gracefully
HDFS NameNode Architecture. Data is conceptually record-oriented in the Hadoop programming framework. HDFS splits large data files into chunks (default size is 64 MB). Chunks are spread over multiple nodes in the cluster and replicated across it for fault tolerance (shared-nothing architecture). Chunks form a single namespace and are accessible universally. Moving computation to the data allows the Hadoop framework to achieve high data locality and avoid strain on network bandwidth. Although files are split into 64 MB or 128 MB blocks, a file smaller than that does not occupy the full 64 MB/128 MB. Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files. Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster. (Diagram: primary NameNode A and standby NameNode B above a row of DataNodes.)
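The chunking arithmetic above can be sketched in a few lines (the helper name is ours; 64 MB is the default block size named on this slide, and 3 is assumed as the common default HDFS replication factor):

```python
import math

def hdfs_blocks(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
    # How many chunks HDFS splits a file into, and how many block copies
    # the cluster stores in total. The last block only occupies its actual
    # size on disk, not a full 64 MB.
    n_blocks = max(1, math.ceil(file_size_bytes / block_size))
    return n_blocks, n_blocks * replication

# A 200 MB file: 4 blocks (64 + 64 + 64 + 8 MB), 12 block copies in total.
print(hdfs_blocks(200 * 1024 * 1024))  # (4, 12)
```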
The MapReduce Paradigm
(Diagram: data flows from SAN/server data sources into the Hadoop cluster; a MapReduce program runs on the cluster and produces the result.)
MapReduce To use Hadoop, a query is expressed as MapReduce jobs MapReduce is a batch process MapReduce accesses an entire dataset, in parallel, in order to reduce seeks In conventional programs, seek time is generally rate-limiting MapReduce is a streaming process that is not limited by seeks MapReduce tasks are pure functions, meaning they are stateless Pure functions have no side effects and thus can be run in any order Pure functions can even be run multiple times if necessary MapReduce jobs are divided into different phases Map tasks Shuffle phase Reduce tasks
Inside MapReduce: Input → Map → Shuffle and sort → Reduce → Output. Example input: '"The time has come," the Walrus said, / "To talk of many things: / Of shoes and ships and sealing-wax'. Map emits (the, 1), (time, 1), (has, 1), (come, 1), ...; shuffle and sort collates the values per key, e.g. come → [3, 2, 1], has → [1, 5, 2], the → [1, 2, 1], time → [10, 1, 3]; reduce sums them to the output: come → 6, has → 8, the → 4, time → 14.
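The same word-count flow can be sketched in a few lines of Python, with map, shuffle-and-sort, and reduce as separate pure functions (a local simulation of the paradigm, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair per word, independently for each record.
    for word in line.lower().split():
        yield word.strip('",.:'), 1

def shuffle(pairs):
    # Shuffle and sort: collate all values emitted under the same key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reducer(key, values):
    # Reduce: fold the collated values into one result per key.
    return key, sum(values)

lines = ['"The time has come," the Walrus said,',
         '"To talk of many things:',
         'Of shoes and ships and sealing-wax']
pairs = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
print(counts["the"], counts["and"])  # 2 2
```

Because mapper and reducer are pure functions, the framework is free to run them in any order, in multiple waves, or more than once, exactly as the following slides describe.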
MapReduce Key Phases: Map phase. Input files have been automatically broken into pieces. Data is read on each node using large I/O operations for efficiency. Mappers run locally to the data in this step, avoiding the need for most network traffic. Each input record is transformed by its mapper independently, so the transformations can all take place at the same time; if your cluster isn't big enough to run them all at once, it runs them in multiple waves. The output of a mapper is a key and a value; the MapReduce framework takes care of handling that output and sending it to the right place.
MapReduce Key Phases Shuffle Moves intermediate results to the reducers and collates Provides all communication between computing elements Rearranges data and involves network traffic Reduce Combines mapper outputs Computes final results Output is done using large writes Output of final reducers is stored to disk
What Happens in the Cluster? Disk I/O Highest during map phase when program is reading input data Another peak at end of MapReduce job when final output is written to disk by the reducers Network Shuffle rearranges data and involves large amounts of network traffic Memory Peak memory loads are typically during reduce phase Framework is merging map outputs, reducer is processing merged results Mapper may also have a memory usage peak
(Chart: disk I/O, network and memory load plotted over time across the Input → Map → Shuffle and sort → Reduce → Output phases.)
The Hadoop ecosystem
Hive Background. Started at Facebook. Data was collected by nightly cron jobs into an Oracle DB; ETL via hand-coded Python. Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that. Source: cc-licensed slide by Cloudera
Hive Data Model. Tables: typed columns (int, float, string, boolean); also list and map types (for JSON-like data). Partitions: for example, range-partition tables by date. Buckets: hash partitions within ranges (useful for sampling and join optimization). Source: cc-licensed slide by Cloudera
Hive Example. Hive looks similar to an SQL database. Relational join on two tables: a table of word counts from the Shakespeare collection and a table of word counts from the Bible.

SELECT s.word, s.freq, k.freq FROM shakespeare s
  JOIN bible k ON (s.word = k.word)
  WHERE s.freq >= 1 AND k.freq >= 1
  ORDER BY s.freq DESC LIMIT 10;

word  s.freq  k.freq
the   25848   62394
I     23031    8854
and   19671   38985
to    18038   13526
of    16700   34654
a     14170    8057
you   12702    2720
my    11297    4135
in    10797   12445
is     8882    6884
Pig Latin. Pig provides a higher-level language, Pig Latin, that: increases productivity (in one test, 10 lines of Pig Latin replaced 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin); opens the system to non-Java programmers; provides common operations like join, group, filter, sort. User Defined Functions are first-class citizens.
Pig Latin Script Example

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';

Pig slides adapted from Olston et al.
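For comparison, here is roughly what that script computes, sketched in plain Python over toy data (the data values are invented, and Canonicalize is stubbed as lowercasing the URL):

```python
# Toy data; Canonicalize is stubbed as lowercasing the URL.
visits = [("amy", "WWW.A.com", "8am"), ("amy", "www.b.com", "9am"),
          ("bob", "www.a.com", "1pm")]
pages = {"www.a.com": 0.9, "www.b.com": 0.4}

canonicalize = str.lower
visits = [(user, canonicalize(url), t) for user, url, t in visits]  # foreach
vp = [(user, url, t, pages[url]) for user, url, t in visits]        # join
grouped = {}                                                        # group
for user, _, _, pagerank in vp:
    grouped.setdefault(user, []).append(pagerank)
user_pageranks = {u: sum(prs) / len(prs) for u, prs in grouped.items()}
good_users = {u: avg for u, avg in user_pageranks.items() if avg > 0.5}
print(sorted(good_users))  # ['amy', 'bob']
```

Each Pig operator (foreach, join, group, filter) corresponds to one line here; on a cluster, Pig compiles the same pipeline down to MapReduce jobs.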
Other ways to write MapReduce jobs: Cascading*; Scalding (tuples, Scala); Cascalog (Clojure/Java); Crunch* (functional, Java); M/R frameworks for scripting languages such as Python, Ruby, etc.: http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html *) For details see David Whiting's excellent talk 'Scalding the Crunchy Pig for Cascading into the Hive', http://thewit.ch/shug/
Hadoop 2.0 / YARN. In a cluster there are resources (CPUs, RAM, disks) that need to be managed. In Hadoop 2.0, YARN replaces the MapReduce layer with a more general-purpose scheduler, allowing other types of workloads (e.g., graph databases, MPI) to run in addition to MapReduce jobs.
MapReduce everywhere? Hadoop MongoDB R Studio Java On-Demand Aggregation
Smart Steps Prototype. Analyze and visualize mobile data: footfalls, catchment, segmentation by socio-demographic characteristics. http://dynamicinsights.telefonica.com/488/smart-steps
Smart Steps
Smart Steps - challenges: provide a data pipeline; handle huge amounts of data that can be queried on demand; limited hardware resources; provide near real-time performance.
Smart Steps 1st iteration Web-Application (Java, Spring MVC) MongoDB 2.2 as storage MapReduce with MongoDB and JavaScript MongoDB Sharded
Sample MongoDB Document { "cellid": "12345", "date": "2014-01-20", "hour": 0, "visitors": 15000, "age": { "to10": 1111, "to20": 2222, "to30": 3333 }, "gender": { "male": 4444, "female": 5555 } }
Sample MongoDB MapReduce: sum of visitors over all cells per month

var mapFunction = function() {
  // getYearAndMonth is a helper defined elsewhere, e.g. yearAndMonth = "2014-01"
  var yearAndMonth = getYearAndMonth(this.date);
  emit(yearAndMonth, this.visitors);
};

var reduceFunction = function(yearAndMonth, visitors) {
  return Array.sum(visitors);
};

db.footfalls.mapReduce(mapFunction, reduceFunction, { out: "map_reduce_result" });
MongoDB MapReduce Result { "results": [ { "_id": "2014-01", "value": 11111 }, { "_id": "2014-02", "value": 12222 }, { "_id": "2014-03", "value": 13333 } ] }
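Outside of MongoDB, the same per-month aggregation is a plain group-and-sum; a local Python sketch over sample documents shaped like the footfall records above (the sample values are invented):

```python
from collections import defaultdict

# Sample documents shaped like the footfall records above (values invented).
docs = [
    {"cellid": "12345", "date": "2014-01-20", "visitors": 15000},
    {"cellid": "12345", "date": "2014-01-21", "visitors": 12000},
    {"cellid": "67890", "date": "2014-02-01", "visitors": 8000},
]

def visitors_per_month(documents):
    # Map each document to (year-month, visitors), then reduce by summing,
    # mirroring the MongoDB map/reduce pair on the previous slide.
    totals = defaultdict(int)
    for doc in documents:
        year_and_month = doc["date"][:7]   # e.g. "2014-01"
        totals[year_and_month] += doc["visitors"]
    return dict(totals)

print(visitors_per_month(docs))  # {'2014-01': 27000, '2014-02': 8000}
```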
Result: MapReduce in MongoDB was single-threaded per server instance (version 2.2); the JavaScript engine was slow; slow import; indexes must fit into memory; response times too long.
Smart Steps 2nd iteration Web-Application (Java, Spring MVC) MongoDB as storage MapReduce with Hadoop MongoDB Sharded
Result Slow import Indexes must fit in memory Response times too long Not blocked due to single-thread issues
Smart Steps 3rd iteration Web-Application (Java, Spring MVC) MemCached as storage MapReduce with Hadoop Multiple Memcached instances
Result: very fast import; entire dataset must fit into memory; very good response times; very expensive (many instances required); data is not persistent.
GAME CHANGED
Smart Steps Last iteration Budget reduced, not enough hardware available How to provide the same amount of data with the new budget? Reducing data will cause loss of user acceptance
Smart Steps Last iteration Web-Application (Java, Spring MVC) JumboDB as storage MapReduce with Hadoop One server instance for application and storage!
Result Very fast import Low memory footprint (less than 5% of index information) Very good response times Very cheap Provides data workflow and versioning
Final architecture. (Diagram: RAW events → calculate business aspects in the Hadoop ecosystem → aggregated data → calculate technical aspects (sort, index, compress) → precalculated database → binary copy → read from the Smart Steps reporting application.)
Benchmark comparison: JumboDB, MemCached, MongoDB

                    JumboDB                MongoDB               MemCached
setup               1 server; capacity:    1 server; capacity:   4 servers; capacity:
                    2 TB EBS (10 TB        2 TB EBS              4x70 GB RAM
                    with compression)
import of 70 GB     7 min 30 s             20 h 18 min           6 min 20 s
                    ~156 MB/s,             ~0.95 MB/s,           ~184 MB/s,
                    337,000 datasets/s     2,075 datasets/s      399,000 datasets/s
query with 40       1,220 ms (incl.        aborted after         2,336 ms (incl.
criteria (result:   data transfer          20 minutes            data transfer
133,623 datasets)   to the client)                               to the client)
conclusion          fast import, fast      slow import,          fast import, fast
                    querying, only one     querying not          querying, but 4
                    machine!               possible              machines
Cost comparison: storing 10 TB of data (server costs only, because EBS volume costs are hard to estimate; 1 server = 2,000$)

             JumboDB                 MongoDB                   MemCached
servers      1 server with           min. 7 servers with       143 servers with
             70 GB RAM               70 GB RAM                 70 GB RAM
capacity     2 TB EBS (>10 TB        6x2 TB EBS +              143x70 GB RAM
             with compression)       1x MongoS
storage      RAID EBS volume         RAID EBS with IOPS        no EBS volume
             with IOPS               (currently it was not
                                     possible to use Mongo
                                     with more than 500 GB!)
cost         2,000$ + EBS            14,000$ + EBS             286,000$, but no
             and IOPS                and IOPS                  extra EBS costs
conclusion   cheapest option         relatively cheap          expensive
Reasons for the good performance. The database is calculated in a distributed environment. Data is immutable, so no reorganisation is required during read operations. Data is pre-organised for the main use cases (e.g. sorted by geographic region, so it can be read sequentially). Data is compressed, which uses storage more effectively and speeds up reads.
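A toy sketch of that idea (not JumboDB's actual on-disk format): immutable records sorted by key and stored as compressed chunks, with only a small in-memory index of chunk start keys needed to serve a lookup.

```python
import bisect
import zlib

# Sketch of the design idea only, not JumboDB's real layout: immutable
# records, sorted by key, stored as compressed chunks; the in-memory index
# holds just one start key per chunk (a small fraction of the data).
records = sorted((f"key{i:05d}", f"value-{i}") for i in range(10_000))
CHUNK = 1000
chunks, start_keys = [], []
for i in range(0, len(records), CHUNK):
    part = records[i:i + CHUNK]
    start_keys.append(part[0][0])                    # tiny in-memory index
    payload = "\n".join(f"{k}\t{v}" for k, v in part)
    chunks.append(zlib.compress(payload.encode()))   # compressed "on disk"

def lookup(key):
    # Binary-search the index, then decompress one chunk and scan it
    # sequentially; no reorganisation of data is ever needed for reads.
    i = bisect.bisect_right(start_keys, key) - 1
    for line in zlib.decompress(chunks[i]).decode().splitlines():
        k, v = line.split("\t")
        if k == key:
            return v
    return None

print(lookup("key04321"))  # value-4321
```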
Github: https://github.com/comsysto/jumbodb Wiki: https://github.com/comsysto/jumbodb/wiki
MapReduce example: how to sum on a per-cell basis
What is a Geohash? Converts coordinates (lat/long) into a single hash value Invented by Gustavo Niemeyer
How does it work?
Example: London, Piccadilly Circus (lat 51.509964, long -0.134115). 24-bit precision: 011110101110101110111000. Integer value: 2062268416. Geohash string: u281z
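For reference, a standard geohash encoder can be sketched as below. Note that the slide's 24-bit integer and the string u281z come from the project's own encoding variant; this sketch implements the common base32 geohash instead, checked against the well-known reference point (57.64911, 10.40744) → u4pru...

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=5):
    # Standard geohash: alternately bisect the longitude and latitude ranges
    # (starting with longitude) and pack each run of 5 bits into one base32
    # character. More characters = more precision = a smaller grid cell.
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, n, chars, even = 0, 0, [], True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        n += 1
        if n == 5:
            chars.append(BASE32[bits])
            bits = n = 0
    return "".join(chars)

print(geohash(57.64911, 10.40744))  # u4pru
```

Because nearby points share a hash prefix, emitting a truncated geohash as the map key groups visits into grid cells of the chosen size, which is exactly what the next slide's MapReduce job does.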
MongoDB Geohash Example

var mapFunction = function() {
  // getGeohash24BitPrecision is a helper defined elsewhere
  var geohash = getGeohash24BitPrecision(this.latitude, this.longitude);
  // emit a count of 1 per visit: MongoDB may re-invoke reduce on partial
  // results, so reduce must sum counts (taking the array length would
  // undercount after a re-reduce)
  emit(geohash, 1);
};

var reduceFunction = function(geohash, counts) {
  return Array.sum(counts);
};

db.visits.mapReduce(mapFunction, reduceFunction, { out: "users_per_grid_cell" });
MongoDB MapReduce Result { "results": [ { "_id": "u281z", "value": 11111 }, { "_id": "u282b", "value": 12222 }, { "_id": "d567", "value": 13333 } ] }
Future MapReduce Spark Real-time (from Kafka to Storm) Lambda Architecture SQL on Hadoop Impala Apache Drill Presto
Thank you for your attention!
Smart Steps Workflow: deliveries flow from the data scientist through jumboDB to the reporting application. Version 1: 'Here is my first delivery with January data for Collection 1.' Version 2: 'Made some optimizations, the data should be better.' Version 3: 'There was a mistake in the latest delivery. I corrected it!' 'I have new February data and added a new collection, Collection 2. Please extend the January data with it.' One month later... Version 4: new February data for Collection 1 and Collection 2. Version 5: 'Made some optimizations to the February data.' Version 6: 'The data is much cooler!' 'DAMN! The latest delivery was faulty and I am not able to fix it quickly. Please roll back to Version 5.'
Smart Steps