MapReduce everywhere Carsten Hufe & Michael Hausenblas
About Carsten Hufe. Big Data Consultant at comsysto (Vodafone, Telefonica/O2, Payback); Hadoop ecosystem, distributed systems; committer on JumboDB. Twitter: @devproof
About Michael Hausenblas. Chief Data Engineer at MapR, responsible for EMEA; background in large-scale data integration; using Hadoop and NoSQL since 2008; Apache Drill contributor; Big Data advocate (lambda-architecture.net, sparkstack.org)
Outline Hadoop & MapReduce introduction Experiences from 'SmartSteps' JumboDB Some examples for MapReduce Future and vision
Big Data processing: conventional data processing (RDBMS-based) is a special case of Big Data processing (think: Newtonian mechanics vs. relativity and quantum mechanics)
General observations. Analytics is becoming a critical component in business environments: base decisions on (a lot of) data. Principle: keep all data around and benefit from all of it, both human-generated (think: Excel sheets, CRM systems, etc.) and machine-generated (think: mobile phones, etc.). Pioneered at Google and Amazon.
First Principles Scaling out (horizontal) over scaling up (vertical) Commodity hardware Open Source software (Apache, etc.) Open, community-defined interfaces Schema on read Data locality
Schema on read vs. schema on write. Schema on write: established (experience exists); strong typing (validation etc. on the DB level); forces a fixed schema up-front; forces one correct view of the world; raw data is dismissed; less agile. Schema on read: flexible interpretation of the data at load time (agility); raw data stays around; allows unstructured, semi-structured and structured data; (typically) weak typing; schema handling on the app level.
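A toy sketch of what schema on read means in practice (field names and records are invented for illustration): the raw events stay exactly as they were ingested, and each query imposes its own interpretation at read time.

```python
import json

# Illustrative sketch: raw events stay untouched on storage; the "schema"
# is applied only when reading. Field names and records are made up.
raw_lines = [
    '{"user": "alice", "ts": "2014-01-20", "visits": 3}',
    '{"user": "bob", "ts": "2014-01-21"}',                   # missing field: ok
    '{"user": "carol", "ts": "2014-01-21", "extra": "x"}',   # new field: ok
]

def visits_per_user(lines):
    # One possible interpretation: a missing "visits" field counts as 0.
    return {rec["user"]: rec.get("visits", 0) for rec in map(json.loads, lines)}

print(visits_per_user(raw_lines))  # {'alice': 3, 'bob': 0, 'carol': 0}
```

A schema-on-write system would have rejected the second and third records at load time; here both interpretations of the world remain possible because the raw data survives.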
Data locality move processing (code) to the data rather than the other way round why?
Disk trends:

                 ~1990      ~2000      ~2010
disk capacity    2.1 GB     200 GB     3,000 GB   (~1400x)
price            $157/GB    $1/GB      $0.05/GB   (~3100x cheaper)
transfer rate    16 MB/s    56 MB/s    210 MB/s   (~13x)
time to read
the whole disk   ~2 min     ~58 min    ~4 h
RDBMS vs Hadoop

            RDBMS/MPP                      Hadoop
schema      on write                       on read, on write
workload    interactive                    batch (default), but interactive
                                           solutions emerging
interface   SQL                            core: MapReduce, but SQL-on-Hadoop
                                           solutions emerging
volume      GB++                           PB++
variety     ETL to tabular                 no restrictions
velocity    limited                        limited in stock Hadoop, but can be
                                           realised with frameworks like
                                           Kafka, Storm, etc.
agility     DBA/schema + ETL is the        very quick roll-outs and results
            main bottleneck
$$$/TB      >>20,000$                      <1,000$
'Simple algorithms and lots of data trump complex models' (Halevy, Norvig, and Pereira, IEEE Intelligent Systems). So combining data delivers better, more accurate results. But: how can I integrate all that data from my legacy applications? How do I keep all that data safe? How can I perform at that level of scale?
Distributed Storage Model: Google File System. Designed to run on massive clusters of cheap machines; tolerates hardware failure; paper published in 2003. Distributed Compute Model: MapReduce. Sends compute to the data on GFS, not vice versa; vastly simplifies distributed programming; paper published in 2004. Both run on commodity hardware, and costs scale linearly.
Distributed File System (HDFS) + MapReduce: runs on commodity hardware
Hadoop 101 Apache Hadoop is an open source software project that provides a major step toward meeting the big data challenge With Hadoop you can have thousands of disks on hundreds of machines with near linear scaling Uses commodity hardware, no need to purchase expensive or specialized hardware Handles Big Data, Petabytes and more
Hadoop History
Architecture MapReduce: Parallel computing Move the computation to the data Storage: Keeping track of data and metadata Data is sharded across the cluster Cluster management tools Applications and tools
Nature of MapReduce-able Problems Complex data Multiple data sources Lots of it Nature of Analysis Batch Processing Parallel Execution Data in distributed file system and computation close to data Analysis Applications Text mining Risk Assessment Pattern Recognition Sentiment Analysis Collaborative Filtering Prediction Models
Hadoop Distributed Filesystem
Hadoop Cluster: failures are expected and managed gracefully
HDFS NameNode Architecture. Data is conceptually record-oriented in the Hadoop programming framework. HDFS splits large data files into chunks (default size is 64 MB). Chunks are spread over multiple nodes in the cluster and replicated across it for fault tolerance (shared-nothing architecture). Chunks form a single namespace and are accessible universally. Moving computation to the data allows the Hadoop framework to achieve high data locality and avoid strain on network bandwidth. Although files are split into 64 MB or 128 MB blocks, a file smaller than that does not occupy the full 64 MB/128 MB. Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files. Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster. (Diagram: primary NameNode A and standby NameNode B above a row of DataNodes.)
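The chunking arithmetic above can be sketched in a few lines (the helper name is ours; 64 MB is the default block size named on this slide, and 3 is assumed as the common default HDFS replication factor):

```python
import math

def hdfs_blocks(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
    # How many chunks HDFS splits a file into, and how many block copies
    # the cluster stores in total. The last block only occupies its actual
    # size on disk, not a full 64 MB.
    n_blocks = max(1, math.ceil(file_size_bytes / block_size))
    return n_blocks, n_blocks * replication

# A 200 MB file: 4 blocks (64 + 64 + 64 + 8 MB), 12 block copies in total.
print(hdfs_blocks(200 * 1024 * 1024))  # (4, 12)
```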
The MapReduce Paradigm
(Diagram: data flows from SAN/server data sources into the Hadoop cluster; a MapReduce program runs on the cluster and produces the result.)
MapReduce To use Hadoop, a query is expressed as MapReduce jobs MapReduce is a batch process MapReduce accesses an entire dataset, in parallel, in order to reduce seeks In conventional programs, seek time is generally rate-limiting MapReduce is a streaming process that is not limited by seeks MapReduce tasks are pure functions, meaning they are stateless Pure functions have no side effects and thus can be run in any order Pure functions can even be run multiple times if necessary MapReduce jobs are divided into different phases Map tasks Shuffle phase Reduce tasks
Inside MapReduce: Input → Map → Shuffle and sort → Reduce → Output. Example input: '"The time has come," the Walrus said, / "To talk of many things: / Of shoes and ships and sealing-wax'. Map emits (the, 1), (time, 1), (has, 1), (come, 1), ...; shuffle and sort collates the values per key, e.g. come → [3, 2, 1], has → [1, 5, 2], the → [1, 2, 1], time → [10, 1, 3]; reduce sums them to the output: come → 6, has → 8, the → 4, time → 14.
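The same word-count flow can be sketched in a few lines of Python, with map, shuffle-and-sort, and reduce as separate pure functions (a local simulation of the paradigm, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair per word, independently for each record.
    for word in line.lower().split():
        yield word.strip('",.:'), 1

def shuffle(pairs):
    # Shuffle and sort: collate all values emitted under the same key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reducer(key, values):
    # Reduce: fold the collated values into one result per key.
    return key, sum(values)

lines = ['"The time has come," the Walrus said,',
         '"To talk of many things:',
         'Of shoes and ships and sealing-wax']
pairs = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
print(counts["the"], counts["and"])  # 2 2
```

Because mapper and reducer are pure functions, the framework is free to run them in any order, in multiple waves, or more than once, exactly as the following slides describe.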
MapReduce Key Phases: Map phase. Input files have been automatically broken into pieces. Data is read on each node using large I/O operations for efficiency. Mappers run locally to the data in this step, avoiding the need for most network traffic. Each input record is transformed by its mapper independently, so the transformations can all take place at the same time; if your cluster isn't big enough to run them all at once, it runs them in multiple waves. The output of a mapper is a key and a value; the MapReduce framework takes care of handling that output and sending it to the right place.
MapReduce Key Phases Shuffle Moves intermediate results to the reducers and collates Provides all communication between computing elements Rearranges data and involves network traffic Reduce Combines mapper outputs Computes final results Output is done using large writes Output of final reducers is stored to disk
What Happens in the Cluster? Disk I/O Highest during map phase when program is reading input data Another peak at end of MapReduce job when final output is written to disk by the reducers Network Shuffle rearranges data and involves large amounts of network traffic Memory Peak memory loads are typically during reduce phase Framework is merging map outputs, reducer is processing merged results Mapper may also have a memory usage peak
(Chart: disk I/O, network and memory load plotted over time across the Input → Map → Shuffle and sort → Reduce → Output phases.)
The Hadoop ecosystem
Hive Background. Started at Facebook. Data was collected by nightly cron jobs into an Oracle DB; ETL via hand-coded Python. Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that. Source: cc-licensed slide by Cloudera
Hive Data Model. Tables: typed columns (int, float, string, boolean); also list and map types (for JSON-like data). Partitions: for example, range-partition tables by date. Buckets: hash partitions within ranges (useful for sampling and join optimization). Source: cc-licensed slide by Cloudera
Hive Example. Hive looks similar to an SQL database. Relational join on two tables: a table of word counts from the Shakespeare collection and a table of word counts from the Bible.

SELECT s.word, s.freq, k.freq FROM shakespeare s
  JOIN bible k ON (s.word = k.word)
  WHERE s.freq >= 1 AND k.freq >= 1
  ORDER BY s.freq DESC LIMIT 10;

word  s.freq  k.freq
the   25848   62394
I     23031    8854
and   19671   38985
to    18038   13526
of    16700   34654
a     14170    8057
you   12702    2720
my    11297    4135
in    10797   12445
is     8882    6884
Pig Latin. Pig provides a higher-level language, Pig Latin, that: increases productivity (in one test, 10 lines of Pig Latin replaced 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin); opens the system to non-Java programmers; provides common operations like join, group, filter, sort. User Defined Functions are first-class citizens.
Pig Latin Script Example

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';

Pig slides adapted from Olston et al.
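For comparison, here is roughly what that script computes, sketched in plain Python over toy data (the data values are invented, and Canonicalize is stubbed as lowercasing the URL):

```python
# Toy data; Canonicalize is stubbed as lowercasing the URL.
visits = [("amy", "WWW.A.com", "8am"), ("amy", "www.b.com", "9am"),
          ("bob", "www.a.com", "1pm")]
pages = {"www.a.com": 0.9, "www.b.com": 0.4}

canonicalize = str.lower
visits = [(user, canonicalize(url), t) for user, url, t in visits]  # foreach
vp = [(user, url, t, pages[url]) for user, url, t in visits]        # join
grouped = {}                                                        # group
for user, _, _, pagerank in vp:
    grouped.setdefault(user, []).append(pagerank)
user_pageranks = {u: sum(prs) / len(prs) for u, prs in grouped.items()}
good_users = {u: avg for u, avg in user_pageranks.items() if avg > 0.5}
print(sorted(good_users))  # ['amy', 'bob']
```

Each Pig operator (foreach, join, group, filter) corresponds to one line here; on a cluster, Pig compiles the same pipeline down to MapReduce jobs.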
Other ways to write MapReduce jobs: Cascading*; Scalding (tuples, Scala); Cascalog (Clojure/Java); Crunch* (functional, Java); M/R frameworks for scripting languages such as Python, Ruby, etc.: http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html *) For details see David Whiting's excellent talk 'Scalding the Crunchy Pig for Cascading into the Hive', http://thewit.ch/shug/
Hadoop 2.0 / YARN. In a cluster there are resources (CPUs, RAM, disks) that need to be managed. In Hadoop 2.0, YARN replaces the MapReduce layer with a more general-purpose scheduler, allowing other types of workloads (e.g., graph databases, MPI) to run in addition to MapReduce jobs.
MapReduce everywhere? Hadoop MongoDB R Studio Java On-Demand Aggregation
Smart Steps Prototype. Analyze and visualize mobile data: footfalls, catchment, segmentation by socio-demographic characteristics. http://dynamicinsights.telefonica.com/488/smart-steps
Smart Steps
Smart Steps - challenges: provide a data pipeline; handle huge amounts of data that can be queried on demand; limited hardware resources; provide near real-time performance.
Smart Steps 1st iteration Web-Application (Java, Spring MVC) MongoDB 2.2 as storage MapReduce with MongoDB and JavaScript MongoDB Sharded
Sample MongoDB Document { "cellid": "12345", "date": "2014-01-20", "hour": 0, "visitors": 15000, "age": { "to10": 1111, "to20": 2222, "to30": 3333 }, "gender": { "male": 4444, "female": 5555 } }
Sample MongoDB MapReduce: sum of visitors over all cells per month

var mapFunction = function() {
  // getYearAndMonth is a helper defined elsewhere, e.g. yearAndMonth = "2014-01"
  var yearAndMonth = getYearAndMonth(this.date);
  emit(yearAndMonth, this.visitors);
};

var reduceFunction = function(yearAndMonth, visitors) {
  return Array.sum(visitors);
};

db.footfalls.mapReduce(mapFunction, reduceFunction, { out: "map_reduce_result" });
MongoDB MapReduce Result { "results": [ { "_id": "2014-01", "value": 11111 }, { "_id": "2014-02", "value": 12222 }, { "_id": "2014-03", "value": 13333 } ] }
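Outside of MongoDB, the same per-month aggregation is a plain group-and-sum; a local Python sketch over sample documents shaped like the footfall records above (the sample values are invented):

```python
from collections import defaultdict

# Sample documents shaped like the footfall records above (values invented).
docs = [
    {"cellid": "12345", "date": "2014-01-20", "visitors": 15000},
    {"cellid": "12345", "date": "2014-01-21", "visitors": 12000},
    {"cellid": "67890", "date": "2014-02-01", "visitors": 8000},
]

def visitors_per_month(documents):
    # Map each document to (year-month, visitors), then reduce by summing,
    # mirroring the MongoDB map/reduce pair on the previous slide.
    totals = defaultdict(int)
    for doc in documents:
        year_and_month = doc["date"][:7]   # e.g. "2014-01"
        totals[year_and_month] += doc["visitors"]
    return dict(totals)

print(visitors_per_month(docs))  # {'2014-01': 27000, '2014-02': 8000}
```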
Result: MapReduce in MongoDB was single-threaded per server instance (version 2.2); the JavaScript engine was slow; slow import; indexes must fit into memory; response times too long.
Smart Steps 2nd iteration Web-Application (Java, Spring MVC) MongoDB as storage MapReduce with Hadoop MongoDB Sharded
Result Slow import Indexes must fit in memory Response times too long Not blocked due to single-thread issues
Smart Steps 3rd iteration Web-Application (Java, Spring MVC) MemCached as storage MapReduce with Hadoop Multiple Memcached instances
Result: very fast import; entire dataset must fit into memory; very good response times; very expensive (many instances required); data is not persistent.
GAME CHANGED
Smart Steps Last iteration Budget reduced, not enough hardware available How to provide the same amount of data with the new budget? Reducing data will cause loss of user acceptance
Smart Steps Last iteration Web-Application (Java, Spring MVC) JumboDB as storage MapReduce with Hadoop One server instance for application and storage!
Result Very fast import Low memory footprint (less than 5% of index information) Very good response times Very cheap Provides data workflow and versioning
Final architecture. (Diagram: RAW events → calculate business aspects in the Hadoop ecosystem → aggregated data → calculate technical aspects (sort, index, compress) → precalculated database → binary copy → read from the Smart Steps reporting application.)
Benchmark comparison: JumboDB, MemCached, MongoDB

                    JumboDB                MongoDB               MemCached
setup               1 server; capacity:    1 server; capacity:   4 servers; capacity:
                    2 TB EBS (10 TB        2 TB EBS              4x70 GB RAM
                    with compression)
import of 70 GB     7 min 30 s             20 h 18 min           6 min 20 s
                    ~156 MB/s,             ~0.95 MB/s,           ~184 MB/s,
                    337,000 datasets/s     2,075 datasets/s      399,000 datasets/s
query with 40       1,220 ms (incl.        aborted after         2,336 ms (incl.
criteria (result:   data transfer          20 minutes            data transfer
133,623 datasets)   to the client)                               to the client)
conclusion          fast import, fast      slow import,          fast import, fast
                    querying, only one     querying not          querying, but 4
                    machine!               possible              machines
Cost comparison: storing 10 TB of data (server costs only, because EBS volume costs are hard to estimate; 1 server = 2,000$)

             JumboDB                 MongoDB                   MemCached
servers      1 server with           min. 7 servers with       143 servers with
             70 GB RAM               70 GB RAM                 70 GB RAM
capacity     2 TB EBS (>10 TB        6x2 TB EBS +              143x70 GB RAM
             with compression)       1x MongoS
storage      RAID EBS volume         RAID EBS with IOPS        no EBS volume
             with IOPS               (currently it was not
                                     possible to use Mongo
                                     with more than 500 GB!)
cost         2,000$ + EBS            14,000$ + EBS             286,000$, but no
             and IOPS                and IOPS                  extra EBS costs
conclusion   cheapest option         relatively cheap          expensive
Reasons for the good performance. The database is calculated in a distributed environment. Data is immutable, so no reorganisation is required during read operations. Data is pre-organised for the main use cases (e.g. sorted by geographic region, so it can be read sequentially). Data is compressed, which uses storage more effectively and speeds up reads.
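A toy sketch of that idea (not JumboDB's actual on-disk format): immutable records sorted by key and stored as compressed chunks, with only a small in-memory index of chunk start keys needed to serve a lookup.

```python
import bisect
import zlib

# Sketch of the design idea only, not JumboDB's real layout: immutable
# records, sorted by key, stored as compressed chunks; the in-memory index
# holds just one start key per chunk (a small fraction of the data).
records = sorted((f"key{i:05d}", f"value-{i}") for i in range(10_000))
CHUNK = 1000
chunks, start_keys = [], []
for i in range(0, len(records), CHUNK):
    part = records[i:i + CHUNK]
    start_keys.append(part[0][0])                    # tiny in-memory index
    payload = "\n".join(f"{k}\t{v}" for k, v in part)
    chunks.append(zlib.compress(payload.encode()))   # compressed "on disk"

def lookup(key):
    # Binary-search the index, then decompress one chunk and scan it
    # sequentially; no reorganisation of data is ever needed for reads.
    i = bisect.bisect_right(start_keys, key) - 1
    for line in zlib.decompress(chunks[i]).decode().splitlines():
        k, v = line.split("\t")
        if k == key:
            return v
    return None

print(lookup("key04321"))  # value-4321
```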
Github: https://github.com/comsysto/jumbodb Wiki: https://github.com/comsysto/jumbodb/wiki
MapReduce example: how to sum on a per-cell basis
What is a Geohash? Converts coordinates (lat/long) into a single hash value Invented by Gustavo Niemeyer
How does it work?
Example: London, Piccadilly Circus (lat 51.509964, long -0.134115). 24-bit precision: 011110101110101110111000. Integer value: 2062268416. Geohash string: u281z
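For reference, a standard geohash encoder can be sketched as below. Note that the slide's 24-bit integer and the string u281z come from the project's own encoding variant; this sketch implements the common base32 geohash instead, checked against the well-known reference point (57.64911, 10.40744) → u4pru...

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=5):
    # Standard geohash: alternately bisect the longitude and latitude ranges
    # (starting with longitude) and pack each run of 5 bits into one base32
    # character. More characters = more precision = a smaller grid cell.
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, n, chars, even = 0, 0, [], True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        n += 1
        if n == 5:
            chars.append(BASE32[bits])
            bits = n = 0
    return "".join(chars)

print(geohash(57.64911, 10.40744))  # u4pru
```

Because nearby points share a hash prefix, emitting a truncated geohash as the map key groups visits into grid cells of the chosen size, which is exactly what the next slide's MapReduce job does.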
MongoDB Geohash Example

var mapFunction = function() {
  // getGeohash24BitPrecision is a helper defined elsewhere
  var geohash = getGeohash24BitPrecision(this.latitude, this.longitude);
  // emit a count of 1 per visit: MongoDB may re-invoke reduce on partial
  // results, so reduce must sum counts (taking the array length would
  // undercount after a re-reduce)
  emit(geohash, 1);
};

var reduceFunction = function(geohash, counts) {
  return Array.sum(counts);
};

db.visits.mapReduce(mapFunction, reduceFunction, { out: "users_per_grid_cell" });
MongoDB MapReduce Result { "results": [ { "_id": "u281z", "value": 11111 }, { "_id": "u282b", "value": 12222 }, { "_id": "d567", "value": 13333 } ] }
Future MapReduce Spark Real-time (from Kafka to Storm) Lambda Architecture SQL on Hadoop Impala Apache Drill Presto
Thank you for your attention!
Smart Steps Workflow: deliveries flow from the data scientist through jumboDB to the reporting application. Version 1: 'Here is my first delivery with January data for Collection 1.' Version 2: 'Made some optimizations, the data should be better.' Version 3: 'There was a mistake in the latest delivery. I corrected it!' 'I have new February data and added a new collection, Collection 2. Please extend the January data with it.' One month later... Version 4: new February data for Collection 1 and Collection 2. Version 5: 'Made some optimizations to the February data.' Version 6: 'The data is much cooler!' 'DAMN! The latest delivery was faulty and I am not able to fix it quickly. Please roll back to Version 5.'
Smart Steps