MapReduce everywhere. Carsten Hufe & Michael Hausenblas
1 MapReduce everywhere Carsten Hufe & Michael Hausenblas
2 About Carsten Hufe Big Data Consultant at comsysto Vodafone Telefonica / o2 Payback Hadoop Ecosystem, Distributed systems Committer of JumboDB
3 About Michael Hausenblas Chief Data Engineer at MapR, responsible for EMEA Background in large-scale data integration Using Hadoop and NoSQL since 2008 Apache Drill contributor Big Data advocate (lambda-architecture.net, sparkstack.org)
4 Outline Hadoop & MapReduce introduction Experiences from 'SmartSteps' JumboDB Some examples for MapReduce Future and vision
5 Big Data processing Conventional data processing (RDBMS-based) is a special case of Big Data processing (think: Newton's mechanics vs. relativity and quantum mechanics)
6 General observations Analytics becoming a critical component in business environments Base decisions on (a lot of) data Principle: keep all data around benefit from all data Human generated (think: Excel sheet, CRM system, etc.) Machine generated (think: mobile phone, etc.) Pioneered at Google and Amazon
7 First Principles Scaling out (horizontal) over scaling up (vertical) Commodity hardware Open Source software (Apache, etc.) Open, community-defined interfaces Schema on read Data locality
8 Schema on read
Schema on write: established (experience exists); strong typing (validations etc. on DB level); forces a fixed schema up-front; forces one correct view of the world; raw data is dismissed; less agile.
Schema on read: flexible interpretation of the data at load time (agility); raw data stays around; allows for unstructured, semi-structured and structured data; (typically) weak typing; schema handling at app level.
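A minimal sketch of the schema-on-read idea in Python (the event fields and consumers are hypothetical): the raw records stay untouched on disk, and each reader applies its own schema at read time instead of validating a fixed schema at write time.

```python
import json

# Raw events stay around untouched; each consumer interprets them
# with its own schema when reading. Field names are hypothetical.
raw_events = [
    '{"user": "alice", "url": "/home", "ts": 1400000000}',
    '{"user": "bob", "url": "/cart", "ts": 1400000060, "referrer": "/home"}',
]

def read_with_schema(lines, fields):
    """Interpret raw records against a caller-chosen schema at read time.
    Missing fields become None instead of failing write-time validation."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers, two schemas, one raw dataset:
clicks = list(read_with_schema(raw_events, ["user", "url"]))
referrals = list(read_with_schema(raw_events, ["user", "referrer"]))
```

The agility claimed on the slide is visible here: adding the `referrer` field later required no migration of the existing raw data.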
9 Data locality move processing (code) to the data rather than the other way round why?
10 Hard disk trends
~1990: capacity 2.1 GB, price $157/GB, transfer rate 16 MB/s, about 2 min to read the whole disk
~2000: capacity 200 GB, price $1/GB, transfer rate 56 MB/s, about 58 min to read the whole disk
~2010: capacity 3,000 GB, price $0.05/GB, transfer rate 210 MB/s, about 4 h to read the whole disk
11 RDBMS vs Hadoop
Schema: RDBMS/MPP on write; Hadoop on read (on write also possible)
Workload: RDBMS interactive; Hadoop batch by default, but interactive solutions emerging
Interface: RDBMS SQL; Hadoop core is MapReduce, but SQL-on-Hadoop solutions emerging
Volume: RDBMS GB++; Hadoop PB++
Variety: RDBMS ETL to tabular; Hadoop no restrictions
Velocity: RDBMS limited; stock Hadoop also limited, but can be realised with frameworks like Kafka, Storm, etc.
Agility: RDBMS DBA/schema + ETL is the main bottleneck; Hadoop very quick roll-outs and results
$$$/TB: RDBMS >$20,000; Hadoop <$1,000
12 "Simple algorithms and lots of data trump complex models" Halevy, Norvig, and Pereira (Google), "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009. So combining data delivers better, more accurate results. But how can I integrate all that data from my legacy applications? How do I keep all that data safe? How can I perform at that level of scale?
13 Distributed Storage Model: Google File System (GFS), designed to run on massive clusters of cheap machines, tolerates hardware failure; paper published in 2003. Distributed Compute Model: MapReduce, sends compute to the data on GFS rather than vice versa, vastly simplifies distributed programming; paper published in 2004. Both run on commodity hardware; costs scale linearly.
14 Distributed File System (HDFS) Map Reduce Runs on commodity hardware
15 Hadoop 101 Apache Hadoop is an open source software project that provides a major step toward meeting the big data challenge With Hadoop you can have thousands of disks on hundreds of machines with near linear scaling Uses commodity hardware, no need to purchase expensive or specialized hardware Handles Big Data, Petabytes and more
16 Hadoop History
17 Architecture MapReduce: Parallel computing Move the computation to the data Storage: Keeping track of data and metadata Data is sharded across the cluster Cluster management tools Applications and tools
18 Architecture
19 Nature of MapReduce-able problems: complex data, multiple data sources, lots of it. Nature of analysis: batch processing, parallel execution, data in a distributed file system with computation close to the data. Analysis applications: text mining, risk assessment, pattern recognition, sentiment analysis, collaborative filtering, prediction models.
20 Hadoop Distributed Filesystem
21 Hadoop Cluster Data Failures are expected and managed gracefully
22 HDFS NameNode Architecture
Data is conceptually record-oriented in the Hadoop programming framework. HDFS splits large data files into chunks (default size is 64 MB). Chunks are spread over multiple nodes in the cluster and are also replicated across the cluster for fault tolerance. Shared-nothing architecture. Chunks form a single namespace and are accessible universally. Moving computation to the data allows the Hadoop framework to achieve high data locality and avoid straining network bandwidth. Although files are split into 64 MB or 128 MB blocks, a file smaller than the block size does not occupy a full block on disk. Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files. Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster. (Diagram: a primary NameNode A and a standby NameNode B in front of a set of DataNodes.)
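The chunking and replica placement described above can be sketched in a few lines. This is a toy model under stated assumptions: real HDFS placement is rack-aware and guarantees replicas land on distinct nodes, and the DataNode names here are hypothetical.

```python
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # the 64 MB default from the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs. The last block may be smaller:
    a file smaller than the block size only occupies what it needs."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=3):
    """Toy round-robin replica placement across DataNodes
    (real HDFS is rack-aware and avoids duplicate nodes)."""
    nodes = cycle(datanodes)
    return {i: [next(nodes) for _ in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
layout = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB tail block, each tracked with three replicas, which is exactly the NameNode's metadata role on the slide.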
23 The MapReduce Paradigm
24 (Diagram: data flows from a SAN server and other data sources into the Hadoop cluster, where a MapReduce program produces the result.)
25 MapReduce To use Hadoop, a query is expressed as MapReduce jobs MapReduce is a batch process MapReduce accesses an entire dataset, in parallel, in order to reduce seeks In conventional programs, seek time is generally rate-limiting MapReduce is a streaming process that is not limited by seeks MapReduce tasks are pure functions, meaning they are stateless Pure functions have no side effects and thus can be run in any order Pure functions can even be run multiple times if necessary MapReduce jobs are divided into different phases Map tasks Shuffle phase Reduce tasks
26 Inside MapReduce
Input: "The time has come," the Walrus said, / "To talk of many things: / Of shoes and ships and sealing-wax
Map: (the, 1), (time, 1), (has, 1), (come, 1), ...
Shuffle and sort: come [3, 2, 1]; has [1, 5, 2]; the [1, 2, 1]; time [10, 1, 3]
Reduce: (come, 6), (has, 8), (the, 4), (time, 14)
Pipeline: Input, Map, Shuffle and sort, Reduce, Output
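The word-count flow on this slide can be simulated in a few lines of Python; the three phase functions below are illustrative stand-ins for what the framework does on a cluster, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit one (word, 1) pair per word, independently per record
    cleaned = line.lower().replace(",", "").replace(":", "").replace('"', "")
    for word in cleaned.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle and sort: group all emitted values by key,
    # as the framework would between map and reduce tasks
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: fold each key's value list into the final count
    return {key: sum(values) for key, values in grouped.items()}

lines = ['"The time has come," the Walrus said,',
         '"To talk of many things:',
         'Of shoes and ships and sealing-wax']
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```

Because the mapper and reducer are pure functions, as the previous slide notes, they can run in any order, in parallel, or more than once, and the result is unchanged.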
27 MapReduce Key Phases Map phase Input files have been automatically broken into pieces Data is read on each node using large I/O operations for efficiency Mappers run locally to the data in this step, avoiding the need for most network traffic Each input record is transformed by the mapper independently, so they can all take place at the same time If your cluster isn't big enough to run them all at the same time, it can run them in multiple waves Output of the mapper is a key and a value The MapReduce framework takes care of handling the output and sending it to the right place
28 MapReduce Key Phases Shuffle Moves intermediate results to the reducers and collates Provides all communication between computing elements Rearranges data and involves network traffic Reduce Combines mapper outputs Computes final results Output is done using large writes Output of final reducers is stored to disk
29 What Happens in the Cluster? Disk I/O Highest during map phase when program is reading input data Another peak at end of MapReduce job when final output is written to disk by the reducers Network Shuffle rearranges data and involves large amounts of network traffic Memory Peak memory loads are typically during reduce phase Framework is merging map outputs, reducer is processing merged results Mapper may also have a memory usage peak
30 Disk I/O Network Memory t Input Map Shuffle and sort Reduce Output
31 The Hadoop ecosystem
32 The Hadoop ecosystem
33 Hive Background Started at Facebook Data was collected by nightly cron jobs into Oracle DB ETL via hand-coded python Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that Source: cc-licensed slide by Cloudera
34 Hive Data Model Tables Typed columns (int, float, string, boolean) Also, list: map (for JSON-like data) Partitions For example, range-partition tables by date Buckets Hash partitions within ranges (useful for sampling, join optimization) Source: cc-licensed slide by Cloudera
35 Hive Example Hive looks similar to an SQL database. Relational join on two tables: a table of word counts from the Shakespeare collection and a table of word counts from the Bible. SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 10; Top words returned: the, I, and, to, of, a, you, my, in, is
36 Pig Latin Pig provides a higher-level language, Pig Latin, that: Increases productivity. In one test, 10 lines of Pig Latin replaced 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin. Opens the system to non-Java programmers. Provides common operations like join, group, filter, sort. User Defined Functions are first-class citizens.
37 Pig Latin Script Example
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
Pig slides adapted from Olston et al.
38 Other ways to write MapReduce jobs: Cascading*, Scalding (tuples, Scala), Cascalog (Clojure/Java), Crunch* (functional, Java), and M/R frameworks for scripting languages such as Python, Ruby, etc. *) For details see David Whiting's excellent talk "Scalding the Crunchy Pig for Cascading into the Hive"
39 Hadoop 2.0 / YARN In a cluster there are resources (CPUs, RAM, disks) that need to be managed. In Hadoop 2.0, YARN replaces the MapReduce layer with a more general-purpose scheduler, allowing other types of workloads (e.g., graph processing, MPI) to run alongside MapReduce jobs.
40 MapReduce everywhere? Hadoop MongoDB R Studio Java On-Demand Aggregation
41 Smart Steps Prototype Analyze and visualize mobile data Footfalls Catchment Segmentation by socio-demographic characteristics
42 Smart Steps
43 Smart Steps - challenges Provide a data pipeline Handle huge amounts of data that can be queried on demand Limited hardware resources Provide near 'real-time' performance
44 Smart Steps 1st iteration Web-Application (Java, Spring MVC) MongoDB 2.2 as storage MapReduce with MongoDB and JavaScript MongoDB Sharded
45 Sample MongoDB Document { "cellid": "12345", "date": " ", "hour": 0, "visitors": 15000, "age": { "to10": 1111, "to20": 2222, "to30": 3333 }, "gender": { "male": 4444, "female": 5555 } }
46 Sample MongoDB MapReduce: result for all cells for a month
var mapFunction = function() {
    var yearAndMonth = getYearAndMonth(this.date); // e.g. yearAndMonth =
    emit(yearAndMonth, this.visitors);
};
var reduceFunction = function(yearAndMonth, visitors) {
    return Array.sum(visitors);
};
db.footfalls.mapReduce(mapFunction, reduceFunction, { out: "map_reduce_result" })
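For readers who want to see what the map and reduce above compute, here is the same per-month visitor sum as plain Python over hypothetical footfall documents (the field names mirror the sample document shown earlier; the dates and values are made up for illustration).

```python
from collections import defaultdict

# Hypothetical footfall documents mirroring the slide's schema
footfalls = [
    {"cellid": "12345", "date": "2013-01-05", "visitors": 15000},
    {"cellid": "12346", "date": "2013-01-06", "visitors": 5000},
    {"cellid": "12345", "date": "2013-02-01", "visitors": 7000},
]

def get_year_and_month(date):
    # Stand-in for the getYearAndMonth helper in the MongoDB job
    return date[:7]  # e.g. "2013-01"

# map: emit (yearAndMonth, visitors); reduce: sum values per key
totals = defaultdict(int)
for doc in footfalls:
    totals[get_year_and_month(doc["date"])] += doc["visitors"]
```

The MongoDB job distributes exactly this grouping and summation across shards, which is where the single-threaded JavaScript engine became the bottleneck described on the next slide.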
47 MongoDB MapReduce Result { "results": [ { "_id": " ", "value": }, { "_id": " ", "value": }, { "_id": " ", "value": } ] }
48 Result MapReduce in MongoDB was single-threaded per server instance (version 2.2) The JavaScript engine was slow Slow import Indexes must fit into memory Response times too long
49 Smart Steps 2nd iteration Web-Application (Java, Spring MVC) MongoDB as storage MapReduce with Hadoop MongoDB Sharded
50 Result Slow import Indexes must fit in memory Response times too long Not blocked due to single-thread issues
51 Smart Steps 3rd iteration Web-Application (Java, Spring MVC) MemCached as storage MapReduce with Hadoop Multiple Memcached instances
52 Result Very fast import Entire data must fit into memory Very good response times Very expensive many instances required Data is not persistent
53 GAME CHANGED
54 Smart Steps Last iteration Budget reduced, not enough hardware available How to provide the same amount of data with the new budget? Reducing data will cause loss of user acceptance
55 Smart Steps Last iteration Web-Application (Java, Spring MVC) JumboDB as storage MapReduce with Hadoop One server instance for application and storage!
56 Result Very fast import Low memory footprint (less than 5% of index information) Very good response times Very cheap Provides data workflow and versioning
57 Final architecture RAW events Calculate business aspects Hadoop ecosystem Aggregated data s* pect s a l a hnic Precalculated database c te te a l u c l Ca * sort, index, compress Binary copy Read from Smart Steps Reporting Application
58 Benchmark comparison: JumboDB, MemCached, MongoDB
Setup: JumboDB 1 server, capacity 2 TB EBS (with compression ~10 TB); MongoDB 1 server, capacity 2 TB EBS; MemCached 4 servers, capacity 4x 70 GB RAM.
Import of 70 GB data: JumboDB 7 min 30 s (throughput ~156 MB/s); MongoDB 20 h 18 min (throughput ~0.95 MB/s, 2075 datasets/s); MemCached 6 min 20 s (throughput ~184 MB/s).
Querying with 40 criteria: JumboDB 1220 ms (with data transfer to client); MongoDB aborted after 20 minutes; MemCached 2336 ms (with data transfer to client).
Conclusion: JumboDB fast import, fast querying, only one machine; MongoDB slow import, querying not possible; MemCached fast import, fast querying, but 4 machines.
59 Cost comparison: storing 10 TB data
JumboDB: 1 server with 70 GB RAM; capacity 2 TB EBS (with compression >10 TB); RAID EBS volume with IOPS; about $2,000. Conclusion: cheapest option.
MongoDB: min. 7 servers with 70 GB RAM; capacity 6x 2 TB EBS + 1 MongoS; RAID EBS with IOPS (at the time it was not possible to use MongoDB with more than 500 GB per volume); about $14,000 plus EBS and IOPS. Conclusion: relatively cheap.
MemCached: 143 servers with 70 GB RAM; capacity 143x 70 GB RAM; no EBS volumes. Conclusion: expensive, but no extra EBS costs.
Only server costs are compared, because EBS volume costs are hard to calculate.
60 Reasons for the good performance The database is calculated in a distributed environment Data is immutable No reorganisation of data is required during read operations Data is pre-organised for the main use cases (e.g. sorted by geographic region, so data can be read sequentially) Data is compressed, which uses storage more effectively and speeds up read operations
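The last two points, pre-sorting plus compression, can be sketched as follows. This is an illustration under stated assumptions (short keys stand in for geographic cell ids; this is not JumboDB's actual on-disk format): a range query becomes one sequential decompressing scan, with no index lookups and no reorganisation of the immutable data.

```python
import gzip
import io

# Records are sorted by key once at import time, then compressed.
records = sorted([("u281z", 120), ("u282b", 45), ("d567", 900), ("u281x", 10)])
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    for key, value in records:
        f.write(f"{key}\t{value}\n".encode())

def range_scan(blob, lo, hi):
    """Sequentially decompress and keep keys in [lo, hi): the read is
    one streaming pass over immutable, compressed, pre-sorted data."""
    out = []
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as f:
        for line in f:
            key, value = line.decode().rstrip("\n").split("\t")
            if lo <= key < hi:
                out.append((key, int(value)))
    return out

hits = range_scan(buf.getvalue(), "u281", "u282")
```

Compression shrinks the bytes that must come off disk, and sorting turns a geographic query into a contiguous run of records, which is the sequential-read effect the slide credits for the speed.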
61 Github: Wiki:
62
63 MapReduce example: how to sum on cell base
64 What is a Geohash? Converts coordinates (latitude/longitude) into a single hash value Invented by Gustavo Niemeyer
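A compact sketch of the standard geohash algorithm, assuming the usual scheme: alternately bisect the longitude and latitude ranges to produce bits, then base32-encode each group of five bits. The bit-precision integer variant mentioned later in the talk simply stops before the base32 step.

```python
# Standard geohash base32 alphabet (no a, i, l, o)
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, length=5):
    """Encode a lat/lon pair as a geohash string of `length` characters.
    Even bits refine longitude, odd bits latitude."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True
    while len(bits) < length * 5:
        rng = lon_range if use_lon else lat_range
        val = lon if use_lon else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b  # pack five bits into one base32 digit
        chars.append(BASE32[n])
    return "".join(chars)
```

Nearby points share hash prefixes, which is why emitting a truncated geohash as the map key (as in the MongoDB example below) groups visits into grid cells.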
65 How does it work?
66 How does it work?
67 How does it work?
68 Example London, Piccadilly Circus bit precision Integer value: Geohash String: u281z
69 MongoDB Geohash Example
var mapFunction = function() {
    var geohash = getGeohash24BitPrecision(this.latitude, this.longitude);
    emit(geohash, 1); // emit a count of 1 per visit
};
var reduceFunction = function(geohash, counts) {
    // reduce may run repeatedly on partial results, so it must sum
    // the values rather than count the length of the value array
    return Array.sum(counts);
};
db.visits.mapReduce(mapFunction, reduceFunction, { out: "users_per_grid_cell" })
70 MongoDB MapReduce Result { "results": [ { "_id": "u281z", "value": }, { "_id": "u282b", "value": }, { "_id": "d567", "value": } ] }
71 Future MapReduce: Spark Real-time (from Kafka to Storm) Lambda Architecture SQL on Hadoop: Impala, Apache Drill, Presto
72 Thank you for your attention!
73 Smart Steps Workflow Version 1: Here is my first delivery with 'January' data for 'Collection 1' Version 2: Made some optimizations, data should be better Version 3: There was a mistake in the latest delivery. I corrected it! Data Scientist I have new 'February' data and added a new collection 'Collection 2'. Please extend the 'January' data with it. One month later... jumbodb Version 4: New 'February' data for 'Collection 1' and 'Collection 2' Version 5: Made some optimizations to 'February' data. Version 6: Data is much cooler! Data Scientist DAMN! The latest delivery was faulty. I am not able to fix it quickly! Please roll back to 'Version 5'. Reporting application
74 Smart Steps
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationRelational Processing on MapReduce
Relational Processing on MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Recap: Key relational DBMS notes Key Hadoop
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationReal Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationParquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationAlternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationPlay with Big Data on the Shoulders of Open Source
OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationSuresh Lakavath csir urdip Pune, India lsureshit@gmail.com.
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. Ramlal Naik L Acme Tele Power LTD Haryana, India ramlalnaik@gmail.com. Abstract Big Data
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationNoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements
More informationPetabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationBig Data Workshop. dattamsha.com
Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More informationBig Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
More information