The Top 7 Hadoop Patterns and Anti-patterns. Alex Holmes

1 The Top 7 Hadoop Patterns and Anti-patterns Alex Holmes

2 whoami
- Alex Holmes
- Software engineer
- Working on distributed systems for many years
- Hadoop since
- grepalex.com

3 What's Hadoop... and what can I do with it?

5 Jonathan Khoo via Flickr

6 Format
- Look at 7 real-world problems
- Each problem can be solved in different ways
- You get to vote on your favorite choice
- I'll discuss and suggest best practices

7 Agenda
Data integration and storage
- Study 1: Data ingest
- Study 2: Files in HDFS
- Study 3: Data organization
- Study 4: MapReduce objects
Computation
- Study 5: Data exploration
- Study 6: Stream processing
Administration
- Study 7: Hadoop admin

8 Hadoop in 1 slide
Hadoop, the kernel for big data
- YARN applications: stream processing, predictive analytics, SQL, DAG, graph, MapReduce
- YARN (resource scheduler)
- HDFS (distributed filesystem)

9 Agenda: Data integration and storage, Study 1: Data ingest

10 I need our Oracle data to be loaded into Hadoop for analytics

11 Vote
A. Build an ETL solution that I can customize to my needs
B. Research and use a third-party solution

12 Sqoop, Custom ETL, Hadoop
Problems you'll need to solve:
- Reliability/fault tolerance
- Scalability/throughput
- Handle large tables
- Throttle network IO
- Idempotent writes
- Scheduling

13 Oh, and we need to use the same data to calculate aggregates in real-time

14 Sqoop Hadoop? NoSQL

15 Hadoop OLTP OLAP / EDW HBase Cassandra Voldemort Rec. Engine Analytics Security Search Monitoring Social Graph

16 Hadoop OLTP OLAP / EDW HBase Cassandra Voldemort Kafka Rec. Engine Analytics Security Search Monitoring Social Graph

17 Takeaways
- Avoid DIY solutions if possible
- Use Sqoop for relational exports/imports to Hadoop
- Adopt Kafka for general-purpose data integration

18 Agenda: Data integration and storage, Study 2: Files in HDFS

19 Save downloaded webpages in HDFS [Diagram: Web, multiple Crawlers, Hadoop]

20 Vote
A. Write a separate file for each URL.
B. Buffer and write coalesced records to 1MB files.
C. Buffer and write coalesced records to 1GB files.

21 NameNode

22 600 bytes in memory

23 10^9 files ~= 60GB RAM

24 Small files fallout
- Heap pressure on NameNodes
- Hard to fix once you discover the problem
- Performance anti-pattern: HDFS is designed for large files

25 Solutions
- Coalesce records as you write
- Compact small files (hdfscompact)
- HDFS Federation
- Explore MapR's Hadoop distribution
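
The "coalesce records as you write" idea can be sketched as a writer that buffers many small records and only rolls to a new file once a size threshold is reached. This is a minimal illustration, not code from the talk: `CoalescingWriter`, the `part-NNNNN` file naming, and the example URLs are all hypothetical, the local filesystem stands in for HDFS, and the 1KB threshold is deliberately tiny so the demo runs quickly.

```python
import os
import tempfile


class CoalescingWriter:
    """Buffer small records and write them out as fewer, larger files."""

    def __init__(self, out_dir, max_bytes):
        self.out_dir = out_dir
        self.max_bytes = max_bytes  # roll to a new file at this buffered size
        self.buffer = []
        self.buffered_bytes = 0
        self.file_index = 0
        self.paths = []

    def write(self, record: bytes):
        self.buffer.append(record)
        self.buffered_bytes += len(record) + 1  # +1 for the newline separator
        if self.buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        """Write all buffered records to one new file and reset the buffer."""
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, f"part-{self.file_index:05d}")
        with open(path, "wb") as f:
            for record in self.buffer:
                f.write(record + b"\n")
        self.paths.append(path)
        self.file_index += 1
        self.buffer = []
        self.buffered_bytes = 0


def demo():
    """Write 1,000 tiny records; far fewer than 1,000 files come out."""
    out_dir = tempfile.mkdtemp()
    writer = CoalescingWriter(out_dir, max_bytes=1024)
    for i in range(1000):
        writer.write(f"http://example.com/page/{i}\t<html>...</html>".encode())
    writer.flush()  # do not lose the final partial buffer
    return writer.paths


if __name__ == "__main__":
    print(len(demo()), "files written for 1000 records")
```

In a real crawler the threshold would be sized to the target file size under discussion rather than 1KB, and the final `flush()` would run on shutdown so buffered records are not lost.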

26 Agenda: Data integration and storage, Study 3: Data organization

27 Hadoop

28 Vote
A. Partition your data by date.
B. Write all the records into a single directory.
C. A single HDFS directory doesn't support large data volumes, so design writes so that each directory doesn't exceed 1TB.

29 Hadoop discarding DWH research

30 Partition your data
- Understand access patterns
- Talk to your users about how they'll use the data
- At a minimum, partition your data by date

31 Partitioning
hdfs:/data/tweets/date= /
hdfs:/data/tweets/date= /
hdfs:/data/tweets/date= /
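
A small sketch of how date-partitioned paths like these might be generated, and how a date-range query can then touch only the matching directories. The `date=` key and the `hdfs:/data/tweets` base come from the slide; the ISO `YYYY-MM-DD` value format and the helper names `partition_path` and `partitions_between` are assumptions made for illustration.

```python
from datetime import date, timedelta


def partition_path(base, day):
    """Build the directory for one day, e.g. <base>/date=2014-01-02/ (format assumed)."""
    return f"{base.rstrip('/')}/date={day.isoformat()}/"


def partitions_between(base, start, end):
    """Paths for every day in [start, end], inclusive."""
    days = (end - start).days
    return [partition_path(base, start + timedelta(days=n)) for n in range(days + 1)]


if __name__ == "__main__":
    # A job over four days reads four directories, not the whole dataset.
    for path in partitions_between("hdfs:/data/tweets", date(2014, 1, 30), date(2014, 2, 2)):
        print(path)
```

Because the partition value is encoded in the directory name, a job that needs a few days of data lists only those directories instead of scanning everything under the base path.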

32 Takeaways
- Full table scans are expensive in Hadoop
- Talk to your users about access patterns
- Partition your data, at a minimum using the date

33 Agenda: Data integration and storage, Study 4: MapReduce objects

34 [Diagram: Map tasks read from HDFS, shuffle to Reduce tasks, which write to HDFS] 1TB at 223MB/s = 74 mins
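
The 74-minute figure is consistent with moving 1TB at a sustained 223MB/s. The check below is plain arithmetic, assuming decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes); `transfer_minutes` is a hypothetical helper name.

```python
TB = 10**12  # decimal terabyte, an assumption about the slide's units
MB = 10**6   # decimal megabyte


def transfer_minutes(total_bytes, bytes_per_second):
    """Minutes needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 60


if __name__ == "__main__":
    minutes = transfer_minutes(1 * TB, 223 * MB)
    print(f"{minutes:.1f} minutes")  # prints "74.7 minutes"
```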

35 Calculate the minimum value for each word
Input dataset: banana 1, banana 7, banana 2
Min value for banana is 1

36 Identity Mapper

37 Calculate the minimum value

38 Input dataset: banana 1, banana 7, banana 2
Voting choices
A. Integer.MAX_VALUE
B. 7
C. 1
D. 2

40 banana 1 banana 7 banana 2

41 The fix

42 Takeaways
- MapReduce tries to reduce GC by reusing Writable objects
- Never store a reference to a Writable obtained from the reduce iterator; copy its value instead
- Rookie mistake in MapReduce (I make it all the time)
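
The object-reuse pitfall can be reproduced outside Hadoop. The sketch below is a plain-Python simulation, not the code shown in the talk: `IntBox` is a hypothetical stand-in for a mutable Writable such as `IntWritable`, and `reused_values` mimics a reduce iterator that hands back the same object on every step with its contents overwritten. The buggy reducer is an assumed reconstruction of the mistake the takeaways describe.

```python
class IntBox:
    """Hypothetical stand-in for a mutable Writable such as IntWritable."""

    def __init__(self, value=0):
        self.value = value


def reused_values(numbers):
    """Yield the SAME IntBox every time, overwriting its contents."""
    box = IntBox()
    for n in numbers:
        box.value = n
        yield box


def min_buggy(values):
    smallest = None
    for v in values:
        if smallest is None or v.value < smallest.value:
            smallest = v  # bug: keeps a reference to the reused object
    return smallest.value


def min_fixed(values):
    smallest = None
    for v in values:
        if smallest is None or v.value < smallest:
            smallest = v.value  # fix: copy the primitive value out
    return smallest


if __name__ == "__main__":
    print(min_buggy(reused_values([1, 7, 2])))  # prints 2, the last value seen
    print(min_fixed(reused_values([1, 7, 2])))  # prints 1
```

In the buggy version the stored reference and the current value are the same object after the first record, so the comparison never succeeds again and the result is whatever value the iterator wrote last.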

43 Agenda: Computation, Study 5: Data exploration

44 Why is the number of disengaged users on an upward trend? ... I want results today!

45 Activities you'll need to perform
- Execute low-latency queries
- Rapidly work with your code in a shell
- Iterative processing
[Diagram labels: Accounts, Events, Connections]

46 Vote
A. MapReduce
B. SQL-on-Hadoop
C. Another tool

47 MapReduce != low-latency

48 Replicated disk barriers [Diagram: chained Map and Reduce stages, with a write barrier after each stage]

50 MapReduce is verbose

51 SQL: Hive, Impala, Drill, Spark SQL

52 Spark

53 Spark

54 Takeaways
- MapReduce isn't the best fit for iterative/graph processing or for interactive data discovery
- MapReduce should be reserved for production jobs that are a good match for that style of computing
- Spark, Hive and Impala offer high-level, low-latency interactive access to your data

55 Agenda: Computation, Study 6: Stream processing

56 I need you to calculate the top 10 trending news articles in real-time

57 App App Kafka HBase Hadoop

58 There's a bug in your aggregation logic!

59 Vote
A. Use a batch processing tier to recalculate aggregates
B. Replay Kafka data to Storm

60 HBase App Kafka Camus HDFS MapReduce / Spark

61 Takeaways
- Correcting bugs and backfilling data is the dirty reality of working with streaming systems
- Consider writing aggregation code in a way that can be leveraged in both streaming and batch
- Take a look at Summingbird (a Twitter project), which implements the Lambda architecture
- Read Nathan Marz's book Big Data
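
One way to read the "leveraged in both streaming and batch" advice is to keep the aggregation logic in a single function and call it from both paths. The sketch below is plain Python and does not use Summingbird, Storm, or any Hadoop API; `count_views`, `streaming_step`, `batch_recompute`, and the article ids are hypothetical names chosen for the trending-articles example.

```python
from collections import Counter


def count_views(article_ids, counts=None):
    """Shared aggregation logic: add one view per article id."""
    counts = Counter() if counts is None else counts
    counts.update(article_ids)
    return counts


def top_n(counts, n=10):
    """The n most-viewed articles as (article_id, views) pairs."""
    return counts.most_common(n)


def streaming_step(counts, event):
    """Streaming path: fold a single event into the running counts."""
    return count_views([event], counts)


def batch_recompute(events):
    """Batch path: rebuild the counts from the full event history."""
    return count_views(events)


if __name__ == "__main__":
    history = ["a", "b", "a", "c", "a", "b"]

    live = Counter()
    for event in history:
        live = streaming_step(live, event)

    # After a bug fix, the batch path can backfill with the same logic.
    rebuilt = batch_recompute(history)
    print(top_n(live, 2), live == rebuilt)
```

Because both paths call the same function, a fix to the aggregation logic applies to the live counts and to any batch backfill over the stored history.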

62 Agenda: Administration, Study 7: Hadoop admin

63 Resident Hadoop admin
- Working with Hadoop for 1 year
- Manages a number of small Hadoop clusters
photos/pitadel/

64 Your organization has to build, set up and manage a 1,000-node Hadoop cluster to support critical product features. What should you do?

65 Vote
A. Use Bob!
B. Get Bob certified as a Hadoop admin
C. Build a DevOps team
D. Get a support contract

66 Hadoop admin requires DevOps

67 An average day for Hadoop DevOps
- Replacing failed hard drives
- Diagnosing a DataNode failure by correlating logs across multiple machines
- Debugging Hadoop code, patching their Hadoop distribution and contributing back
- Dealing with data scientists who are determined to bring down the cluster

68 Takeaways
- To support a mission-critical Hadoop cluster you must have at least one (ideally more than one) DevOps engineer on hand who is familiar with Hadoop admin and with patching Hadoop code
- A vendor support contract can be a welcome bonus
- The vendor will (eventually) patch your issues and roll them into their next release

69 Conclusion
- Hadoop is a powerful tool that integrates well with other systems
- Helps solve data integration
- Useful for solving many problems, but look out for anti-patterns
- Make sure you have support

70 Shameless plug
- Book signing at JavaOne bookstore, 1pm today!
- BOF: Using Kafka to Optimize Data Movement and System Integration
