The Top 7 Hadoop Patterns and Anti-patterns
Alex Holmes (@grep_alex)
whoami: Alex Holmes. Software engineer; working on distributed systems for many years; Hadoop since 2008. @grep_alex | grepalex.com
what's hadoop... and what can i do with it?
Format: Look at 7 real-world problems. Each problem has different ways it can be solved. You get to vote on your favorite choice. I'll discuss and suggest best practices.
Agenda Data integration and storage Study 1: Data ingest Study 2: Files in HDFS Study 3: Data organization Study 4: MapReduce objects Computation Study 5: Data exploration Study 6: Stream processing Administration Study 7: Hadoop admin
Hadoop in 1 slide
YARN applications: stream processing, predictive analytics, SQL, DAG, graph, MapReduce
Hadoop, the kernel for big data: YARN (resource scheduler) + HDFS (distributed filesystem)
Study 1: Data ingest
I need OUR ORACLE DATA TO BE LOADED INTO HADOOP FOR ANALYTICS
Vote
A: Build an ETL solution that I can customize to my needs
B: Research and use a third-party solution
Custom ETL vs. Sqoop into Hadoop. Problems you'll need to solve with a custom solution: reliability/fault tolerance, scalability/throughput, handling large tables, throttling network IO, idempotent writes, scheduling.
OH, AND WE NEED To use the same data to calculate aggregates in real-time
Sqoop → Hadoop? NoSQL?
Hadoop OLTP OLAP/EDW HBase Cassandra Voldemort Rec. Engine Analytics Security Search Monitoring Social Graph
Hadoop OLTP OLAP/EDW HBase Cassandra Voldemort Rec. Engine Analytics Security Search Monitoring Social Graph, connected via Kafka
Takeaways: Avoid DIY solutions if possible. Use Sqoop for relational exports/imports to Hadoop (http://sqoop.apache.org/). Adopt Kafka for general-purpose data integration (http://kafka.apache.org/).
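A typical Sqoop import looks like the sketch below; the connection string, credentials, table name, and target directory are placeholders, not values from the talk. Sqoop takes care of splitting the table across parallel map tasks, retries, and writing into HDFS:

```shell
# Import an Oracle table into HDFS using 8 parallel map tasks.
# All connection details below are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table SALES \
  --num-mappers 8 \
  --target-dir /data/sales
```

The `--num-mappers` flag controls parallelism (and therefore load on the source database), which addresses the throttling and throughput problems listed above.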
Study 2: Files in HDFS
Save downloaded webpages in HDFS: a fleet of web crawlers writing into Hadoop
Vote
A: Write a separate file for each URL
B: Buffer and write coalesced records to 1MB files
C: Buffer and write coalesced records to 1GB files
NameNode memory: each file consumes ~600 bytes in memory
10^9 files ~= 60GB RAM
Small files fallout: heap pressure on NameNodes; hard to fix once you discover the problem; a performance anti-pattern, since HDFS is designed for large files.
Solutions: Coalesce records as you write. Compact small files (https://github.com/alexholmes/hdfscompact). HDFS Federation. Explore MapR's Hadoop distribution.
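The first solution, coalescing as you write, can be sketched client-side. This minimal illustration uses only `java.nio` and a newline record separator of my own choosing; real ingest pipelines typically write SequenceFile or Avro container files instead:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class Coalesce {
    // Concatenate many small payload files into one large target file,
    // so HDFS ends up storing a single big file instead of thousands of
    // small ones (each of which costs NameNode heap).
    static void coalesce(List<Path> smallFiles, Path target) throws IOException {
        try (var out = Files.newOutputStream(target, StandardOpenOption.CREATE,
                                             StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path p : smallFiles) {
                Files.copy(p, out);   // append this file's bytes to the target
                out.write('\n');      // record separator between source files
            }
        }
    }
}
```

The same idea applies at crawl time: buffer pages in memory and flush them as one large file (option C above), rather than one file per URL.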
Study 3: Data organization
Hadoop
Vote
A: Partition your data by date
B: Write all the records into a single directory
C: A single HDFS directory doesn't support large data volumes, so design writes so that each directory doesn't exceed 1TB
Hadoop: discarding DWH research
Partition your data: understand access patterns; talk to your users about how they'll use the data; at a minimum, partition your data by date.
Partitioning
hdfs:/data/tweets/date=20140929/
hdfs:/data/tweets/date=20140930/
hdfs:/data/tweets/date=20141001/
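A partition path in that layout can be derived directly from the event date; a minimal sketch (the `/data/tweets` prefix comes from the slide, the class and method names are my own):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class Partitioner {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20140929.
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    // Map an event date to its Hive-style date partition directory.
    static String partitionPath(LocalDate eventDate) {
        return "hdfs:/data/tweets/date=" + eventDate.format(FMT) + "/";
    }
}
```

Writing each record under the directory returned here means a query for one day touches one partition instead of scanning the whole table.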
Takeaways Full tablescans are expensive in Hadoop Talk to your users about access patterns Partition your data, at a minimum using the date
Study 4: MapReduce objects
1TB input → Map tasks → shuffle → Reduce tasks → HDFS
Scanning 1TB at 223MB/s ≈ 74 mins
Calculate the minimum value for each word
Input dataset: banana 1, banana 7, banana 2
The min value for banana is 1
Identity Mapper
Calculate the minimum value
Input dataset: banana 1, banana 7, banana 2
Voting choices: A. Integer.MAX_VALUE  B. 7  C. 1  D. 2
The answer is D: the job outputs 2. The reducer stored a reference to the reused Writable rather than copying its value, so the "minimum" ends up pointing at the last record read (banana 2).
The fix
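Both the bug and the fix can be shown in plain Java. This is a sketch: `IntWritable` here is a simplified stand-in for Hadoop's class, and `reusingIterable` mimics how MapReduce's reduce-side iterator recycles a single instance across `next()` calls:

```java
import java.util.Iterator;

public class WritableReuseDemo {
    // Simplified stand-in for Hadoop's IntWritable (illustration only).
    static class IntWritable {
        private int value;
        void set(int v) { value = v; }
        int get()       { return value; }
    }

    // MapReduce reuses ONE Writable across the reduce iterator; this
    // iterable mimics that by mutating a single shared holder.
    static Iterable<IntWritable> reusingIterable(int... values) {
        IntWritable holder = new IntWritable();
        return () -> new Iterator<IntWritable>() {
            int i = 0;
            public boolean hasNext()   { return i < values.length; }
            public IntWritable next()  { holder.set(values[i++]); return holder; }
        };
    }

    // BUG: stores a reference to the reused object, so "min" silently
    // tracks whatever value was read last, not the minimum.
    static int buggyMin(Iterable<IntWritable> values) {
        IntWritable min = null;
        for (IntWritable v : values) {
            if (min == null || v.get() < min.get()) min = v;  // reference!
        }
        return min.get();
    }

    // FIX: copy the primitive value out of the Writable before comparing.
    static int correctMin(Iterable<IntWritable> values) {
        int min = Integer.MAX_VALUE;
        for (IntWritable v : values) min = Math.min(min, v.get());
        return min;
    }

    public static void main(String[] args) {
        System.out.println(buggyMin(reusingIterable(1, 7, 2)));   // prints 2 (wrong)
        System.out.println(correctMin(reusingIterable(1, 7, 2))); // prints 1 (right)
    }
}
```

The same rule applies to keys and to `Text`/other Writables: if you need a value beyond the current iteration, copy it (e.g. `new IntWritable(v.get())` or `WritableUtils.clone`).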
Takeaways: MapReduce tries to reduce GC pressure by reusing Writable objects. Never store a reference to a Writable obtained from the reduce iterator; copy the value out instead. A rookie mistake in MapReduce (I make it all the time).
Study 5: Data exploration
why is the number of disengaged users on an upward trend?... i want results today!
Activities you'll need to perform (over datasets such as Accounts, Events, Connections):
Execute low-latency queries
Rapidly work with your code in a shell
Iterative processing
Vote
A: MapReduce
B: SQL-on-Hadoop
C: Another tool
MapReduce != low-latency
Replicated disk barriers: every Map and Reduce stage ends with a write barrier, where output is written to replicated disk before the next stage (or the next chained MapReduce job) can start.
MapReduce is verbose
SQL-on-Hadoop: Hive, Impala, Drill, Spark SQL
Spark
Takeaways: MapReduce isn't the best fit for iterative/graph processing or for interactive data discovery. MapReduce should be reserved for production jobs that are a good match for that style of computing. Spark, Hive, and Impala offer high-level, low-latency interactive access to your data.
Study 6: Stream processing
I need you to calculate the top 10 trending news articles in real-time
App App Kafka HBase Hadoop
there's a bug in your aggregation logic!
Vote
A: Use a batch processing tier to recalculate aggregates
B: Replay Kafka data to Storm
App → Kafka → Camus → HDFS → MapReduce / Spark → HBase (batch tier)
Takeaways: Correcting bugs and backfilling data is the dirty reality of working with streaming systems. Consider writing aggregation code in a way that can be leveraged in both streaming and batch. Take a look at Summingbird (a Twitter project), which implements the Lambda architecture. Read Nathan Marz's book Big Data.
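One way to share aggregation code between the streaming and batch tiers is to keep it a pure function of its inputs, so a Storm bolt and a MapReduce/Spark job can call the same logic. A sketch (class and method names are my own, not from Summingbird):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopArticles {
    // Pure, associative merge of per-article view counts. Because it has
    // no side effects, the streaming tier can apply it incrementally and
    // the batch tier can re-apply it over all of history to fix bugs.
    static Map<String, Long> mergeCounts(Map<String, Long> acc, Map<String, Long> delta) {
        Map<String, Long> out = new HashMap<>(acc);
        delta.forEach((article, n) -> out.merge(article, n, Long::sum));
        return out;
    }

    // Top-N trending articles from the merged counts.
    static List<String> topN(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

This is essentially what Summingbird formalizes: aggregations expressed as associative merges that run unchanged on Storm and on Hadoop.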
Study 7: Hadoop admin
Resident Hadoop admin ("Bob"): working with Hadoop for 1 year; manages a number of small Hadoop clusters.
Your organization has to build, set up and manage a 1,000-node Hadoop cluster to support critical product features. What should you do?
Vote
A: Use Bob!
B: Get Bob certified as a Hadoop admin
C: Build a DevOps team
D: Get a support contract
Hadoop admin requires DevOps
An average day for Hadoop DevOps:
Replacing failed hard drives
Diagnosing a DataNode failure by correlating logs across multiple machines
Debugging Hadoop code, patching their Hadoop distribution and contributing back
Dealing with data scientists who are determined to bring down the cluster
Takeaways: To support a mission-critical Hadoop cluster you must have DevOps engineers on hand (ideally more than one) who are familiar with Hadoop administration and patching Hadoop code. A vendor support contract can be a welcome bonus: the vendor will (eventually) patch your issues and roll the fixes into their next release.
Conclusion: Hadoop is a powerful tool that integrates well with other systems and helps solve data integration problems. It's useful for solving many problems, but look out for anti-patterns. Make sure you have support.
Shameless plug Book signing at JavaOne bookstore - 1pm today! BOF3612 - Using Kafka to Optimize Data Movement and System Integration