The Top 7 Hadoop Patterns and Anti-patterns. Alex Holmes @




whoami
- Alex Holmes
- Software engineer
- Working on distributed systems for many years
- Hadoop since 2008
- @grep_alex
- grepalex.com

What's Hadoop... and what can I do with it?

Jonathan Khoo via Flickr

Format
- Look at 7 real-world problems
- Each problem can be solved in several different ways
- You get to vote on your favorite choice
- I'll discuss and suggest best practices

Agenda
Data integration and storage
- Study 1: Data ingest
- Study 2: Files in HDFS
- Study 3: Data organization
- Study 4: MapReduce objects
Computation
- Study 5: Data exploration
- Study 6: Stream processing
Administration
- Study 7: Hadoop admin

Hadoop in 1 slide
Hadoop, the kernel for big data:
- HDFS (distributed filesystem)
- YARN (resource scheduler)
- YARN applications: MapReduce, SQL, DAG, Graph, Stream processing, Predictive analytics

Study 1: Data ingest (Data integration and storage)

I need our Oracle data to be loaded into Hadoop for analytics

Vote
A. Build an ETL solution that I can customize to my needs
B. Research and use a third-party solution

Sqoop vs. custom ETL into Hadoop
Problems you'll need to solve:
- Reliability/fault tolerance
- Scalability/throughput
- Handle large tables
- Throttle network IO
- Idempotent writes
- Scheduling

Oh, and we need to use the same data to calculate aggregates in real time

Sqoop Hadoop? NoSQL

[Diagram: point-to-point integration between Hadoop, OLTP, OLAP / EDW, HBase, Cassandra, Voldemort, Rec. Engine, Analytics, Security, Search, Monitoring, Social Graph]

[Diagram: the same systems (Hadoop, OLTP, OLAP / EDW, HBase, Cassandra, Voldemort, Rec. Engine, Analytics, Security, Search, Monitoring, Social Graph) integrated through Kafka]

Takeaways
- Avoid DIY solutions if possible
- Use Sqoop for relational exports/imports to Hadoop http://sqoop.apache.org/
- Adopt Kafka for general purpose data integration http://kafka.apache.org/

Study 2: Files in HDFS (Data integration and storage)

Save downloaded webpages in HDFS
[Diagram: Web -> multiple crawlers -> Hadoop]

Vote
A. Write a separate file for each URL.
B. Buffer and write coalesced records to 1MB files.
C. Buffer and write coalesced records to 1GB files.

NameNode

600 bytes in memory

10^9 files x 600 bytes ~= 600GB RAM
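The memory estimate can be reproduced with a back-of-envelope calculation. The sketch below takes the 600 bytes-per-file figure from the slide at face value; the function name and the choice of decimal gigabytes are assumptions for illustration. At that rate, 10^8 files come to about 60 GB of NameNode heap and 10^9 files to about 600 GB.

```python
def namenode_heap_gb(num_files, bytes_per_file=600):
    """Rough NameNode heap needed for file metadata, in decimal gigabytes.

    bytes_per_file defaults to the per-file figure quoted on the slide;
    real clusters vary with block counts and path lengths.
    """
    return num_files * bytes_per_file / 1e9


# Print the estimate for a few cluster sizes.
for n in (10**6, 10**8, 10**9):
    print(f"{n:>13,} files ~= {namenode_heap_gb(n):,.1f} GB")
```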

Small files fallout
- Heap pressure on NameNodes
- Hard to fix once you discover the problem
- Performance anti-pattern - HDFS is designed for large files

Solutions
- Coalesce records as you write
- Compact small files https://github.com/alexholmes/hdfscompact
- HDFS Federation
- Explore MapR's Hadoop distribution
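Coalescing records as you write can be sketched as a buffer that is flushed once it reaches a target size. This is a minimal illustration, not the approach used by any particular tool: the `coalesce` function and its parameters are hypothetical, and each yielded batch stands in for one large file that a crawler would write to HDFS.

```python
def coalesce(records, target_bytes):
    """Group byte-string records into batches of roughly target_bytes each.

    Each yielded batch would be written out as a single large file
    instead of one small file per record.
    """
    batch, size = [], 0
    for rec in records:
        batch.append(rec)
        size += len(rec)
        if size >= target_bytes:
            yield batch
            batch, size = [], 0
    if batch:
        # Flush whatever is left at the end of the stream.
        yield batch


# Ten 400-byte "pages" coalesced into batches of about 1000 bytes.
pages = [b"x" * 400] * 10
batch_sizes = [len(b) for b in coalesce(pages, 1000)]
print(batch_sizes)
```

In practice the target would be far larger (the 1GB option in the vote), so that each output file spans several HDFS blocks.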

Study 3: Data organization (Data integration and storage)

Hadoop

Vote
A. Partition your data by date.
B. Write all the records into a single directory.
C. A single HDFS directory doesn't support large data volumes, so design writes so that each directory doesn't exceed 1TB.

Hadoop discarding DWH research

Partition your data
- Understand access patterns
- Talk to your users about how they'll use the data
- At a minimum partition your data by date

Partitioning
hdfs:/data/tweets/date=20140929/
hdfs:/data/tweets/date=20140930/
hdfs:/data/tweets/date=20141001/
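A small helper can generate these directory names from real calendar dates, which keeps the partition values valid across a month boundary. This is a sketch: `partition_path` is a hypothetical helper, and only the `hdfs:/data/tweets` base path and the `date=YYYYMMDD` layout come from the slide.

```python
from datetime import date, timedelta


def partition_path(base, day):
    """Build a date-partitioned directory name in the date=YYYYMMDD style."""
    return f"{base}/date={day:%Y%m%d}/"


# Three consecutive daily partitions starting 29 September 2014.
start = date(2014, 9, 29)
paths = [
    partition_path("hdfs:/data/tweets", start + timedelta(days=i))
    for i in range(3)
]
for p in paths:
    print(p)
```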

Takeaways
- Full table scans are expensive in Hadoop
- Talk to your users about access patterns
- Partition your data, at a minimum using the date

Study 4: MapReduce objects (Data integration and storage)

[Diagram: MapReduce data flow, HDFS -> Map -> shuffle -> Reduce -> HDFS]
1TB at 223MB/s = 74 mins
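The 74-minute figure follows from dividing the data volume by the throughput. The sketch below assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and assumes that 223MB/s is the overall rate at which the 1TB is moved; neither assumption is stated on the slide.

```python
def transfer_minutes(total_bytes, bytes_per_second):
    """Time in minutes to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 60


# 1TB at 223MB/s, using decimal units.
minutes = transfer_minutes(1e12, 223e6)
print(f"{minutes:.1f} minutes")
```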

Calculate the minimum value for each word
Input dataset:
banana 1
banana 7
banana 2
Min value for banana is 1

Identity Mapper

Calculate the minimum value

Input dataset:
banana 1
banana 7
banana 2
Voting choices:
A. Integer.MAX_VALUE
B. 7
C. 1
D. 2


banana 1 banana 7 banana 2

The fix

Takeaways
- MapReduce tries to reduce GC by reusing Writable objects
- Never store the reference to the Writable in the reduce iterator
- Rookie mistake in MapReduce (I make it all the time)
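The object-reuse pitfall can be imitated outside Hadoop. The following is a Python simulation, not Hadoop code: `IntBox` is a hypothetical stand-in for a mutable Writable such as IntWritable, and `reused_values` imitates a reduce iterator that hands back the same instance on every step. In this simulation, keeping the reference yields the last value seen rather than the minimum, while copying the value out gives the right answer.

```python
class IntBox:
    """Hypothetical stand-in for a mutable Hadoop Writable."""

    def __init__(self, value=0):
        self.value = value


def reused_values(raw):
    """Yield the same IntBox instance each time, overwriting its contents."""
    box = IntBox()
    for v in raw:
        box.value = v
        yield box


def min_buggy(values):
    smallest = None
    for v in values:
        if smallest is None or v.value < smallest.value:
            smallest = v  # keeps a reference to the reused object
    return smallest.value


def min_fixed(values):
    smallest = None
    for v in values:
        if smallest is None or v.value < smallest:
            smallest = v.value  # copies the plain integer out of the object
    return smallest


buggy = min_buggy(reused_values([1, 7, 2]))
fixed = min_fixed(reused_values([1, 7, 2]))
print(buggy, fixed)
```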

Study 5: Data exploration (Computation)

Why is the number of disengaged users on an upward trend? ... I want results today!

Activities you'll need to perform
- Execute low-latency queries
- Rapidly work with your code in a shell
- Iterative processing
[Diagram: Accounts, Events, Connections]

Vote
A. MapReduce
B. SQL-on-Hadoop
C. Another tool

MapReduce != low-latency

Replicated disk barriers
[Diagram: chained MapReduce jobs, with a write barrier after each Map and Reduce stage]

MapReduce is verbose

SQL: Hive, Impala, Drill, Spark SQL

Spark


Takeaways
- MapReduce isn't the best fit for iterative/graph processing or for interactive data discovery
- MapReduce should be reserved for production jobs that are a good match for that style of computing
- Spark, Hive and Impala offer high-level, low-latency interactive access to your data

Study 6: Stream processing (Computation)

I need you to calculate the top 10 trending news articles in real-time

App App Kafka HBase Hadoop

There's a bug in your aggregation logic!

Vote
A. Use a batch processing tier to recalculate aggregates
B. Replay Kafka data to Storm

HBase App Kafka Camus HDFS MapReduce / Spark

Takeaways
- Correcting bugs and backfilling data is the dirty reality of working with streaming systems
- Consider writing aggregation code in a way that can be leveraged in both streaming and batch
- Take a look at Summingbird (a Twitter project), which implements the Lambda architecture
- Read Nathan Marz's book Big Data
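One way to share aggregation code between the two tiers is to keep the logic in a plain function that folds events into running counts, then call it from both paths. The sketch below is an illustration under that assumption, not Summingbird's API: `aggregate` and `top_n` are hypothetical names, and the in-memory lists stand in for a live stream and for data replayed from storage.

```python
from collections import Counter


def aggregate(counts, article_ids):
    """Fold a batch of article-view events into the running counts."""
    counts.update(article_ids)
    return counts


def top_n(counts, n=10):
    """Return the n most viewed articles as (article_id, views) pairs."""
    return counts.most_common(n)


# Streaming path: events arrive in small batches and update live counts.
live = Counter()
for micro_batch in (["a", "b"], ["a", "c"], ["a", "b"]):
    aggregate(live, micro_batch)

# Batch path: the same function recomputes counts from the full history,
# for example after fixing a bug in the aggregation logic.
replayed = aggregate(Counter(), ["a", "b", "a", "c", "a", "b"])

print(top_n(live, 2), top_n(replayed, 2))
```

Because both paths call the same function, a fix to the aggregation logic applies to the live counts and to the backfill alike.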

Study 7: Hadoop admin (Administration)

Resident Hadoop admin
- Working with Hadoop for 1 year
- Manages a number of small Hadoop clusters
https://www.flickr.com/photos/pitadel/4951801589

Your organization has to build, set up and manage a 1,000-node Hadoop cluster to support critical product features. What should you do?

Vote
A. Use Bob!
B. Get Bob certified as a Hadoop admin
C. Build a DevOps team
D. Get a support contract

Hadoop admin requires DevOps

An average day for Hadoop DevOps
- Replacing failed hard drives
- Diagnosing a DataNode failure by correlating logs across multiple machines
- Debugging Hadoop code, patching their Hadoop distribution and contributing back
- Dealing with data scientists who are determined to bring down the cluster

Takeaways
- To support a mission-critical Hadoop cluster you must have at least one (ideally more than one) DevOps engineer on hand who is familiar with Hadoop admin and with patching Hadoop code
- A vendor support contract can be a welcome bonus
- The vendor will (eventually) patch your issues and roll the fixes into their next release

Conclusion
- Hadoop is a powerful tool that integrates well with other systems
- Helps solve data integration
- Useful for solving many problems, but look out for anti-patterns
- Make sure you have support

Shameless plug
- Book signing at JavaOne bookstore - 1pm today!
- BOF3612 - Using Kafka to Optimize Data Movement and System Integration