Architectural Considerations for Hadoop Applications
Strata+Hadoop World, San Jose, February 18th, 2015
tiny.cloudera.com/app-arch-slides
Mark Grover @mark_grover, Ted Malaska @TedMalaska, Jonathan Seidman @jseidman, Gwen Shapira @gwenshap
About the book
@hadooparchbook
hadooparchitecturebook.com
github.com/hadooparchitecturebook
slideshare.com/hadooparchbook
About the presenters
Ted Malaska, Principal Solutions Architect at Cloudera; previously lead architect at FINRA; contributor to Apache Hadoop, HBase, Flume, Avro, Pig, and Spark.
Jonathan Seidman, Senior Solutions Architect/Partner Enablement at Cloudera; previously technical lead on the big data team at Orbitz Worldwide; co-founder of the Chicago Hadoop User Group and Chicago Big Data.
About the presenters
Gwen Shapira, Solutions Architect turned Software Engineer at Cloudera; committer on Apache Sqoop; contributor to Apache Flume and Apache Kafka.
Mark Grover, Software Engineer at Cloudera; committer on Apache Bigtop, PMC member on Apache Sentry (incubating); contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig, and Flume.
Logistics
Break at 10:30-11:00. Questions at the end of each section.
Case Study: Clickstream Analysis
Analytics
Web Logs: Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Similar use cases
- Sensors: heart, agriculture, etc.
- Casinos: the session of a person at a table
Pre-Hadoop Architecture: Clickstream Analysis
Clickstream Analysis (Before Hadoop)
[Diagram: web logs (full fidelity, kept ~2 weeks) -> transform/aggregate -> data warehouse -> business intelligence; older data goes to a tape archive]
Problems with the Pre-Hadoop Architecture
- Full-fidelity data is stored for only a short time (~weeks); older data is sent to tape or, even worse, deleted!
- Inflexible workflow: all aggregations must be thought of beforehand.
Effects of the Pre-Hadoop Architecture
- Regenerating aggregates is expensive or, worse, impossible.
- Can't correct bugs in the workflow/aggregation logic.
- Can't experiment on existing data.
Why Is Hadoop a Great Fit? Clickstream Analysis
Why is Hadoop a great fit?
- Volume of clickstream data is huge.
- Velocity at which it comes in is high.
- Variety of the data is diverse: semi-structured data.
- Hadoop enables active archival of the data, aggregation jobs, and querying of the aggregates or the full-fidelity raw data.
Clickstream Analysis (with Hadoop)
[Diagram: web logs -> Hadoop -> business intelligence; Hadoop acts as the active archive (no tape), the aggregation engine, and the querying engine]
Challenges of Hadoop Implementation
Other Challenges: Architectural Considerations
- Storage managers: HDFS? HBase?
- Data storage and modeling: file formats? compression? schema design?
- Data movement: how do we actually get the data into Hadoop? How do we get it out?
- Metadata: how do we manage data about the data?
- Data access and processing: how will the data be accessed once in Hadoop? How can we transform it? How do we query it?
- Orchestration: how do we manage the workflow for all of this?
Case Study Requirements: Overview of Requirements
Overview of Requirements
Data Sources -> Ingestion -> Raw Data Storage (Formats, Schema) -> Processing -> Processed Data Storage (Formats, Schema) -> Data Consumption, with Orchestration (Scheduling, Managing, Monitoring) across all stages
Case Study Requirements: Data Ingestion
Data Ingestion Requirements
[Diagram: many web servers producing logs (e.g. 244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463), plus CRM data and an ODS, all feeding Hadoop]
Data Ingestion Requirements
So we need to be able to support:
- Reliable ingestion of large volumes of semi-structured event data arriving at high velocity (e.g. logs).
- Timeliness of data availability: data needs to be available for processing to meet business service level agreements.
- Periodic ingestion of data from relational data stores.
Case Study Requirements: Data Storage
Data Storage Requirements
- Store all the data
- Make the data accessible for processing
- Compress the data
Case Study Requirements: Data Processing
Processing Requirements
Be able to answer questions like:
- What is my website's bounce rate? (i.e. what % of visitors don't go past the landing page?)
- Which marketing channels are leading to the most sessions?
- Do attribution analysis: which channels are responsible for the most conversions?
Sessionization
A gap of more than 30 minutes between clicks in a website visit starts a new session.
[Diagram: Visitor 1 Session 1, Visitor 1 Session 2, Visitor 2 Session 1]
Case Study Requirements: Orchestration
Orchestration is simple: we just need to execute actions, one after another.
Actually, we also need to handle errors and user notifications.
And...
- Re-start workflows after errors
- Reuse of actions in multiple workflows
- Complex workflows with decision points
- Trigger actions based on events
- Tracking metadata
- Integration with enterprise software
- Data lifecycle
- Data quality control
- Reports
OK, maybe we need a product to help us do all that.
Architectural Considerations: Data Modeling
Data Modeling Considerations
We need to consider the following in our architecture:
- Storage layer: HDFS? HBase? etc.
- File system schemas: how will we lay out the data?
- File formats: what storage formats to use for our data, both raw and processed?
- Data compression formats
Architectural Considerations: Data Modeling - Storage Layer
Data Storage Layer Choices
Two likely choices for raw data: HDFS and HBase.
- HDFS: stores data directly as files; fast scans; poor random reads/writes.
- HBase: stores data as HFiles on HDFS; slow scans; fast random reads/writes.
Data Storage: Storage Manager Considerations
- Incoming raw data: processing requirements call for batch transformations across multiple records, for example sessionization.
- Processed data: access will be via analytical queries, again requiring access to multiple records.
We choose HDFS: the processing needs in this case are served better by fast scans.
Architectural Considerations: Data Modeling - Raw Data Storage
Storage Formats: Raw Data and Processed Data
Data Storage Format Considerations
Logs (plain text)
[Diagram: many plain-text log files accumulating]
Data Storage: Compression
[Slide compares common codecs on compression ratio and splittability: some compress well but are not splittable, some are splittable with caveats; Snappy gets a "Hmmm." and is discussed next]
Raw Data Storage: More About Snappy
- Designed at Google to provide high compression speeds with reasonable compression ratios.
- Not the highest compression, but very good performance for processing on Hadoop.
- Snappy is not splittable, though, which brings us to...
Hadoop File Types
Formats designed specifically to store and process data on Hadoop:
- File-based: SequenceFile
- Serialization formats: Thrift, Protocol Buffers, Avro
- Columnar formats: RCFile, ORC, Parquet
SequenceFile
- Stores records as binary key/value pairs.
- SequenceFile blocks can be compressed; this enables splittability with non-splittable compression codecs.
Avro
- Kind of a SequenceFile on steroids.
- Self-documenting: stores its schema in the file header.
- Provides very efficient storage.
- Supports splittable compression.
Our Format Recommendation for Raw Data: Avro with Snappy
- Snappy provides optimized compression; Avro provides compact storage, self-documenting files, and schema evolution.
- Avro also provides better failure handling than the other choices.
- SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem, but they only support Java.
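For a sense of what this looks like in practice, here is a hedged sketch of an Avro schema for the log records above; the record and field names are illustrative assumptions, not from the case study:

    {"type": "record",
     "name": "LogRecord",
     "fields": [
       {"name": "ip",        "type": "string"},
       {"name": "ts",        "type": "long"},
       {"name": "request",   "type": "string"},
       {"name": "status",    "type": "int"},
       {"name": "bytes",     "type": "int"},
       {"name": "referrer",  "type": "string"},
       {"name": "userAgent", "type": "string"}
     ]}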
But note: for simplicity, we'll use plain text for raw data in our example.
Architectural Considerations: Data Modeling - Processed Data Storage
Storage Formats: Raw Data and Processed Data
Access to Processed Data: Analytical Queries
[Diagram: a table with columns A-D full of values; analytical queries typically read a subset of the columns]
Columnar Formats
- Eliminate I/O for columns that are not part of a query.
- Work well for queries that access a subset of columns.
- Often provide better compression.
These add up to dramatically improved performance for many queries.
[Diagram: a three-row table (1/2/3, 2014-10-13/14/15, abc/def/ghi) stored row-by-row vs. column-by-column]
Columnar Choices: RCFile
- Designed to provide efficient processing for Hive queries.
- Only supports Java; no Avro support; limited compression support.
- Sub-optimal performance compared to newer columnar formats.
Columnar Choices: ORC
- A better RCFile, also designed for efficient processing of Hive queries.
- Only supports Java.
Columnar Choices: Parquet
- Designed to provide efficient processing across Hadoop programming interfaces: MapReduce, Hive, Impala, Pig.
- Multiple language support: Java, C++.
- Good object model support, including Avro.
- Broad vendor support.
These features make Parquet a good choice for our processed data.
Architectural Considerations: Data Modeling - Schema Design
HDFS Schema Design: One Recommendation
- /etl - data in various stages of the ETL workflow
- /data - processed data to be shared with the entire organization
- /tmp - temp data from tools, or data shared between users
- /user/<username> - user-specific data, JARs, and config files
- /app - everything but data: UDF JARs, HQL files, Oozie workflows
Partitioning
- Split the dataset into smaller, consumable chunks.
- A rudimentary form of indexing.
- Reduces the I/O needed to process queries.
Partitioning
Un-partitioned HDFS directory structure: dataset/file1.txt, dataset/file2.txt, ..., dataset/fileN.txt
Partitioned HDFS directory structure: dataset/col=val1/file.txt, dataset/col=val2/file.txt, ..., dataset/col=valN/file.txt
Partitioning Considerations
- What column to partition by?
- Don't have too many partitions (<10,000).
- Don't have too many small files in the partitions.
- Good to have partition sizes of at least ~1 GB, generally a multiple of the block size.
We'll partition by timestamp; this applies to both our raw and processed data.
Partitioning for Our Case Study
Raw dataset: /etl/bi/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
Processed dataset: /data/bikeshop/clickstream/year=2014/month=10/day=10
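To make the partitioned layout concrete, here is a hedged sketch of how a Hive table could be declared over the raw directory above; the table name and column list are illustrative assumptions, not from the slides:

    -- Hypothetical external table over the partitioned raw logs
    CREATE EXTERNAL TABLE rawlogs (
      ip STRING, ts STRING, request STRING, status INT,
      bytes INT, referrer STRING, user_agent STRING)
    PARTITIONED BY (year INT, month INT, day INT)
    LOCATION '/etl/bi/casualcyclist/clicks/rawlogs';

    -- Register the partition for October 10th
    ALTER TABLE rawlogs ADD PARTITION (year=2014, month=10, day=10);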
Architectural Considerations: Data Ingestion
Typical clickstream data sources
- Omniture data on FTP
- Apps
- App logs
- RDBMS
Getting Files from FTP
Don't over-complicate things:

    curl ftp://myftpsite.com/sitecatalyst/myreport_2014-10-05.tar.gz --user name:password \
      | hdfs dfs -put - /etl/clickstream/raw/sitecatalyst/myreport_2014-10-05.tar.gz
Apache NiFi
Event Streaming: Flume and Kafka
Reliable, distributed, and highly available systems that allow streaming events into Hadoop.
Flume:
- Many available data collection sources
- Well integrated into Hadoop
- Supports file transformations
- Can implement complex topologies
- Very low latency
- No programming required
We use Flume when: "We just want to grab data from this directory and write it to HDFS."
Kafka is:
- Very high-throughput publish-subscribe messaging
- Highly available
- Stores data and can replay it
- Can support many consumers with no extra latency
Use Kafka when: "Kafka is awesome. We heard it cures cancer."
Actually, why choose? Use Flume with a Kafka source: it lets you get data from Kafka, run some transformations, and write to HDFS, HBase, or Solr.
In Our Example
We want to ingest events from log files, so Flume's spooling directory source fits, paired with an HDFS sink.
We would have used Kafka if we wanted the data in non-Hadoop systems too.
Short Intro to Flume
[Flume agent pipeline: Sources (Twitter, logs, JMS, webserver, Kafka) -> Interceptors (mask, re-format, validate) -> Selectors (DR, critical) -> Channels (memory, file, Kafka) -> Sinks (HDFS, HBase, Solr)]
Configuration
Declarative: no coding required. The configuration specifies how components are wired together.
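For example, a minimal sketch of such a configuration for the client-tier agent described below; the directory paths and host name are assumptions:

    # Name this agent's components
    client.sources = spool
    client.channels = fc
    client.sinks = avro1

    # Spooling directory source with a timestamp interceptor
    client.sources.spool.type = spooldir
    client.sources.spool.spoolDir = /var/log/weblogs
    client.sources.spool.channels = fc
    client.sources.spool.interceptors = ts
    client.sources.spool.interceptors.ts.type = timestamp

    # Durable file channel
    client.channels.fc.type = file
    client.channels.fc.checkpointDir = /flume/checkpoint
    client.channels.fc.dataDirs = /flume/data

    # Avro sink forwarding to a collector-tier agent
    client.sinks.avro1.type = avro
    client.sinks.avro1.channel = fc
    client.sinks.avro1.hostname = collector1.example.com
    client.sinks.avro1.port = 4141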
Interceptors
- Mask fields
- Validate information against an external source
- Extract fields
- Modify data format
- Filter or split events
"Any sufficiently complex configuration is indistinguishable from code."
A Brief Discussion of Flume Patterns: Fan-in
- A Flume agent runs on each of our servers.
- These client agents send data to multiple collector agents to provide reliability.
- Flume provides support for load balancing.
A Brief Discussion of Flume Patterns: Splitting
- A common need is to split data on ingest, for example sending data to multiple clusters for DR, or to multiple destinations.
- Flume also supports partitioning, which is key to our implementation.
Flume Architecture: Client Tier
[Diagram: web server logs -> Flume agent with a spooling dir source, timestamp interceptor, file channel, and load-balanced Avro sinks forwarding to the collector tier]
Flume Architecture: Collector Tier
[Diagram: collector Flume agents with an Avro source, file channel, and HDFS sinks writing to HDFS]
What If We Were to Use Kafka?
- Add a Kafka producer to our webapp and send clicks and searches as messages (a sketch follows below).
- Flume can ingest events from Kafka.
- We can add a second consumer for real-time processing in Spark Streaming, another consumer for alerting, and maybe a batch consumer too.
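A hedged sketch of what that webapp producer could look like using Kafka's Java producer API from Scala; the broker address and topic name are assumptions, not from the slides:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1.example.com:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Send each click event as a message on an assumed "clickstream" topic;
    // clickLogLine is the raw log line for the click (assumed to be in scope)
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("clickstream", clickLogLine))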
The Kafka Channel
[Diagram: producers A/B/C write to Kafka, which serves as the Flume channel; sinks deliver to HDFS, HBase, or Solr]
The Kafka Channel
[Diagram: Flume sources (Twitter, logs, JMS, webserver) and interceptors (mask, re-format, validate) write to Kafka as the channel; consumers A/B/C read directly from Kafka]
The Kafka Channel
[Diagram: the full agent - sources -> interceptors -> selectors (DR, critical) -> Kafka channel -> sinks to HDFS, HBase, Solr]
Architectural Considerations: Data Processing Engines (tiny.cloudera.com/app-arch-slides)
Processing Engines: MapReduce, Abstractions, Spark, Spark Streaming, Impala
MapReduce
- Oldie but goodie
- Restrictive framework / innovative workarounds
- Extreme batch
MapReduce: Basic High Level
[Diagram: a mapper reads a block of data from HDFS and spills partitioned, sorted data to the local file system; reducers copy that data locally, merge, and write the output file back to HDFS (replicated)]
MapReduce Innovation
- Mapper memory joins
- Reducer memory joins
- Bucketed sorted joins
- Cross-task communication
- Windowing
- And much more
Abstractions
- SQL: Hive
- Script/code: Pig (Pig Latin), Crunch (Java/Scala), Cascading (Java/Scala)
Spark: The New Kid That Isn't That New Anymore
- Easily 10x less code
- Extremely easy and powerful API
- Very good for machine learning
- Scala, Java, and Python
- RDDs
- DAG engine
Spark: DAG
[Diagram: textFile -> filter -> keyBy and textFile -> keyBy feed a join, then filter, then take]
[Diagram: the same DAG during recovery - a lost block is recomputed by replaying only its lineage, while unaffected blocks are reused]
Spark Streaming
- Calling Spark in a loop
- Extends RDDs with DStreams
- Very few code changes to go from ETL to streaming (see the sketch below)
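To illustrate how little changes, a minimal hedged sketch; the source host/port and the filter predicate are placeholders, and sc is assumed to be an existing SparkContext:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Reuse the existing SparkContext; 30-second micro-batches
    val ssc = new StreamingContext(sc, Seconds(30))

    // The same filter/count logic used in batch ETL, applied per batch
    ssc.socketTextStream("log-host.example.com", 9999)
       .filter(_.contains("GET"))
       .count()
       .print()

    ssc.start()
    ssc.awaitTermination()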
Spark Streaming
[Diagram: before the first batch, a receiver turns the source into RDDs; each batch then makes a single pass of filter -> count -> print over its RDDs]
Spark Streaming
[Diagram: the same flow with a stateful RDD carried from batch to batch (stateful RDD 1 -> stateful RDD 2)]
Impala
- MPP-style SQL engine on top of Hadoop
- Very fast
- High concurrency
- Analytical windowing functions (as of CDH 5.2)
Impala: Broadcast Join
[Diagram: each Impala daemon caches 100% of the smaller table and hash-joins its local blocks of the bigger table against that cached copy]
Impala: Partitioned Hash Join
[Diagram: both tables are hash-partitioned across daemons, so each daemon caches roughly 1/N of the smaller table and joins only the matching partition of the bigger table]
Impala vs Hive
Very different approaches, and we may see convergence at some point. But for now: Impala for speed, Hive for batch.
Architectural Considerations: Data Processing - Patterns and Recommendations
What processing needs to happen? Sessionization, filtering, deduplication, BI/discovery.
Sessionization
A gap of more than 30 minutes between clicks in a website visit starts a new session.
[Diagram: Visitor 1 Session 1, Visitor 1 Session 2, Visitor 2 Session 1]
Why Sessionize?
Helps answer questions like:
- What is my website's bounce rate? (i.e. what % of visitors don't go past the landing page?)
- Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
- Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
- Do attribution analysis: which channels are responsible for the most conversions?
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" -> 244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" -> 244.157.45.12+1413583199
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user (partitioning, ordering).
2. Given a particular user's clicks, determine if a given click is part of a new session or a continuation of the previous session (identifying session boundaries).
#1: Which clicks are from the same user?
We can use:
- IP address (244.157.45.12)
- Cookies (A9A3BECE0563982D)
- IP address (244.157.45.12) and user agent string ("(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
#1: Which clicks are from the same user?
Both example records share the IP address 244.157.45.12:
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
#2: Which clicks are part of the same session?
More than 30 minutes apart = different sessions:
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
Sessionization Engine Recommendation
- We have sessionization code in MR and Spark on github.
- The complexity of the code varies, and the choice depends on the expertise in the organization.
- We choose MR: the MR API is stable and widely known, and there is currently no Spark + Oozie (our orchestration engine) integration.
A sketch of the logic follows below.
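For a feel of the logic, independent of the engine choice, here is a minimal hedged Spark sketch of the two sessionization steps; LogLine and parseLogLine are hypothetical helpers, and the full MR and Spark implementations live in the github repo:

    // Runs in spark-shell, where sc is the SparkContext.
    // LogLine and parseLogLine are hypothetical: a real parser would
    // handle the Combined Log Format and convert the timestamp to epoch ms.
    case class LogLine(ip: String, ts: Long, line: String)

    val sessionGap = 30 * 60 * 1000L  // 30-minute boundary, in milliseconds

    val sessionized = sc.textFile("/etl/bi/casualcyclist/clicks/rawlogs")
      .map(parseLogLine)        // hypothetical Combined Log Format parser
      .groupBy(_.ip)            // step 1: group clicks by user (IP here)
      .flatMap { case (ip, clicks) =>
        var sessionId = 0
        var lastTs = Long.MinValue
        clicks.toSeq.sortBy(_.ts).map { click =>  // step 2: find boundaries
          if (click.ts - lastTs > sessionGap) sessionId += 1
          lastTs = click.ts
          (s"$ip-$sessionId", click.line)         // session key like ip+id
        }
      }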
Filtering: filter out incomplete records
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U   <- truncated record
Filtering: filter out records from bots/spiders
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
209.85.238.11 - - [17/Oct/2014:21:59:59] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"   <- Google spider IP address
Filtering Recommendation
- Bot/spider filtering can be done easily in any of the engines.
- Incomplete records are harder to filter in schema-based systems like Hive, Impala, Pig, etc.
- Flume interceptors can also be used.
- It's a pretty close choice between MR, Hive, and Spark; in Spark it can be done with rdd.filter() (see below).
- We can simply embed this in our MR sessionization job.
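For instance, the rdd.filter() approach might look like this; botIps and the parsed fields are assumptions:

    // Drop hits from known bot/spider IPs before sessionizing
    // (parsedLogs is the RDD[LogLine] from the sketch above; botIps is an
    // assumed Set[String] of spider addresses)
    val humanClicks = parsedLogs.filter(click => !botIps.contains(click.ip))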
Deduplication: remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Deduplication Recommendation
- Can be done in all engines.
- We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication.
- In Spark, e.g. rdd.distinct().
- We use Pig (a sketch follows below).
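A hedged sketch of the Pig approach; the paths come from the case study, but the schema-less load and output location are illustrative:

    -- Load a day of raw logs, drop exact duplicates, and store the result
    raw = LOAD '/etl/bi/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10';
    deduped = DISTINCT raw;
    STORE deduped INTO '/data/bikeshop/clickstream/deduped_tmp';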
BI/Discovery Engine Recommendation
Main requirements:
- Low latency
- SQL interface (e.g. JDBC/ODBC)
- Users don't know how to code
We chose Impala: it's a SQL engine, much faster than the other engines, and it provides standard JDBC/ODBC interfaces.
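As an example, the bounce-rate question from the processing requirements could be answered with a query along these lines; the table and column names are assumptions:

    -- % of sessions that never went past one page view
    SELECT 100.0 * SUM(CASE WHEN page_views = 1 THEN 1 ELSE 0 END) / COUNT(*)
           AS bounce_rate_pct
    FROM (SELECT session_id, COUNT(*) AS page_views
          FROM clickstream
          GROUP BY session_id) s;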
End-to-End Processing
Deduplication -> filtering -> sessionization -> BI tools
Architectural Considerations: Orchestration
Orchestrating Clickstream
- Data arrives through Flume.
- This triggers a processing event: sessionize; enrich (location, marketing channel); store as Parquet.
- Each day we process events from the previous day.
Choosing Right
- Workflow is fairly simple
- Need to trigger workflow based on data
- Be able to recover from errors
- Perhaps notify on the status
- And collect metrics for reporting
Oozie or Azkaban?
Oozie Architecture
Oozie Features
- Part of all major Hadoop distributions
- Hue integration
- Built-in actions: Hive, Sqoop, MapReduce, SSH
- Complex workflows with decisions
- Event- and time-based scheduling
- Notifications
- SLA monitoring
- REST API
Oozie Drawbacks
- Overhead in launching jobs
- Steep learning curve
- XML workflows (see the sketch below)
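To make the XML point concrete, a minimal hedged sketch of a workflow definition; the action body is elided to a hypothetical MapReduce sessionization step:

    <workflow-app name="clickstream-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="sessionize"/>
      <action name="sessionize">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <!-- mapper/reducer classes and I/O paths would be set here -->
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Sessionization failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>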
Azkaban Architecture
[Diagram: clients and Hadoop talk to the Azkaban web server and the Azkaban executor server (with HDFS viewer and job-type plugins), backed by MySQL]
Azkaban Features
- Simplicity
- Great UI, including pluggable visualizers
- Lots of plugins: Hive, Pig
- Reporting plugin
Azkaban Limitations
- Doesn't support workflow decisions
- Can't represent data dependencies
Choosing
- Workflow is fairly simple
- Need to trigger workflow based on data
- Be able to recover from errors
- Perhaps notify on the status
- And collect metrics for reporting
Easier in Oozie.
Choosing the Right Orchestration Tool
- Workflow is fairly simple
- Need to trigger workflow based on data
- Be able to recover from errors
- Perhaps notify on the status
- And collect metrics for reporting
Better in Azkaban.
[The slides highlight different items of this checklist for each tool]
Important decision consideration! The best orchestration tool is the one you are an expert on.
Orchestration Patterns: Fan-Out
Orchestration Patterns: Capture & Decide
Putting It All Together: Final Architecture
Final Architecture: High-Level Overview
Data Sources -> Ingestion -> Raw Data Storage (Formats, Schema) -> Processing -> Processed Data Storage (Formats, Schema) -> Data Consumption, with Orchestration (Scheduling, Managing, Monitoring) across all stages
Final Architecture: Ingestion/Storage
[Diagram, fan-in pattern: a Flume agent on each web server sends to multiple collector-tier agents (multiple agents for failover and rolling restarts), which write to HDFS under /etl/bi/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10]
Final Architecture: High-Level Overview
Data Sources -> Ingestion -> Raw Data Storage (Formats, Schema) -> Processing -> Processed Data Storage (Formats, Schema) -> Data Consumption, with Orchestration (Scheduling, Managing, Monitoring) across all stages
Final Architecture: Processing and Storage
[Diagram: /etl/bi/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 -> dedup -> filtering -> sessionization -> parquetize -> /data/bikeshop/clickstream/year=2014/month=10/day=10]
Final Architecture: High-Level Overview
Data Sources -> Ingestion -> Raw Data Storage (Formats, Schema) -> Processing -> Processed Data Storage (Formats, Schema) -> Data Consumption, with Orchestration (Scheduling, Managing, Monitoring) across all stages
Final Architecture: Data Access
[Diagram: Hive/Impala serve BI/analytics tools over JDBC/ODBC; Sqoop exports to the DWH; a DB import tool copies to local disk for R, etc.]
Demo
Join the Discussion: get community help or provide feedback at cloudera.com/community
Try Cloudera Live today: cloudera.com/live
Visit us at Booth #809: book signings, theater sessions, technical demos, giveaways.
Highlights: Apache Kafka is now fully supported with Cloudera. Learn why Cloudera is the leader for security in Hadoop.
Free books and lots of Q&A!
- Ask Us Anything! Feb 19th, 1:30-2:10 PM in 211 B
- Book signings: Feb 19th, 11:15 AM in Expo Hall, Cloudera booth (#809); Feb 19th, 3:00 PM in Expo Hall, O'Reilly booth
Stay in touch! @hadooparchbook, hadooparchitecturebook.com, slideshare.com/hadooparchbook