Real-time Streaming Analysis for Hadoop and Flume. Aaron Kimball odiago, inc. OSCON Data 2011



Similar documents
Workshop on Hadoop with Big Data

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Big Data Course Highlights

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

the missing log collector Treasure Data, Inc. Muga Nishizawa

Real-time Big Data Analytics with Storm

HDFS. Hadoop Distributed File System

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Hadoop Job Oriented Training Agenda

COURSE CONTENT Big Data and Hadoop Training

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Scaling Out With Apache Spark. DTL Meeting Slides based on

Towards Smart and Intelligent SDN Controller


Chase Wu New Jersey Ins0tute of Technology

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Complete Java Classes Hadoop Syllabus Contact No:

Cloudera Certified Developer for Apache Hadoop

Xiaoming Gao Hui Li Thilina Gunarathne

Best Practices for Hadoop Data Analysis with Tableau

Openbus Documentation

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Getting Real Real Time Data Integration Patterns and Architectures

Scalable Network Measurement Analysis with Hadoop. Taghrid Samak and Daniel Gunter Advanced Computing for Sciences, LBNL

IBM Software InfoSphere Guardium. Planning a data security and auditing deployment for Hadoop

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

HADOOP. Revised 10/19/2015

Oracle Data Integrator for Big Data. Alex Kotopoulis Senior Principal Product Manager

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Real World Hadoop Use Cases

BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

BIG DATA - HADOOP PROFESSIONAL amron

Predictive Analytics with Storm, Hadoop, R on AWS

Integrating VoltDB with Hadoop

A very short Intro to Hadoop

Big Data and Scripting map/reduce in Hadoop

Apache Flink Next-gen data analysis. Kostas

Peers Techno log ies Pv t. L td. HADOOP

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Integration of Apache Hive and HBase

The Hadoop Eco System Shanghai Data Science Meetup

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

How To Scale Out Of A Nosql Database

Luncheon Webinar Series May 13, 2013

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Big Data? Definition # 1: Big Data Definition Forrester Research

Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Scalability and Design Patterns

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Large Scale Text Analysis Using the Map/Reduce

Business Intelligence for Big Data

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Integrating Big Data into the Computing Curricula

Apache Hadoop: The Big Data Refinery

Hadoop IST 734 SS CHUNG

Hadoop and Map-Reduce. Swati Gore

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Data processing goes big

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Comprehensive Analytics on the Hortonworks Data Platform

Hadoop: The Definitive Guide

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Cosmos. Big Data and Big Challenges. Ed Harris - Microsoft Online Services Division HTPS 2011

Using distributed technologies to analyze Big Data

Microsoft Research Windows Azure for Research Training

What's New in SAS Data Management

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

Transcription:

Real-time Streaming Analysis for Hadoop and Flume Aaron Kimball odiago, inc. OSCON Data 2011

The plan Background: Flume introduction The need for online analytics Introducing FlumeBase Demo! FlumeBase architecture Wrap up

Flume is A distributed data transport and aggregation system for event- or log-structured data Principally designed for continuous data ingestion into Hadoop But more flexible than that Data origins a.k.a. Flume agents Click streams, etc. Aggregate data Flume collector node HDFS

Flume terminology Every machine in Flume is a node Each node has a source and a sink Example source: tail( /var/log/httpd/access_log ) Example sink: dfs( hdfs://namenode/logs/%{host}/%y%m%d ) Some sinks send data to collector nodes, which aggregate data from many agents before writing to HDFS

Flume control plane All Flume nodes heartbeat to/receive config from master Operator tools interact with the master via a Thrift API e.g., the Flume shell Nodes can be reconfigured to use different sources, sinks Agents Collector Thrift API HDFS Flume master Operator

Real-time data moves through Flume Events enter Flume within seconds of generation Hadoop MapReduce analysis runs at best once/10 minutes Desirable behavior: analyze this data on-the-fly Ad campaign cut-off Real-time personalization, recommendations Load and performance monitoring Error alerting

But Flume isn t an analytic system No ability to inspect message bodies Agents No notion of aggregates, rolling counters, etc clickstream or even filters Web backend Collector HDFS Real-time aggregates mysql

But Flume isn t an analytic system No ability to inspect message bodies Agents No notion of aggregates, rolling counters, etc clickstream or even filters Web backend Collector HDFS This leads to fascinating hacks (see right) Real-time aggregates mysql

Flume and Flexibility New sources, sinks can be added from plugins Flume topology can be dynamically reconfigured by sending commands to master over Thrift API Contents of Flume events (messages) are uninterpreted

Flume and Flexibility New sources, sinks can be added from plugins Flume topology can be dynamically reconfigured by sending commands to master over Thrift API Contents of Flume events (messages) are uninterpreted Meaning we can define new endpoints for Flume data, store arbitrary data in events, and control Flume programmatically.

FlumeBase: Online Analytics for Flume

FlumeBase server Runs persistent queries analyzing data streams Events interpreted relative to a user-specified schema, parser Transparently reconfigures source Flume nodes to tee data Acts as a Flume node Output events are just another Flume data stream

rtsql: FlumeBase s query language SQL-like language for defining event schemas, queries CREATE STREAM foo(status INT, msg STRING, priority INT) FROM NODE backend-server-5 ; SELECT * FROM foo WHERE priority > 10;

rtsql language features Lots of standard SQL features available SELECT, WHERE, GROUP BY, HAVING, JOIN Streams are infinite: GROUP BY and JOIN both use windowing to operate over rolling time windows of events Standard aggregate functions: COUNT, MIN, MAX, SUM, AVG

Demo time (Buckle your seatbelts)

Under the hood Incoming data from Flume network Client process Submits queries rtsql compiler FlumeBase Server generates Flume in node Output data printed to client console Stream dictionary Flume controller EventParser Flow operator DAG Flow Flume out node Emitted records return to Flume network

Life of a query Clients submit rtsql queries as simple strings to server Client process Submits queries rtsql compiler FlumeBase Server Incoming data from Flume network Flume in node Compiler parses query to an AST, generates a logical plan (DAG), and maps that to a DAG of physical operators ( HashJoin, Filter, etc) Output data printed to client console Stream dictionary Flume controller generates EventParser Flow operator DAG Flume out node Emitted records return to Flume network Flow

Life of a query Physical operators form a flow, which is injected into the execution thread; continuously reads from input and processes events Client process Submits queries rtsql compiler FlumeBase Server Incoming data from Flume network Flume in node generates Many flows (queries) may run in parallel. Output data printed to client console Stream dictionary Flume controller EventParser Flow operator DAG Flume out node Flow Flows must be explicitly dropped when they re no longer useful Emitted records return to Flume network

Schemas, types and serialization Event data can enter FlumeBase in any format Each stream has: A schema, specifying which fields it has, and their type An EventParser, which can extract fields from the input event Data is internally represented in Avro generic records Output events have Avro binary-encoded records for bodies

Interacting with Flume CREATE STREAM defines a schema that could be applied to the output of a Flume node or source Submitting a query against that stream requires reading from Flume The Flume controller reconfigures the upstream node to send data to FlumeBase, or hosts a new source locally Dropping a query restores the upstream node s original configuration

FlumeBase components and processes FlumeBase abstracts the server concept into an ExecEnvironment Everything can run in a single process: client shell, ExecEnvironment, even Flume nodes and master Better is to leave a long-lived FlumeBase server running and connect clients as needed to examine output, submit or modify queries

Conclusions Real-time analytics require a different system than Hadoop MapReduce Flume provides a suitable basis for an online analytic system SQL-like language allows sophisticated queries with a low learning curve

Check it out! Web site: flumebase.org (docs, blog, etc.) Binary release is batteries included with a data set + walkthrough Get the source: github.com/flumebase/flumebase 100% Apache 2.0 licensed contributors welcome! Thanks for listening! aaron@odiago.com