Scalable Network Measurement Analysis with Hadoop. Taghrid Samak and Daniel Gunter Advanced Computing for Sciences, LBNL


Outline
  Motivation
  Hadoop overview
  Approach
    Doing the right thing, Avro
    What worked, Pig
  Results
  Summary
JointTechs, Summer 2011

Hadoop overview: the Hadoop ecosystem
  ZooKeeper: coordination
  Pig: data flow
  Hive: batch SQL
  Hadoop MapReduce: job scheduling and raw data processing
  HBase: real-time querying
  HDFS (Hadoop Distributed File System): unstructured storage
  Sqoop: data import
  Avro: serialization framework
  www.hadoop.apache.org
  www.cloudera.com

For analysts
  Hadoop supports line-oriented formats
    Large text files
    Uniform structure, or known format
  Other cases: write your own Java classes (record reader, file splitter, ...)

Internet2 owamp logs

Owamp logs
  Binary files with a header
  Line format:
    Index: int, Seqno: int
    SendIP: bytes, RecvIP: bytes
    SendTs: double
    SendErr: float, RecvErr: float
    Delay: float

Challenges
  Hadoop support for binary file formats
  100 GB of owamp test results, in 300K small files
Solutions
  Avro, the right way
  Preprocessing binary files to .csv, the easy way
  Whole file reader
  Streaming
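The "easy way" (preprocessing binary files to .csv) can be sketched as follows. This is a minimal illustration, not the authors' converter: the slides do not give the on-disk owamp layout, so the byte order, the absence of padding, the 4-byte IPv4 addresses, the header size, and the names RECORD, FIELDS and records_to_csv are all assumptions made for the example.

```python
import csv
import io
import socket
import struct

# HYPOTHETICAL record layout (not taken from the owamp format specification):
# little-endian, unpadded, IPv4 addresses stored as 4 raw bytes.
# Fields follow the slide: index, seqno, send IP, recv IP, send timestamp,
# send error, recv error, delay.
RECORD = struct.Struct("<ii4s4sdfff")
FIELDS = ["index", "seqno", "send_ip", "recv_ip",
          "send_ts", "send_err", "recv_err", "delay"]


def records_to_csv(raw: bytes, header_size: int = 0) -> str:
    """Convert fixed-width binary records into line-oriented CSV text."""
    body = raw[header_size:]
    # Ignore a trailing partial record, if any.
    usable = len(body) - len(body) % RECORD.size
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    for index, seqno, snd, rcv, ts, serr, rerr, delay in RECORD.iter_unpack(body[:usable]):
        writer.writerow([index, seqno,
                         socket.inet_ntoa(snd), socket.inet_ntoa(rcv),
                         ts, serr, rerr, delay])
    return out.getvalue()
```

The output is one record per line, which is the shape Hadoop's default text input handles without custom Java classes.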

Avro, first attempt
  Avro is a data serialization system. Avro provides:
    Rich data structures
    A compact, fast, binary data format
    A container file, to store persistent data
    Remote procedure call (RPC)
    Simple integration with dynamic languages
  www.avro.apache.org

Avro, owamp schema
  {
    "type": "record",
    "name": "ow_record",
    "fields": [
      {"name": "index", "type": "int"},
      {"name": "seqno", "type": "int"},
      {"name": "sndip", "type": "int"},
      {"name": "sndport", "type": "int"},
      {"name": "rcvip", "type": "bytes"},
      {"name": "rcvport", "type": "int"},
      {"name": "sndts", "type": "double"},
      {"name": "snderr", "type": "float"},
      {"name": "rcverr", "type": "float"},
      {"name": "delay", "type": "float"}
    ]
  }
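To show what the schema constrains, here is a small sketch that checks a Python dict against it. It uses only the standard json module and a hand-written type table; it is not the Avro library and performs no Avro encoding. The names PY_TYPES and conforms are hypothetical, and the field types are copied from the slide as given.

```python
import json

# Schema text as on the slide, with JSON quoting restored.
SCHEMA = json.loads("""
{"type": "record", "name": "ow_record", "fields": [
  {"name": "index", "type": "int"},
  {"name": "seqno", "type": "int"},
  {"name": "sndip", "type": "int"},
  {"name": "sndport", "type": "int"},
  {"name": "rcvip", "type": "bytes"},
  {"name": "rcvport", "type": "int"},
  {"name": "sndts", "type": "double"},
  {"name": "snderr", "type": "float"},
  {"name": "rcverr", "type": "float"},
  {"name": "delay", "type": "float"}]}
""")

# Assumed mapping from the primitive type names used above to Python types.
PY_TYPES = {"int": int, "float": float, "double": float, "bytes": bytes}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    for f in fields:
        value = record[f["name"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[f["type"]]):
            return False
    return True
```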

Pig, the working example
  Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
    Ease of programming
    Optimization opportunities
    Extensibility
  www.pig.apache.org

Pig Latin script structure
  LOAD: data
  FILTER: performed as early as possible
  GROUP: keys of map-reduce
  GENERATE: aggregations and evaluations
  STORE
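The five stages above can be mirrored in plain Python to make the data flow concrete. This is an analogy only, not Pig Latin and not how Pig executes a script; the three-column input (src, dst, delay), the min/max aggregation, and all function names are assumptions chosen for the sketch.

```python
import csv
import io
from collections import defaultdict


def load(text):
    """LOAD: parse CSV rows of (src, dst, delay). The column layout is assumed."""
    for src, dst, delay in csv.reader(io.StringIO(text)):
        yield src, dst, float(delay)


def keep_valid(rows):
    """FILTER: drop rows as early as possible, before any grouping work."""
    return (row for row in rows if row[2] >= 0)


def group(rows):
    """GROUP: collect delays under the (src, dst) key."""
    groups = defaultdict(list)
    for src, dst, delay in rows:
        groups[(src, dst)].append(delay)
    return groups


def generate(groups):
    """GENERATE: emit one aggregated row (min and max delay) per key."""
    return [(src, dst, min(d), max(d)) for (src, dst), d in sorted(groups.items())]


def store(rows):
    """STORE: write the result back out as CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def run(text):
    return store(generate(group(keep_valid(load(text)))))
```

Filtering before grouping matters for the same reason in both settings: rows dropped early never have to be shuffled to a reducer.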

Queries performed with Pig
  Count number of negative delay values: an indication of a clock synchronization problem
  Basic delay statistics: min, max, mean, and variance
  Delay histogram: similar to owstats output
  Different time scales

Time series
  Original key (groups): <src_domain> <dst_domain> <src_ip> <dst_ip>
  Value to aggregate: mean delay
  Time is added as part of the key by masking to a certain precision
  Example:
    1298879930000, 0.0277492
    1298879940000, 0.0543387
    1298879980000, 0.037
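Masking time into the key can be sketched as below. The example timestamps look like epoch milliseconds rounded down to 10-second boundaries, but that reading is an assumption, as are the names mask_time and mean_delay_series and the choice of integer arithmetic for the mask.

```python
from collections import defaultdict


def mask_time(ts_ms: int, precision_ms: int) -> int:
    """Round a millisecond timestamp down to a multiple of precision_ms."""
    return ts_ms - ts_ms % precision_ms


def mean_delay_series(samples, precision_ms):
    """Mean delay per (key, masked time).

    samples: iterable of (key, ts_ms, delay), where key stands for the
    <src_domain> <dst_domain> <src_ip> <dst_ip> group.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (key, bucket) -> [total, count]
    for key, ts_ms, delay in samples:
        acc = sums[(key, mask_time(ts_ms, precision_ms))]
        acc[0] += delay
        acc[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}
```

Changing precision_ms (for example from 10000 to 60000) changes the time scale of the series without touching the rest of the query.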

Count negative delay values
  LOAD dataset
  Filter by negative numbers
  Project only relevant values, masking time at the same step
  Group by key, <src_domain> <dst_domain> <src_ip> <dst_ip>
  Aggregate mean delay
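A minimal Python sketch of the counting part of this query follows. It keeps only the filter and group steps and counts records per four-part key; time masking is left out for brevity. The tuple layout and the name count_negative are assumptions, and the domain names in any usage are placeholders.

```python
from collections import Counter


def count_negative(records):
    """Count negative-delay records per (src_domain, dst_domain, src_ip, dst_ip).

    records: iterable of (src_domain, dst_domain, src_ip, dst_ip, delay).
    """
    counts = Counter()
    for src_domain, dst_domain, src_ip, dst_ip, delay in records:
        if delay < 0:  # filter first, so only negative samples are grouped
            counts[(src_domain, dst_domain, src_ip, dst_ip)] += 1
    return counts
```

A key with a high count is a candidate for a clock synchronization problem between that pair of hosts.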

Delay histogram (owstats)
  Demo ready!

Count negative delay values

Count negative delay values: SALT negative counts

Delay statistics
  LOAD dataset
  Filter by negative numbers
  Project only relevant values, masking time at the same step, generating delay*delay
  Group by key, <src_domain> <dst_domain> <src_ip> <dst_ip>
  Aggregate statistics

Variance Variance formula Knowing statistics
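The formula on this slide did not survive transcription. A plausible reading, given that the query generates delay*delay, is the standard identity Var(d) = E[d^2] - (E[d])^2, which needs only the count, the sum, and the sum of squares per group. The sketch below computes the population variance that way; the identity, the function name delay_stats, and the (key, delay) input layout are assumptions, not the authors' code.

```python
def delay_stats(records):
    """Return {key: (min, max, mean, variance)} from one pass over (key, delay) pairs.

    Variance is the population variance, computed from running sums as
    E[d^2] - (E[d])^2.
    """
    acc = {}  # key -> [min, max, sum, sum of squares, count]
    for key, delay in records:
        a = acc.setdefault(key, [delay, delay, 0.0, 0.0, 0])
        a[0] = min(a[0], delay)
        a[1] = max(a[1], delay)
        a[2] += delay
        a[3] += delay * delay
        a[4] += 1
    stats = {}
    for key, (lo, hi, total, total_sq, n) in acc.items():
        mean = total / n
        stats[key] = (lo, hi, mean, total_sq / n - mean * mean)
    return stats
```

Because every quantity is a sum, min, or max, partial results from different workers can be combined, which is what makes the approach fit a map-reduce aggregation. The subtraction can lose precision when the variance is tiny relative to the mean.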

Mean delay at different time scales


Outlier detection


Performance
  14 machines
  Fastest: 12-core, Intel(R) Core(TM) i7 CPU


Summary
  Trade-offs
    - Hadoop configuration and tweaking
    - data preprocessing
    - structured formats
    - support for advanced calculation
    + reliability and fault-tolerance
    + usability
    + speed
  Hadoop requires time & effort to set up, but it's worth it.

Future directions
  More Pig Latin options: UDFs
  Other tools: HBase, RHIPE (http://www.stat.purdue.edu/~sguha/rhipe/)
  Usability study

Thank you!