Scalable Network Measurement Analysis with Hadoop
Taghrid Samak and Daniel Gunter
Advanced Computing for Sciences, LBNL


Outline
  Motivation
  Hadoop overview
  Approach
    doing the right thing, Avro
    what worked, Pig
  Results
  Summary
JointTechs, Summer 2011

Hadoop overview: Hadoop ecosystem
  ZooKeeper: coordination
  Pig: data flow
  Hive: batch SQL
  Hadoop MapReduce: job scheduling and raw data processing
  HBase: real-time querying
  HDFS (Hadoop Distributed File System): unstructured storage
  Sqoop: data import
  Avro: serialization framework
  www.hadoop.apache.org
  www.cloudera.com

For analysts
  Hadoop supports
    Line-oriented formats
    Large text files
    Uniform structure, or a known format
  Other cases
    Write your own Java classes: record reader, file splitter

Internet2 owamp logs

Owamp logs
  Binary files with header
  Line format
    Index: int, Seqno: int
    SendIP: bytes, RecvIP: bytes
    sendts: double
    senderr: float, RecvErr: float
    Delay: float

Challenges
  Hadoop support for binary file formats
  100 GB of owamp test results, spread across 300K small files
Solutions
  Avro, the right way
  Preprocessing binary files to .csv, the easy way
  Whole-file reader
  Streaming
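The "easy way" above, turning decoded records into line-oriented .csv that Hadoop can split, could look roughly like the sketch below. It assumes the binary owamp files have already been decoded into records (the decoding itself is not shown). The `OwampRecord` class, its field names, the dotted-string IP representation, and the column order are illustrative choices that follow the field list on the owamp logs slide; they are not a format the talk specifies.

```python
import csv
import io
from dataclasses import astuple, dataclass


@dataclass
class OwampRecord:
    """Hypothetical in-memory form of one owamp log line."""
    index: int
    seqno: int
    send_ip: str   # assumption: IPs rendered as dotted strings for CSV
    recv_ip: str
    send_ts: float
    send_err: float
    recv_err: float
    delay: float


def records_to_csv(records, out):
    """Write one CSV line per record so many small inputs can be merged."""
    writer = csv.writer(out, lineterminator="\n")
    for rec in records:
        writer.writerow(astuple(rec))


def demo_csv():
    """Build two sample records (made-up values) and return the CSV text."""
    recs = [
        OwampRecord(0, 1, "10.0.0.1", "10.0.0.2", 1298879930.0, 0.001, 0.002, 0.0277),
        OwampRecord(1, 2, "10.0.0.1", "10.0.0.2", 1298879940.0, 0.001, 0.002, -0.0004),
    ]
    buf = io.StringIO()
    records_to_csv(recs, buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(demo_csv(), end="")
```

Appending the output of many small files to one large .csv is what makes the data fit Hadoop's line-oriented input model.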

Avro, first attempt
  Avro is a data serialization system. Avro provides:
    Rich data structures.
    A compact, fast, binary data format.
    A container file, to store persistent data.
    Remote procedure call (RPC).
    Simple integration with dynamic languages.
  www.avro.apache.org

Avro, owamp schema
  {
    "type": "record",
    "name": "ow_record",
    "fields": [
      {"name": "index", "type": "int"},
      {"name": "seqno", "type": "int"},
      {"name": "sndip", "type": "int"},
      {"name": "sndport", "type": "int"},
      {"name": "rcvip", "type": "bytes"},
      {"name": "rcvport", "type": "int"},
      {"name": "sndts", "type": "double"},
      {"name": "snderr", "type": "float"},
      {"name": "rcverr", "type": "float"},
      {"name": "delay", "type": "float"}
    ]
  }
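To make the schema concrete, the sketch below parses it as JSON and checks a Python dict against the declared field types. It uses only the standard library and is not the Avro library's own validation; the `validate` helper, the type mapping, and the sample values are assumptions made for illustration.

```python
import json

# The owamp record schema from the slide, written as valid JSON.
OW_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ow_record",
  "fields": [
    {"name": "index",   "type": "int"},
    {"name": "seqno",   "type": "int"},
    {"name": "sndip",   "type": "int"},
    {"name": "sndport", "type": "int"},
    {"name": "rcvip",   "type": "bytes"},
    {"name": "rcvport", "type": "int"},
    {"name": "sndts",   "type": "double"},
    {"name": "snderr",  "type": "float"},
    {"name": "rcverr",  "type": "float"},
    {"name": "delay",   "type": "float"}
  ]
}
""")

# Assumed mapping from the Avro primitive names used above to Python types.
PYTHON_TYPES = {"int": int, "double": float, "float": float, "bytes": bytes}


def validate(record, schema):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], PYTHON_TYPES[avro_type]):
            got = type(record[name]).__name__
            problems.append(f"{name}: expected {avro_type}, got {got}")
    return problems


# Made-up sample record, for illustration only.
SAMPLE = {
    "index": 0, "seqno": 1,
    "sndip": 167772161, "sndport": 8760,
    "rcvip": bytes([10, 0, 0, 2]), "rcvport": 8761,
    "sndts": 1298879930.0, "snderr": 0.001, "rcverr": 0.002,
    "delay": 0.0277492,
}

if __name__ == "__main__":
    print(validate(SAMPLE, OW_SCHEMA))
```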

Pig, the working example
  Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
    Ease of programming.
    Optimization opportunities.
    Extensibility.
  www.pig.apache.org

Pig Latin
  Script structure
    LOAD: data
    FILTER: performed as early as possible
    GROUP: keys of map-reduce
    GENERATE: aggregations and evaluations
    STORE

Queries performed with Pig
  Count number of negative delay values
    an indication of a clock synchronization problem
  Basic delay statistics
    min, max, mean, and variance
  Delay histogram
    similar to owstats output
  Different time scales
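As an illustration of the delay histogram query, here is a minimal Python sketch of fixed-width binning. It is not the Pig script from the talk and does not reproduce the owstats output format; the function name, the 1 ms default bin width, and the sample delays are assumptions.

```python
import math
from collections import Counter


def delay_histogram(delays, bin_width=0.001):
    """Count delays per fixed-width bin.

    Returns {bin_index: count}, where bin_index * bin_width is the lower
    edge of the bin. Negative delays land in negative bin indices.
    """
    counts = Counter()
    for d in delays:
        counts[math.floor(d / bin_width)] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [0.0012, 0.0018, 0.0031, -0.0004]  # made-up delays in seconds
    for bin_index, n in delay_histogram(sample).items():
        print(f"{bin_index * 0.001:+.3f}s  {n}")
```

Keying the result by integer bin index rather than by a floating-point edge keeps the grouping key exact, which matters when the key is later used for a group-by.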

Time series
  Original key (groups): <src_domain> <dst_domain> <src_ip> <dst_ip>
  Value to aggregate: mean delay
  Time is added as part of the key by masking to a certain precision
  Example
    1298879930000, 0.0277492
    1298879940000, 0.0543387
    1298879980000, 0.037
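The masking step can be sketched in Python as follows. The example timestamps above look like milliseconds rounded down to 10-second boundaries, so the sketch assumes millisecond timestamps and a 10,000 ms default precision; the function names and the tuple layout of `samples` are illustrative.

```python
from collections import defaultdict


def mask_time(ts_ms, precision_ms=10_000):
    """Round a millisecond timestamp down to a multiple of precision_ms."""
    return ts_ms - (ts_ms % precision_ms)


def mean_delay_series(samples, precision_ms=10_000):
    """Mean delay per (key, masked time).

    samples: iterable of (key, ts_ms, delay), where key stands for the
    original group key (src_domain, dst_domain, src_ip, dst_ip).
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [sum of delays, count]
    for key, ts_ms, delay in samples:
        bucket = (key, mask_time(ts_ms, precision_ms))
        sums[bucket][0] += delay
        sums[bucket][1] += 1
    return {bucket: total / n for bucket, (total, n) in sums.items()}


if __name__ == "__main__":
    key = ("a.example", "b.example", "10.0.0.1", "10.0.0.2")  # made-up key
    data = [(key, 1298879931000, 0.02), (key, 1298879939000, 0.04)]
    print(mean_delay_series(data))
```

Changing `precision_ms` (for example to 60,000 for one-minute buckets) yields the same series at a coarser time scale without touching the rest of the pipeline.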

Count negative delay values
  LOAD dataset
  Filter by negative numbers
  Project only relevant values, masking time at the same step
  Group by key, <src_domain> <dst_domain> <src_ip> <dst_ip>
  Aggregate mean delay
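The steps above can be mirrored in plain Python to show the data flow: filter, project with time masking, group, aggregate. This is a sketch, not the Pig Latin script used in the talk. The row layout (dicts with `src_domain`, `dst_domain`, `src_ip`, `dst_ip`, `ts_ms`, `delay`) is hypothetical, and aggregating a count in the final step is an assumption taken from the slide title.

```python
from collections import Counter


def count_negative_delays(rows, precision_ms=10_000):
    """Count negative-delay samples per (group key, masked time)."""
    counts = Counter()
    for row in rows:
        # FILTER: keep only negative delays, as early as possible.
        if row["delay"] >= 0:
            continue
        # Project only the key fields, masking time in the same step.
        masked_ts = row["ts_ms"] - (row["ts_ms"] % precision_ms)
        key = (row["src_domain"], row["dst_domain"],
               row["src_ip"], row["dst_ip"], masked_ts)
        # GROUP by key and aggregate (here: a count).
        counts[key] += 1
    return dict(counts)


if __name__ == "__main__":
    base = {"src_domain": "a.example", "dst_domain": "b.example",
            "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2"}  # made-up values
    rows = [
        dict(base, ts_ms=1298879931000, delay=-0.001),
        dict(base, ts_ms=1298879935000, delay=-0.002),
        dict(base, ts_ms=1298879936000, delay=0.030),
    ]
    print(count_negative_delays(rows))
```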

Delay histogram (owstats)
  Demo ready!

Count negative delay values

Count negative delay values
  SALT negative counts

Delay statistics
  LOAD dataset
  Filter by negative numbers
  Project only relevant values, masking time at the same step, generating delay*delay
  Group by key, <src_domain> <dst_domain> <src_ip> <dst_ip>
  Aggregate statistics

Variance
  Variance formula
  Knowing statistics
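The formula itself did not survive on this slide. Since the delay statistics pipeline generates delay*delay before grouping, one plausible reading is the sum-of-squares identity, variance = mean(x^2) - mean(x)^2, which needs only per-group count, sum, and sum of squares. The sketch below assumes that identity and the population (not sample) variance; `summarize` is a hypothetical helper.

```python
import math


def summarize(delays):
    """One-pass min, max, mean, and population variance.

    Only running aggregates are kept, which is what a group-by
    aggregation can provide: count, sum, and sum of squares.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    lo = math.inf
    hi = -math.inf
    for d in delays:
        n += 1
        total += d
        total_sq += d * d  # the projected delay*delay column
        lo = min(lo, d)
        hi = max(hi, d)
    if n == 0:
        raise ValueError("no delay values")
    mean = total / n
    # E[x^2] - E[x]^2; clamp at zero to absorb floating-point round-off.
    variance = max(total_sq / n - mean * mean, 0.0)
    return {"min": lo, "max": hi, "mean": mean, "variance": variance}


if __name__ == "__main__":
    print(summarize([1.0, 2.0, 3.0, 4.0]))
```

This form is convenient for map-reduce because the three sums combine across partitions, though it can lose precision when the variance is small relative to the mean.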

Mean delay at different time scales

Outlier detection

Outlier detection

Performance
  14 machines
  Fastest: 12-core, Intel(R) Core(TM) i7 CPU

Performance

Summary
  Trade-offs
    - Hadoop configuration and tweaking
    - data preprocessing
    - structured formats
    - support for advanced calculation
    + reliability and fault-tolerance
    + usability
    + speed
  Hadoop requires time & effort to set up, but it's worth it.

Future directions
  More Pig Latin options: UDF
  Other tools: HBase, RHIPE (http://www.stat.purdue.edu/~sguha/rhipe/)
  Usability study

Thank you!