Scalable Network Measurement Analysis with Hadoop
  Taghrid Samak and Daniel Gunter
  Advanced Computing for Sciences, LBNL
Outline
  Motivation
  Hadoop overview
  Approach
    Doing the right thing: Avro
    What worked: Pig
  Results
  Summary
JointTechs, Summer 2011
Hadoop overview: Hadoop ecosystem
  ZooKeeper: coordination
  Pig: data flow
  Hive: batch SQL
  Hadoop MapReduce: job scheduling & raw data processing
  HBase: real-time querying
  HDFS (Hadoop Distributed File System): unstructured storage
  Sqoop: data import
  Avro: serialization framework
  www.hadoop.apache.org
  www.cloudera.com
For analysts
  Hadoop supports:
    Line-oriented format
    Large text files
    Uniform structure, or known format
  Other cases: write your own Java classes (e.g. record reader, file splitter)
Internet2 owamp logs
Owamp logs
  Binary files with header
  Line format:
    Index: int, Seqno: int
    SendIP: bytes, RecvIP: bytes
    sendts: double
    senderr: float, RecvErr: float
    Delay: float
Challenges
  Hadoop support for binary file formats
  100GB of owamp test results, spread over 300K small files
Solutions
  Avro, the right way
  Preprocessing binary files to .csv, the easy way
  Whole file reader
  Streaming
Avro, first attempt
  Avro is a data serialization system. Avro provides:
    Rich data structures.
    A compact, fast, binary data format.
    A container file, to store persistent data.
    Remote procedure call (RPC).
    Simple integration with dynamic languages.
  www.avro.apache.org
Avro, owamp schema
{
  "type": "record",
  "name": "ow_record",
  "fields": [
    { "name": "index",   "type": "int" },
    { "name": "seqno",   "type": "int" },
    { "name": "sndip",   "type": "int" },
    { "name": "sndport", "type": "int" },
    { "name": "rcvip",   "type": "bytes" },
    { "name": "rcvport", "type": "int" },
    { "name": "sndts",   "type": "double" },
    { "name": "snderr",  "type": "float" },
    { "name": "rcverr",  "type": "float" },
    { "name": "delay",   "type": "float" }
  ]
}
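A minimal sketch, using only the Python standard library, of how the schema on this slide can be loaded as JSON and used to sanity-check a record before serialization. This is not the Avro library's API: `OWAMP_SCHEMA`, `PY_TYPES`, and `conforms` are names invented for illustration, the type mapping is an assumption, and the sample record values are made up.

```python
import json

# The owamp record schema from the slide, written out as valid JSON.
OWAMP_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ow_record",
  "fields": [
    {"name": "index",   "type": "int"},
    {"name": "seqno",   "type": "int"},
    {"name": "sndip",   "type": "int"},
    {"name": "sndport", "type": "int"},
    {"name": "rcvip",   "type": "bytes"},
    {"name": "rcvport", "type": "int"},
    {"name": "sndts",   "type": "double"},
    {"name": "snderr",  "type": "float"},
    {"name": "rcverr",  "type": "float"},
    {"name": "delay",   "type": "float"}
  ]
}
""")

# Assumed mapping from the Avro primitive names used above to Python types.
PY_TYPES = {
    "int": (int,),
    "float": (int, float),
    "double": (int, float),
    "bytes": (bytes,),
}


def conforms(record, schema=OWAMP_SCHEMA):
    """Return True if `record` has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], PY_TYPES[f["type"]]) for f in fields)


# Hypothetical record; the values are illustrative only.
sample = {
    "index": 0, "seqno": 1, "sndip": 0, "sndport": 8760,
    "rcvip": b"\x0a\x00\x00\x01", "rcvport": 8761,
    "sndts": 1298879930.0, "snderr": 0.0001, "rcverr": 0.0001,
    "delay": 0.0277492,
}
```

A check like this only catches missing fields and obvious type mismatches; actual encoding to Avro's binary format would be done by an Avro implementation.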
Pig, the working example
  Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
    Ease of programming.
    Optimization opportunities.
    Extensibility.
  www.pig.apache.org
Pig Latin script structure
  LOAD: data
  FILTER: performed as early as possible
  GROUP: keys of map-reduce
  GENERATE: aggregations and evaluations
  STORE
Queries performed with Pig
  Count number of negative delay values: an indication of a clock synchronization problem
  Basic delay statistics: min, max, mean, and variance
  Delay histogram: similar to owstats output
  Different time scales
Time series
  Original key (groups): <src_domain> <dst_domain> <src_ip> <dst_ip>
  Value to aggregate: mean delay
  Time is added as part of the key by masking to a certain precision
  Example:
    1298879930000, 0.0277492
    1298879940000, 0.0543387
    1298879980000, 0.037
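The masking step can be sketched in Python as follows. This is an illustration under stated assumptions, not the Pig script from the talk: timestamps are taken to be integer milliseconds since the epoch (which the example values suggest), and `mask_time`, `mean_delay_by_bucket`, and the row layout are hypothetical names and structures.

```python
from collections import defaultdict


def mask_time(ts_ms, precision_ms):
    """Round a millisecond timestamp down to a multiple of precision_ms."""
    return ts_ms - ts_ms % precision_ms


def mean_delay_by_bucket(rows, precision_ms):
    """Mean delay per (src_domain, dst_domain, src_ip, dst_ip, masked time).

    Each row is assumed to be a tuple:
    (src_domain, dst_domain, src_ip, dst_ip, ts_ms, delay).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, count]
    for src_domain, dst_domain, src_ip, dst_ip, ts_ms, delay in rows:
        key = (src_domain, dst_domain, src_ip, dst_ip,
               mask_time(ts_ms, precision_ms))
        sums[key][0] += delay
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}
```

Changing `precision_ms` (for example 10_000 for ten seconds or 60_000 for one minute) changes the time scale of the resulting series without touching the rest of the grouping key.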
Count negative delay values
  LOAD dataset
  Filter by negative numbers
  Project only relevant values, masking time at the same step
  Group by key, <src_domain> <dst_domain> <src_ip> <dst_ip>
  Aggregate mean delay
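The same sequence of steps, written as a small Python function rather than Pig Latin, might look like the sketch below. The CSV column order, the millisecond timestamps, and the name `count_negative_delays` are assumptions made for illustration. The aggregate here is a count per group, following the slide title; swapping in a mean would only change the last step.

```python
import csv
from collections import Counter


def count_negative_delays(lines, precision_ms=10_000):
    """Count negative delays per (src_domain, dst_domain, src_ip, dst_ip, time bucket).

    `lines` is any iterable of CSV lines with the assumed column order:
    src_domain, dst_domain, src_ip, dst_ip, ts_ms, delay
    """
    counts = Counter()
    # LOAD: parse the CSV lines.
    for src_domain, dst_domain, src_ip, dst_ip, ts_ms, delay in csv.reader(lines):
        # FILTER: keep only negative delays, as early as possible.
        if float(delay) >= 0:
            continue
        # Project and mask time in the same step.
        ts = int(ts_ms)
        bucket = ts - ts % precision_ms
        # GROUP by key and aggregate (count).
        counts[(src_domain, dst_domain, src_ip, dst_ip, bucket)] += 1
    return counts
```

Filtering before grouping mirrors the "as early as possible" advice for FILTER: rows that cannot contribute never reach the grouping step.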
Delay histogram (owstats)
  Demo ready!
Count negative delay values
  SALT negative counts
Delay statistics
  LOAD dataset
  Filter by negative numbers
  Project only relevant values, masking time at the same step, generating delay*delay
  Group by key, <src_domain> <dst_domain> <src_ip> <dst_ip>
  Aggregate statistics
Variance Variance formula Knowing statistics
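The formula itself is not recoverable from this slide. One identity that fits the `delay*delay` projection on the previous slide is Var(x) = E[x^2] - (E[x])^2, which needs only the count, the sum, and the sum of squares per group. The sketch below assumes that identity and computes the population variance; `delay_stats` is a hypothetical helper, not code from the talk.

```python
from math import inf


def delay_stats(delays):
    """One-pass min, max, mean, and population variance of an iterable of delays."""
    n = 0
    total = 0.0      # running sum of delay
    total_sq = 0.0   # running sum of delay*delay
    lo, hi = inf, -inf
    for d in delays:
        n += 1
        total += d
        total_sq += d * d
        lo = min(lo, d)
        hi = max(hi, d)
    if n == 0:
        raise ValueError("no delay values")
    mean = total / n
    # Var(x) = E[x^2] - (E[x])^2; clamp at zero to absorb rounding error.
    variance = max(total_sq / n - mean * mean, 0.0)
    return {"count": n, "min": lo, "max": hi, "mean": mean, "variance": variance}
```

Because only sums are accumulated, partial results from different workers can be added together before the final division, which is what makes this form convenient for a map-reduce aggregation. The trade-off is some loss of precision when the mean is large relative to the spread.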
Mean delay at different time scales
Outlier detection
Performance
  14 machines
  Fastest: 12-core, Intel(R) Core(TM) i7 CPU
Summary
  Trade-offs
    - Hadoop configuration and tweaking
    - data preprocessing
    - structured formats
    - support for advanced calculation
    + reliability and fault-tolerance
    + usability
    + speed
  Hadoop requires time & effort to set up, but it's worth it.
Future directions
  More Pig Latin options: UDF
  Other tools: HBase, RHIPE (http://www.stat.purdue.edu/~sguha/rhipe/)
  Usability study
Thank you!