Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

1 Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee {yhlee06, Chungnam National University, Korea January 8, 2013 FloCon 2013

2 Contents: Introduction; Overview; Hadoop-based traffic processing tool; Evaluation; Summary

3 INTRODUCTION

4 Internet Measurement Challenges: scalability, fault-tolerant system, extensibility. CAIDA data: capture, curation, storage, search, sharing, analysis, and visualization. Ark topology: 1.8 TB; Telescope: 102 TB; packet headers: 18.8 TB. Source: Josh Polterock, "CAIDA: A Data Sharing Case Study," Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections workshop.

5 Harness Distributed Computing and Storage? Google MapReduce, PB sorting by Google: 2008: 6 hours and 2 minutes on 4,000 computers; 2011: 33 minutes on 8,000 computers; 2011: 10 PB on 8,000 computers in 6 hours and 27 minutes. Apache Hadoop project.

6 Our Proposal: Hadoop-based Traffic Measurement and Analysis Platform. Diagram labels: Administrator, NetFlow v5, Packet, Web Visualizer / Hive, Slave, Master, Traffic Collector, Pcap I/O, Traffic Analyzer, Traffic Analysis Mapper & Reducer, Bin I/O, NetFlow I/O, HDFS, Hadoop. References: 1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2. Yeonhee Lee and Youngseok Lee, "A Hadoop-based Packet Trace Processing Tool," TMA, April. 3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop," ACM CoNEXT Student Workshop, Dec.

7 Related Work: Traffic analysis of DNS root server (RIPE); PacketPig - Big Data Security Analytics platform; Sherpasurfing: Open Source Cyber Security Solution, Hadoop World 2011 (firewall/IDS logs, netflow/packet); Performing Network and Security Analytics with Hadoop (Travis Dawson, Narus), Hadoop Summit 2012; Distributed Bro (IDS)

8 OVERVIEW

9 Hadoop-based NetFlow Analysis. Diagram labels: Collect & Analysis, NetFlow, NetFlow Distributer

10 User Interface (Hive, Web): Web UI, CLI, monitor, query. Data Processing (HDFS, MapReduce, Hive): Traffic Analyzer with Hive QL queries for traffic analysis (Scan IP query, Spoofed IP query, Heavy User query, user-defined query); MapReduce for traffic analysis (IP analysis MR, TCP analysis MR, HTTP analysis MR, DDoS analysis MR, NetFlow analysis MR); IO formats (Pcap InputFormat, Binary Input/OutputFormat, Text Input/OutputFormat); HDFS, Hadoop. Data Source (Jpcap, HDFS): Traffic Collector & Loader for Packet and NetFlow; Distributer.

11 HADOOP-BASED TRAFFIC ANALYSIS

12 Challenges: 1. Data handling issue in HDFS 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop testbed. Scalability (~TB/PB), fault tolerance, distributed computation.

13 Challenges: 1. Data handling issue in Hadoop 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop testbed

14 Block-level Parallelism: block-level processing vs. file-level processing. [Figure: HDFS Block2 and Block3 (64 MB each) with a hex dump of the stored data; excerpt: 2B AD 4C 38 A4 04 00 5C 00 00 00 5C 00 00 00 FF FF 00 21 B5]
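
Binary traces have no line delimiters, so a task that starts reading at an arbitrary HDFS block offset has to locate the first record boundary itself. The following is a minimal sketch of one way to do that, assuming classic little-endian libpcap record headers (16 bytes: ts_sec, ts_usec, caplen, len), which is consistent with the hex excerpt on the slide. The plausibility checks, the time window, and all function names are hypothetical and are not taken from the authors' tool.

```python
import struct

PCAP_REC_HDR = struct.Struct("<IIII")  # ts_sec, ts_usec, caplen, wire length
MAX_SNAPLEN = 65535  # assumed upper bound on a captured packet


def plausible_header(buf, off, t_min, t_max):
    """Return caplen if a believable record header starts at off, else None."""
    if off + PCAP_REC_HDR.size > len(buf):
        return None
    ts_sec, ts_usec, caplen, wirelen = PCAP_REC_HDR.unpack_from(buf, off)
    ok = (t_min <= ts_sec <= t_max and ts_usec < 1_000_000
          and 0 < caplen <= wirelen <= MAX_SNAPLEN)
    return caplen if ok else None


def find_first_record(buf, t_min, t_max):
    """Scan byte by byte for the first offset where two consecutive
    headers both look valid, which reduces false positives."""
    for off in range(len(buf)):
        caplen = plausible_header(buf, off, t_min, t_max)
        if caplen is None:
            continue
        nxt = off + PCAP_REC_HDR.size + caplen
        if plausible_header(buf, nxt, t_min, t_max) is not None:
            return off
    return None


def make_record(ts_sec, payload):
    """Build one synthetic pcap record for the demo below."""
    return PCAP_REC_HDR.pack(ts_sec, 0, len(payload), len(payload)) + payload


# A block that starts mid-packet: 7 stray bytes, then three whole records.
block = b"\xff" * 7 + b"".join(
    make_record(1_285_000_000 + i, b"\x00" * 92) for i in range(3))
start = find_first_record(block, 1_284_000_000, 1_286_000_000)
```

In this sketch the caller supplies the capture time window; a tighter window makes a false match inside packet payload less likely, at the cost of needing that metadata up front.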

15 Block-level IO vs. File-level IO. [Figure: completion time (min) and speedup vs. # of nodes for IP Analysis_blockIO and IP Analysis_fileIO; speedup is measured against file IO.]

16 Challenges: 1. Data handling issue in Hadoop 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop testbed

17 Aggregation. DistributedCache: filtering rule "cnu;srcip=" and aggregation rule "as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;srcsubnet;dstsubnet;srcport;dstport;". Map phase (HDFS block IO): IP/UDP packet (NetFlow v5 header, v5 records), packet identification, decoding, filtering, group-key generation (K: time, AS; V: count). Reduce phase: aggregation of counts per AS, port, protocol, and subnet, each reported as # of octets, # of packets, and # of flows, written to HDFS.
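
The map and reduce steps above can be sketched as follows. This is a minimal illustration in plain Python rather than Hadoop code: it decodes the standard NetFlow v5 layout (24-byte header, 48-byte records), emits a (time bin, group key) key with (octets, packets, flows) values, and sums them in a reducer. The 5-minute bin, the function names, and the `key_field` parameter are assumptions for illustration, not the authors' implementation.

```python
import struct
from collections import defaultdict

V5_HEADER = struct.Struct("!HHIIIIBBH")             # 24-byte export header
V5_RECORD = struct.Struct("!IIIHHIIIIHHBBBBHHBBH")  # 48-byte flow record
BIN_SECS = 300  # assumed 5-minute bin for the "time" part of the key


def map_v5_packet(payload, key_field="srcas"):
    """Map step: decode one export packet and emit
    ((time_bin, group key), (octets, packets, flows)) per record."""
    version, count, _uptime, unix_secs, *_rest = V5_HEADER.unpack_from(payload, 0)
    if version != 5:
        return
    time_bin = unix_secs - unix_secs % BIN_SECS
    for i in range(count):
        f = V5_RECORD.unpack_from(payload, V5_HEADER.size + i * V5_RECORD.size)
        fields = {"srcip": f[0], "dstip": f[1], "srcport": f[9],
                  "dstport": f[10], "protocol": f[13],
                  "srcas": f[15], "dstas": f[16]}
        yield (time_bin, fields[key_field]), (f[6], f[5], 1)


def reduce_counts(pairs):
    """Reduce step: sum octets, packets, and flows per key."""
    totals = defaultdict(lambda: [0, 0, 0])
    for key, (octets, packets, flows) in pairs:
        t = totals[key]
        t[0] += octets
        t[1] += packets
        t[2] += flows
    return dict(totals)


def make_packet(unix_secs, flows):
    """Build a synthetic export packet; each flow is (srcip, dstip,
    packets, octets, srcport, dstport, protocol, srcas, dstas)."""
    body = b"".join(
        V5_RECORD.pack(s, d, 0, 0, 0, pk, oc, 0, 0, sp, dp, 0, 0, pr, 0,
                       sa, da, 0, 0, 0)
        for s, d, pk, oc, sp, dp, pr, sa, da in flows)
    return V5_HEADER.pack(5, len(flows), 0, unix_secs, 0, 0, 0, 0, 0) + body


packet = make_packet(1_357_603_200, [
    (0x0A000001, 0x0A000002, 10, 1500, 1234, 80, 6, 100, 200),
    (0x0A000003, 0x0A000002, 5, 500, 1235, 80, 6, 100, 300),
])
totals = reduce_counts(map_v5_packet(packet, "srcas"))
```

Switching `key_field` to `"dstport"` or `"protocol"` gives the per-port and per-protocol tables with the same reducer.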

18 Anomaly Detection. DistributedCache detection rules: "port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20-" and "syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-". Map phase (HDFS block IO): IP/UDP packet (NetFlow v5 header, v5 records), packet identification, decoding, pattern matching, group-key generation. Shuffle & sort: partitioning and group sort. Reduce phase: aggregation and detection (syn_flood, PortScan), written to HDFS.
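
The pipeline above (pattern matching, key generation, group sort, detection) can be illustrated with a small port-scan example. The slide does not define how a condition such as "pkts=20-" is evaluated, so this sketch makes its own assumption: a TCP source is flagged once it has contacted at least `min_ports` distinct destination ports. All names are hypothetical, and the shuffle is simulated in memory.

```python
from itertools import groupby

TCP = 6


def map_flows(flows):
    """Pattern matching and key generation: keep TCP flows and
    emit (srcip, dstport) pairs."""
    for srcip, dstport, proto in flows:
        if proto == TCP:
            yield (srcip, dstport)


def shuffle_and_sort(pairs):
    """Stand-in for the shuffle: sort on the full key, group by srcip."""
    for srcip, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield srcip, [dstport for _, dstport in group]


def reduce_detect(grouped, min_ports=20):
    """Detection: flag sources that touched at least min_ports distinct
    destination ports (an assumed threshold, not the authors' rule)."""
    for srcip, ports in grouped:
        distinct = len(set(ports))
        if distinct >= min_ports:
            yield srcip, distinct


# One scanner probing 25 ports, one busy web client, one UDP flow.
flows = [("10.0.0.9", port, 6) for port in range(1, 26)]
flows += [("10.0.0.5", 80, 6)] * 30 + [("10.0.0.7", 53, 17)]
alerts = list(reduce_detect(shuffle_and_sort(map_flows(flows))))
```

Grouping on srcip while sorting on (srcip, dstport) mirrors the separate map-output key and partition/group-sort key in the rule, so the reducer sees each source's ports in order.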

19 Challenges: 1. Data handling issue in Hadoop 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop testbed

20 Performance Tuning. Configuration: Hadoop IO buffer (128K 1 MB), Java heap space (300 MB MB), # of MapReduce slots ( # of cores). MapReduce algorithm: normal combiner vs. in-mapper combiner. Job scheduling.
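
The difference between a normal combiner and an in-mapper combiner can be shown with a small counting example. In the plain version the mapper emits one pair per record and relies on a later combine step; with in-mapper combining the mapper keeps partial sums in memory and emits one pair per distinct key. This is a generic sketch of the pattern in plain Python with hypothetical names, not the authors' Hadoop code.

```python
from collections import Counter


def mapper_plain(records):
    """Emit one (key, 1) pair per record; a separate combiner would
    merge these pairs after they have been produced."""
    for key in records:
        yield key, 1


def mapper_in_mapper_combining(records):
    """Accumulate partial counts in memory and emit one pair per
    distinct key when the input split is exhausted."""
    partial = Counter()
    for key in records:
        partial[key] += 1
    yield from partial.items()


def reducer(pairs):
    """Sum the values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


split = ["tcp", "udp", "tcp", "tcp", "icmp", "udp"]
plain = list(mapper_plain(split))
combined = list(mapper_in_mapper_combining(split))
```

Both mappers lead to the same reducer output; the in-mapper variant emits fewer intermediate pairs but must hold its partial table in the task's heap, which is one reason heap size appears among the tuning knobs.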

21 Job Scheduling. Different job types: periodic jobs (for monitoring) need guaranteed service within time, e.g. Aggregated Statistics for monitoring and the Flow Parse job for analytics; small ad-hoc query jobs (for analytics) need fast response time. [Figure: the collector collects every 5 minutes. FIFO scheduling row: Basic Statistics, Flow Parse, ad-hoc query, repeated per interval. Fair Scheduler rows: Basic Statistics and Flow Parse (periodic jobs) in one row, ad-hoc queries (ad-hoc jobs) in a separate row.]
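
To see why the scheduler choice matters for a small ad-hoc query, the toy model below compares completion times under FIFO and under an idealised fair share. It assumes all jobs are submitted together, that the cluster is one divisible resource, and that fair sharing gives every unfinished job an equal slice; the real Hadoop Fair Scheduler works on task slots and pools, so this is only an approximation. Job sizes are made-up numbers in cluster-minutes.

```python
def fifo_completion(jobs):
    """Jobs run one at a time, in submission order, on the whole cluster."""
    done, clock = {}, 0.0
    for name, work in jobs:
        clock += work
        done[name] = clock
    return done


def fair_completion(jobs):
    """Idealised fair sharing: every unfinished job gets an equal slice
    of the cluster until it completes."""
    remaining = dict(jobs)
    done, clock = {}, 0.0
    while remaining:
        smallest = min(remaining.values())
        # With n jobs sharing equally, `smallest` units of work per job
        # take smallest * n units of wall-clock time.
        clock += smallest * len(remaining)
        for name in list(remaining):
            remaining[name] -= smallest
            if remaining[name] <= 0:
                done[name] = clock
                del remaining[name]
    return done


# Hypothetical workloads, submitted in this order.
jobs = [("basic_statistics", 2.0), ("flow_parse", 6.0), ("adhoc_query", 1.0)]
fifo = fifo_completion(jobs)
fair = fair_completion(jobs)
```

In this toy case the ad-hoc query finishes at minute 9 under FIFO and at minute 3 under fair sharing, while the periodic jobs finish somewhat later than they would under FIFO.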

22 PERFORMANCE EVALUATION

23 Experiments
Testbed (columns: Type, Nodes, Cores, CPU, Memory, HardDisk, Rack):
Small: GHz 8 core, 16 GB, 2 TB, 1 Rack
Medium: GHz 8 core, 16 GB, 4 TB, 1 Rack
Large: GHz 2 core, 2 GB, 500 GB, 4 Racks
Data and MapReduce jobs (columns: Type, Dataset, MapReduce Job, Testbed):
NetFlow: 1 TB from KOREN; flowstats, flowdetect, flowprint; Small
Packet: 1 ~ 5 TB from CNU campus N/W; IP, TCP, Web (webpop, User Behavior, DDoS); Medium, Large

24 NetFlow: SpeedUp (vs. Flowtools). [Figure: speedup vs. # of nodes for flowstats vs. Flowtools and flowprint vs. FlowTools.] Baseline commands: FlowPrint: flow-cat -p flowfile | flow-print -f14. FlowStats: flow-cat -p flowfile | flow-stat -f12 and flow-cat -p flowfile | flow-stat -f5.

25 NetFlow: Scalability. [Figure: job completion time (min) and speedup (vs. 5 nodes) against # of nodes for flowstats and flowdetect.]
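
Speedup relative to a baseline cluster size, as plotted here against 5 nodes, is the baseline completion time divided by the completion time at the larger size. The small sketch below also derives a parallel efficiency figure; the function names and the numbers in the example are hypothetical and are not measurements from the slides.

```python
def speedup(baseline_minutes, scaled_minutes):
    """Speedup of the scaled run relative to the baseline run."""
    return baseline_minutes / scaled_minutes


def parallel_efficiency(baseline_nodes, nodes, baseline_minutes, scaled_minutes):
    """Speedup divided by the factor by which the cluster grew;
    1.0 would mean perfectly linear scaling."""
    return speedup(baseline_minutes, scaled_minutes) / (nodes / baseline_nodes)


# Made-up example: 60 minutes on 5 nodes, 20 minutes on 20 nodes.
s = speedup(60.0, 20.0)
e = parallel_efficiency(5, 20, 60.0, 20.0)
```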

26 NetFlow: Pattern Matching Result. [Figure: NetFlow record distribution, # of records (M), and matched-pattern count by date in 2012 (x-axis labels include 9/8/2012, 9/18/2012).] Matched patterns: worm.sasser,w32.sasser; remote_administrator; vnc; w32.witty.worm; worm.opasoft,w32.opaserv.worm; code_red_worm; netfairy; kamun; emule; shockwave_killer; worm.killmsblast,w32.nachi.worm,w32.welchia.worm.

27 Packet: ScaleOut. [Figure: completion time (min) and throughput (Gbps) vs. # of nodes for IP Analysis, TCP Analysis, WebPop, UserBehavior, and DDoS.]
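
The throughput axis in these charts relates data size to completion time. A minimal sketch of that conversion follows, assuming 1 TB means 10^12 bytes and 1 Gbps means 10^9 bits per second; the slides do not state which convention was used, and the example figures are made up.

```python
TB = 10**12  # assumed decimal terabyte


def throughput_gbps(data_bytes, seconds):
    """Average processing rate in gigabits per second."""
    return data_bytes * 8 / seconds / 1e9


# Made-up example: 1 TB processed in 10 minutes.
rate = throughput_gbps(1 * TB, 10 * 60)
```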

28 Packet: SizeUp (30 nodes). [Figure: completion time (sec) and throughput (Gbps) vs. data size (1 TB to 5 TB) for IP Analysis, IP Analysis_ripe, TCPStats, Webpop, UserBehavior, and DDoS.]

29 Packet: SizeUp (200 nodes). [Figure: completion time (sec) and throughput (Gbps) vs. data size (up to 5 TB) for IP Analysis, TCP Analysis, Webpop, UserBehavior, and DDoS.]

30 SUMMARY

31 Summary. NetFlow analysis with Hadoop: NetFlow v5 processing module; MapReduce algorithms: statistics. Distributed computing and storage with Hadoop: fits Internet measurement application; scalability. Source codes are available at Packet, NetFlow h/hadoop

32 Ongoing Work. Distributed real-time monitoring: rule matching for streamed NetFlow (developing rule for MapReduce; rule classification for dedicated rule matching); scalable collection, e.g. 10GE 10 X 1 GE HDFS. Productivity: integration; streaming packages. Enhanced analytics: data mining (Mahout); machine learning. Performance. Diagram labels: Pig, RHive, RHadoop, Hive, MapReduce, HDFS, Rhipe, Mahout.

33 Reference Papers: 1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace Processing Tool," The Third TMA, April. 3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop," ACM CoNEXT Student Workshop, Dec. 2011. Software

34 THANK YOU!

35 [Figure: resource usage (%) over time (Tics) for IP Analysis and TCP Analysis; series: mem_usage, disk_read_usage, disk_write_usage, cpu_usage, net_in_usage, net_out_usage.]

36 Diagram labels: Collector, TaskTracker, Datanode, DBMS, LoadBalancer, PF_RING, RAID, JobTracker, NameNode

37 Diagram labels: Internet, load balancer, network interface (PF_RING / RSS), Input Format, Mapper, Reducer; packets, NetFlow record, each statistics, aggregated statistics, table records, visualized statistics, on-the-fly. Rule format: rule name; filter pattern; mapout key; partition&groupsort key; detection condition; action. Examples: port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20- and syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-
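
The semicolon-separated rule format on this slide can be split into its named parts with a few lines of code. The sketch below only tokenises a rule; it does not interpret conditions such as "pkts=20-", whose semantics the slides do not spell out. The `Rule` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    filter_pattern: list   # e.g. ["ip", "proto=6"]
    mapout_key: list       # fields of the map output key
    group_key: list        # partition and group-sort key fields
    condition: str         # detection condition, kept as raw text
    action: str = ""       # optional; absent in the slide's examples


def split_fields(text):
    """Split a comma-separated part, dropping empty items."""
    return [item for item in text.split(",") if item]


def parse_rule(line):
    """Parse 'name;filter;mapout key;group key;condition;action'."""
    parts = [part.strip() for part in line.strip().split(";")]
    parts += [""] * (6 - len(parts))  # pad missing trailing parts
    name, flt, mapout, group, condition, action = parts[:6]
    return Rule(name, split_fields(flt), split_fields(mapout),
                split_fields(group), condition, action)


port_scan = parse_rule("port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20-")
syn_flood = parse_rule(
    "syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-")
```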
