Large-Scale TCP Packet Flow Analysis for Common Protocols Using Apache Hadoop




R. David Idol
Department of Computer Science
University of North Carolina at Chapel Hill
david.idol@unc.edu
http://www.cs.unc.edu/~mxrider

Abstract

Data is commonly exchanged between hosts over the Internet using the Transmission Control Protocol (TCP) and the Internet Protocol (IP). TCP, built on top of IP, associates each piece of data (packet) sent between two applications on different hosts with an ongoing connection, or flow, between these hosts. This paper presents an overview of TCP flows, a methodology for detecting flows in a large network traffic trace file, and the results of analyzing a 20GB trace. The analysis reports the average duration of a flow, along with the average number of packets sent, number of bytes sent, idle time, and throughput during a flow. These results are further categorized by known application protocols, and the differing characteristics of each protocol are analyzed.

1. Introduction

The Transmission Control Protocol (TCP) is a transport-layer networking protocol used for sending and receiving data over the Internet. TCP allows data packets, or finite-sized chunks of information, to be sent between two hosts. TCP is built on top of the Internet Protocol (IP), which has its own notion of data units (called datagrams). One important aspect of IP is the concept of an IP address, a unique number (in the range of 0 to 4294967295 for IPv4) used to represent a host on the Internet. IP datagrams sent over the Internet carry the source and destination IP addresses in header fields, and these fields are used to route the datagram to the destination IP address. TCP packets add an additional routing layer on top of IP addresses, known as a port number.
A port number is a unique number (in the range of 0 to 65535, for both IPv4 and IPv6) that specifies which application running on the host should receive the packet. The source and destination port numbers are included in the TCP packet headers. The use of port numbers allows an application to receive only the data relevant to it rather than having to filter through all of the data received by the host. The routing of TCP packets to specific applications is done internally on the host by the operating system (OS), as opposed to IP address routing, which is handled by the routers along the network path.

Because the OS uses TCP port numbers to determine which packets are relevant to which applications, certain applications have port numbers reserved for them. These applications (typically) provide their own application-layer protocols on top of TCP/IP and expect packets received on their reserved ports to conform to these protocol specifications. In general, ports 0 to 1023 are special reserved ports that applications cannot use unless they implement the associated application-layer protocol, as enforced by the OS. While higher-numbered ports do not have this restriction, many applications still use a specific port number for all associated traffic (in a sense reserving that port). On the other hand, some applications choose a dynamic or random port number, so it may not be possible to determine the originating application of a packet from its port number alone.

TCP is a connection-oriented protocol, which means that an abstract connection is established between the two hosts. This connection must be created with a special handshake protocol, and once it is active, all packets sent between the two hosts are guaranteed to be delivered reliably and in the order in which they were sent. Because IP provides no such guarantees, TCP must use special mechanisms for these properties to hold. TCP packets are given sequence numbers, and every packet received by a remote host must be acknowledged by sending an ACK packet back to the original sender. If a packet is not acknowledged, or it is determined that packets were not received in the correct order, the original sender retransmits the packets as many times as needed to correct this. All packets sent during a single, ongoing TCP connection are collectively known as a flow.

This paper presents an analysis of TCP flows for several common protocols using data collected from the University of North Carolina's campus network. The approach and methodology used to gather the data and perform the analysis are discussed in Section 2. Section 3 presents the results and analysis. Section 4 concludes the paper.

2. Approach and Methodology

In order to analyze common protocol flows in a way that is both general and significant, it is important to select a proper data set for analysis. The data set should be large, to reduce the statistical weight of outliers and ensure a large sample size; it should be obtained from multiple types of users and hosts, to introduce a variety of use cases; and it should include all significant information about the flows, in order to produce meaningful analysis.
The data set chosen for the analysis in this paper is a trace of all traffic on the UNC campus network taken over a period of several hours during the evening of August 3, 2004. As such, the trace satisfies the requirement of being large (more than one million packets are recorded in the trace; the entire file exceeds 20GB in size). Given that the trace covers traffic from all users of the university's network, it arguably satisfies the requirement that many different types of users are represented, although there is clearly a bias towards the types of users present on a university network (namely students and faculty). Thus, we cannot conclude that any analysis of this trace applies directly to the general public.

The network trace was recorded in libpcap format, a binary format that contains records of all TCP/IP packet information as well as a timestamp of when each packet was sent. This format is commonly used to record network traces, as it is space-efficient and can be parsed quickly. Many common traffic-sniffing tools, such as tcpdump and Wireshark, use the libpcap format. In order to preserve anonymity and protect the privacy of the university network's users, all packet payload content was removed. As such, only the packet headers are recorded in the trace.

Because the data set is so large, traditional tools for analysis were not feasible. Placing the data into a traditional relational database on a single computer and using traditional Structured Query Language (SQL) queries, for example, would be impractical due to the sequential nature of the query operations. In order to perform fast analysis of the data, the Apache Hadoop framework was used. This framework allows for parallel processing of large data sets over a distributed system. This processing is done using the MapReduce model, in which a large data set is split into chunks, processed in parallel, and analyzed to produce outputs, which are then assembled back together [1]. More specifically, after a master process splits the input file into chunks, each chunk is given to a mapper process that maps the chunk into a collection of keys with associated values. These collections are then further processed in parallel by the reducer processes, which perform the actual analysis of the mapped data. In this case, a Hadoop cluster consisting of approximately 300 interconnected machines utilizing the MapReduce model was used to process the data.

One challenge of using Hadoop is that the input data must be easily parsed as well as partitioned. The libpcap format does not allow data partitioning at arbitrary locations; a file must be read as a whole rather than from an arbitrary starting position. This is because packets are represented as dense binary data of arbitrary length, so splitting at an arbitrary offset may split in the middle of a packet [2]. Some recent open source solutions for parsing pcap files with Hadoop do exist, such as the hadoop-pcap library [3]. After several failed attempts at using such libraries on an actual Hadoop cluster, however, an alternative solution was explored. In order to solve this problem and run quick and effective Hadoop-based analysis programs, the data set was converted to a tab-delimited, plaintext format (Figure 1). This conversion was done as a preprocessing step using the tool tshark, a part of the Wireshark command line tool suite [4].
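The partitioning problem can be illustrated with a short Python sketch. It assumes the classic libpcap layout (a 24-byte global header followed by packet records, each with a 16-byte header whose third field gives the captured length); the function names `record_offsets` and `build_demo_capture` are hypothetical and are not part of the original tooling. The only way to find where a record starts is to walk every record before it, which is why a reader cannot begin at an arbitrary offset.

```python
import struct

GLOBAL_HEADER_LEN = 24   # magic, version, timezone, sigfigs, snaplen, link type
RECORD_HEADER_LEN = 16   # ts_sec, ts_usec, incl_len, orig_len


def record_offsets(data: bytes) -> list:
    """Return (offset, captured_length) for every packet record in a pcap image."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        byte_order = "<"      # written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        byte_order = ">"      # written on a big-endian machine
    else:
        raise ValueError("not a classic libpcap capture")

    offsets = []
    position = GLOBAL_HEADER_LEN
    # Each record's length is only known after reading its own header,
    # so records must be visited strictly in order from the start.
    while position + RECORD_HEADER_LEN <= len(data):
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack_from(
            byte_order + "IIII", data, position)
        offsets.append((position, incl_len))
        position += RECORD_HEADER_LEN + incl_len
    return offsets


def build_demo_capture(payloads) -> bytes:
    """Build a tiny synthetic little-endian capture (illustration only)."""
    image = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    for index, payload in enumerate(payloads):
        image += struct.pack("<IIII", index, 0, len(payload), len(payload))
        image += payload
    return image


if __name__ == "__main__":
    capture = build_demo_capture([b"abc", b"hello"])
    for offset, length in record_offsets(capture):
        print(f"record at byte {offset}, {length} captured bytes")
```

A line-oriented text file does not have this limitation, since a reader that starts at an arbitrary byte can skip forward to the next newline and resume from there.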
The exact command used to process the data is as follows:

    tshark -T fields -n -r inputdata.pcap -e frame.time -e tcp.len -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport > outputdata.txt

As seen in the above command, the timestamp, length (size in bytes), source IP address, source TCP port, destination IP address, and destination TCP port fields of each packet were selected to be saved in the output text, and the rest of the data was discarded.

    Aug 3, 2004 13:00:00.415828000 0 194.142.110.75 2744 79.164.138.159 4000
    Aug 3, 2004 13:00:00.415829000 0 79.40.135.118 2472 79.29.157.153 1755
    Aug 3, 2004 13:00:00.415856000 1460 79.29.237.240 52932 51.99.73.36 1553
    Aug 3, 2004 13:00:00.415869000 1460 79.29.237.240 52932 51.99.73.36 1553
    Aug 3, 2004 13:00:00.415874000 536 79.29.237.240 80 183.45.4.122 1370
    Aug 3, 2004 13:00:00.415887000 1460 79.29.208.60 16228 234.39.153.68 4522
    Aug 3, 2004 13:00:00.415891000 1460 31.35.119.13 80 79.157.5.110 4880
    Aug 3, 2004 13:00:00.415896000 0 18.112.13.245 1037 79.29.237.240 80
    Aug 3, 2004 13:00:00.415898000 1332 79.29.237.125 8000 116.44.185.0 3001

Figure 1: Converted plaintext data

These fields were chosen because they are sufficient to determine which TCP flow a packet belongs to and are useful for recording statistics about that flow. The goal of the analysis was two-fold. The first goal was to logically partition the input data into TCP flows and gather information about the number of packets transmitted during each flow, the number of bytes sent during the flow, and the duration, idle time, and average throughput of the flow. The second goal was to take that flow data and perform an additional analysis of application flows that use well-known port numbers (as described in the Introduction).
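As a minimal sketch of how one line of the converted file might be read back, the Python below splits a record on tabs and converts each field. It assumes the fields appear in the order given by the tshark command, that the month abbreviation is in English, and that every field is present; the names `PacketRecord` and `parse_line` are hypothetical. The timestamp is kept as integer nanoseconds because a floating-point value would lose some of the sub-second digits.

```python
from datetime import datetime
from typing import NamedTuple

EPOCH = datetime(1970, 1, 1)


class PacketRecord(NamedTuple):
    timestamp_ns: int   # nanoseconds since 1970-01-01, no time zone applied
    length: int         # tcp.len, the TCP payload size in bytes
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def parse_line(line: str) -> PacketRecord:
    """Parse one tab-delimited record produced by the tshark command."""
    time_text, length, src_ip, src_port, dst_ip, dst_port = (
        line.rstrip("\n").split("\t"))

    # strptime handles at most microseconds, so the fractional part
    # is split off and converted separately.
    whole, _, fraction = time_text.partition(".")
    moment = datetime.strptime(whole.strip(), "%b %d, %Y %H:%M:%S")
    seconds = int((moment - EPOCH).total_seconds())
    nanoseconds = int(fraction.ljust(9, "0")[:9])

    return PacketRecord(
        timestamp_ns=seconds * 10**9 + nanoseconds,
        length=int(length),
        src_ip=src_ip,
        src_port=int(src_port),
        dst_ip=dst_ip,
        dst_port=int(dst_port),
    )


if __name__ == "__main__":
    sample = ("Aug 3, 2004 13:00:00.415856000\t1460\t"
              "79.29.237.240\t52932\t51.99.73.36\t1553")
    print(parse_line(sample))
```

Packets without a TCP layer would presumably leave some fields empty; this sketch raises `ValueError` on such lines rather than guessing how they should be handled.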
The desired output information for the per-protocol analysis consists of the averages of the above metrics for each protocol (the average number of packets transmitted in a flow, the average duration of a flow, the average number of bytes sent during a flow, the average idle time of a flow, and the average throughput of a flow). These numbers were obtained by running two Hadoop programs.

The first program processed the plaintext input to determine flows. In order to determine the flow to which a given packet belongs, the source IP address/port number and the destination IP address/port number were extracted. Given that a flow is any communication between these two endpoints, it contains packets sent from host A to host B as well as packets sent from host B to host A. For each flow, the size of each packet was added to a running total used to determine the total number of bytes sent in the flow. Subtracting the timestamp of the first packet sent in the flow from the timestamp of the last packet sent in the flow produces the duration of the flow. Calculating idle time required setting a delay threshold between packet transmissions (1 minute), above which any time elapsed is considered idle time. After these metrics were calculated, the ratio of bytes sent to duration was calculated for each flow to determine the average throughput.

The second Hadoop program reduced the scope of the analysis to a subset of known protocols: File Transfer Protocol (FTP), Secure Shell (SSH), Simple Mail Transfer Protocol (SMTP), Hypertext Transfer Protocol (HTTP), Quicktime streaming, Valve Steam, Xbox Live, AOL Instant Messenger (AIM), Virtual Network Computing (VNC), and the Gnutella network. These protocols were chosen partly for their variety (streaming vs. non-streaming data, text-based vs. binary-based, etc.) and partly because they use well-known port numbers, which reduces the likelihood that packets associated with other unknown or unconsidered protocols are factored into the results. The second Hadoop program was supplied with a list of the port numbers for each of the above protocols as input. From there, it grouped together the flows that had an endpoint with a known port number and performed the analysis by totaling all of the metrics collected by the first program and then averaging them over all of the associated flows.
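The two steps described above can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not the Hadoop programs themselves: the names `flow_key`, `flow_metrics`, and `protocol_averages` are hypothetical, timestamps are taken to be seconds, and idle time is read as the portion of each inter-packet gap that exceeds the 1-minute threshold (the text admits other readings).

```python
from collections import defaultdict

IDLE_THRESHOLD_S = 60.0  # the 1-minute delay threshold from the text


def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Order the two endpoints so that both directions share one key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)


def flow_metrics(packets):
    """packets: iterable of (time_s, length, src_ip, src_port, dst_ip, dst_port)."""
    by_flow = defaultdict(list)
    for time_s, length, src_ip, src_port, dst_ip, dst_port in packets:
        key = flow_key(src_ip, src_port, dst_ip, dst_port)
        by_flow[key].append((time_s, length))

    metrics = {}
    for key, entries in by_flow.items():
        times = sorted(t for t, _ in entries)
        duration = times[-1] - times[0]          # last minus first timestamp
        idle = sum((max(0.0, later - earlier - IDLE_THRESHOLD_S)
                    for earlier, later in zip(times, times[1:])), 0.0)
        total_bytes = sum(length for _, length in entries)
        metrics[key] = {
            "packets": len(entries),
            "bytes": total_bytes,
            "duration_s": duration,
            "idle_s": idle,
            # Single-packet flows have zero duration; report 0 throughput.
            "bytes_per_s": total_bytes / duration if duration > 0 else 0.0,
        }
    return metrics


def protocol_averages(metrics, known_ports):
    """Average every metric over the flows touching each known port."""
    grouped = defaultdict(list)
    for (end_a, end_b), values in metrics.items():
        for port in {end_a[1], end_b[1]}:
            if port in known_ports:
                grouped[known_ports[port]].append(values)
    return {
        name: {field: sum(flow[field] for flow in flows) / len(flows)
               for field in flows[0]}
        for name, flows in grouped.items()
    }


if __name__ == "__main__":
    # Made-up packets: one HTTP flow in both directions, one SSH packet.
    demo = [
        (0.0, 100, "10.0.0.1", 40000, "10.0.0.2", 80),
        (10.0, 0, "10.0.0.2", 80, "10.0.0.1", 40000),
        (100.0, 50, "10.0.0.1", 40000, "10.0.0.2", 80),
        (5.0, 20, "10.0.0.3", 50000, "10.0.0.4", 22),
    ]
    print(protocol_averages(flow_metrics(demo), {80: "HTTP", 22: "SSH"}))
```

In a MapReduce setting, `flow_key` would be a natural choice for the key emitted by the mapper, so that all packets of one flow reach the same reducer regardless of direction.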
3. Results

The results of the first program, giving analysis on a per-flow basis, are too large to include completely in this document, as over 300,000 flows were identified and analyzed. The averages over all flows are as follows:

    Avg. duration (ms)   Avg. num. packets sent   Avg. num. bytes sent   Avg. idle time (ms)   Avg. throughput (bytes/sec)
    338239905.4          75.97722542              91986.30436            335087230.5           0.270217448

The results of the second program, giving analysis of all flows on a per-protocol basis, are as follows:

    Protocol                         Avg. duration (ms)   Avg. number of packets   Avg. number of bytes sent   Avg. idle time (ms)   Avg. throughput (bytes/sec)
    FTP data (port 20)               637193997            996.2266                 1389350                     595604624             2.1804204
    HTTP (port 80)                   301682110            41.277035                51818                       299802703             0.17176574
    AIM (port 5190)                  354741766            13.963832                5125                        353983451             0.014449714
    VNC (port 5800)                  66633000             5                        1121                        66393000              0.016823497
    SSH (port 22)                    602602515            1350.285                 832451                      570201755             1.3814279
    Quicktime Streaming (port 554)   743724721            1593.9512                1962336                     675147475             2.638525
    Xbox Live (port 3074)            259300235            34.294117                38121                       257660000             0.1470174
    SMTP (port 25)                   625108274            45.82221                 41740                       622928575             0.06677397
    Steam (port 1725)                367820062            49.046875                61500                       365121750             0.16720383
    Gnutella (port 6346)             544693413            282.71384                214660                      530808184             0.39409506

Figure 2: Average duration per protocol
Figure 3: Average number of packets per protocol
Figure 4: Average number of bytes sent per protocol
Figure 5: Average idle time per protocol
Figure 6: Average throughput per protocol

The results of the above analysis show that application protocols that involve sending or receiving large files (such as FTP and Quicktime Streaming) typically send more packets per flow than those that do not (such as SMTP). Surprisingly, the average flow duration is of roughly the same order of magnitude across the protocols, with VNC as the main exception. A degree of error was likely introduced into these results, given that the port numbers of applications outside the range of reserved ports may not be stable or correct. In addition, some applications rely on mechanisms that affect the lifecycle of a TCP connection, such as load balancing, in which a client makes an initial request to a known server that is subsequently handled by an entirely different server, possibly on a different IP address or TCP port number, thus segmenting what is conceptually one flow into multiple TCP connections.

4. Concluding Remarks

In addition to providing data and analysis relevant to the specific application protocols examined in this paper, it is the hope of the author that these methods continue to be used to give insight into protocols at the flow level. Application-layer protocols are seldom analyzed at such a high level; typically, analysis is done at the packet or message level (a message being the concatenated packets that make up a single application-level transmission). Flow-level information leads to a greater understanding of how a protocol behaves over time, as well as how factors such as user interaction delay can affect response time and throughput. For example, flow duration can be important because TCP connections incur overhead to set up and tear down.
The analysis presented in this paper could easily be extended to provide important metrics such as the average number of packet retransmissions caused by TCP, the average size of a file transferred in a specific FTP session, etc.

5. References

[1] Dean, J., Ghemawat, S., MapReduce: Simplified Data Processing on Large Clusters, Proceedings of OSDI, 2004.
[2] Harris, G., Development/LibpcapFileFormat, Wireshark, http://wiki.wireshark.org/development/libpcapfileformat, March 2011.
[3] Nagele, W., RIPE-NCC/hadoop-pcap, GitHub, https://github.com/ripe-ncc/hadoop-pcap, January 2013.
[4] Wireshark Documentation, tshark: The Wireshark Network Analyzer 1.8.0, Wireshark, http://www.wireshark.org/docs/man-pages/tshark.html.