Monitoring BitTorrent Swarms


António Manuel Rebelo Alves Homem Ferreira

Dissertation submitted for the degree of Master in Communication Networks Engineering (Mestre em Engenharia de Redes de Comunicações)

Jury
President: Prof. Doutor Paulo Jorge Pires Ferreira
Supervisor: Prof. Doutor Ricardo Jorge Feliciano Lopes Pereira
Co-supervisor: Prof. Doutor Fernando Henrique Corte Real Mira da Silva
Members: Prof. Doutor João Coelho Garcia

September 2011


Acknowledgments

To all those who helped and supported me over this journey, from teachers to colleagues, friends to family, I thank you all.


Resumo

Peer-to-Peer protocols, especially BitTorrent, are responsible for a large majority of the traffic generated on the Internet, having a great impact on inter-ISP traffic and, consequently, on ISPs' peering costs. By monitoring more than 3200 real swarms in an Internet environment, we found that there is a large amount of locality that can be exploited and used to decrease inter-ISP traffic. We discuss the relationship between swarm size, content popularity and the existing locality, and demonstrate that even small swarms have locality properties. We also observed swarms that share content specific to a region, showing high locality.

During the experiment we also discovered that there is a quantity of repeated content being shared in the network. Several peers tend to publish the same content through different torrent files, creating several independent swarms that end up sharing a large number of common parts. This redundancy can be exploited to increase data availability and source diversity, as well as the existing locality. To exploit this redundancy, we propose a new technique named Partial Swarm Merger, which adds a new component to the BitTorrent infrastructure, allowing peers to discover other swarms that share common content. With this information, the different peers can participate in the different swarms, announcing to and requesting from each swarm the parts in common with their download. In this way, the availability of the parts common to the several swarms will increase.

Keywords: BitTorrent, Peer-to-Peer, Monitoring, Locality, Repeated content, Availability


Abstract

Peer-to-Peer protocols, especially BitTorrent, account for most of the traffic generated on the Internet, having a great impact on inter-ISP traffic and thus on ISPs' peering costs. However, through locality mechanisms, P2P traffic can be contained close by in the network, and even within the same ISP, decreasing inter-ISP traffic. Through the monitoring of over 3200 live Internet swarms, we found that there is a lot of locality that can be exploited and used to decrease inter-ISP traffic. We discuss the relationship between swarm size, content popularity and the existing locality, and find that even small swarms have some locality properties. We also observed swarms sharing content specific to a region and thus showing a great amount of locality.

During the experiment we also discovered that there is a significant amount of repeated content being shared. Various publishers tend to publish the same content through different torrent files, creating independent swarms that end up sharing a large number of common parts. This redundancy can be exploited in order to increase data availability and source diversity, as well as the existing locality. To exploit this redundancy, we propose a novel technique, called Partial Swarm Merger, which adds a new component to the BitTorrent infrastructure, allowing peers to learn about swarms with common content. With this information, peers could combine the different swarms, announcing to and requesting from each swarm the pieces in common with their download. This will increase the availability of the parts that are common to the several swarms.

Keywords: BitTorrent, Peer-to-Peer, Monitoring, Measurements, Locality awareness, Repeated content, Availability


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
1 Introduction
  1.1 The Problem
  1.2 Work description
  1.3 Structure of this thesis
  1.4 Publications
2 BitTorrent Protocol
3 State of the art
  3.1 Locality Solutions
    3.1.1 Locality through Client
    3.1.2 Locality through peer and ISP cooperation
    3.1.3 Locality through ISP alone
    3.1.4 Comparison between solutions
  3.2 Locality studies
    3.2.1 Studies of BitTorrent's locality
    3.2.2 Comparison between related studies
  3.3 Content availability
4 Methodology for gathering and analysing data
  4.1 System architecture
  4.2 Data analysis methodology
5 Results
  5.1 Content Analysis
    5.1.1 Content pollution
    5.1.2 Content repetition
  5.2 Locality Analysis
    5.2.1 Repeated content
    5.2.2 All content
    5.2.3 Regional content
    5.2.4 Large swarms
    5.2.5 Two-hour period
  5.3 Peer and Tracker behavior
    5.3.1 Tracker behavior
    5.3.2 Peer behavior
  5.4 Summary
6 Partial Swarm Merger
  6.1 PSM
  6.2 Use Case
7 Conclusions and future work
  7.1 Conclusions
  7.2 Future Work
Bibliography

List of Tables

3.1 Comparison between some solutions already implemented
3.2 Comparison between studies of the locality potential in BitTorrent
5.1 Torrent aggregation benefits


List of Figures

2.1 (1) Obtaining peers from tracker to be able to join the swarm, (2) exchanging data with other peers and (3) obtaining more active peers on the swarm
4.1 Work flow of the system
5.1 CDF with the percentage of unique pieces for each torrent file
5.2 Maximum swarm size and number of seeders for all swarms representing pollution with a maximum swarm size value above 50 peers
5.3 Time, in hours, swarms sharing polluted content take to drop to 20% of their maximum size
5.4 CDF with the shared percentage of pieces per number of torrent pairs with at least one common piece
5.5 CDF with the shared MegaBytes per number of torrent pairs with at least one common piece
5.6 Number of torrents published by team
5.7 Histogram of the content repetition frequency
5.8 Average number of peers per content
5.9 Average number of seeders per content
5.10 Number of peers per country for similar content
5.11 Number of peers per ISP for similar content
5.12 Increase in swarm size at each ISP by aggregating similar content
5.13 CDF with the content size for all content considered as being pollution
5.14 Average number of peers obtained and average number of peers per country for the one-day data aggregation period
5.15 Average number of peers obtained and average number of peers per ISP for the one-day data aggregation period
5.16 Distribution of the percentage of the median number of peers that belong to the same country or ISP for the one-day data aggregation period
5.17 Countries with an average above 30 peers per day for a regional torrent
5.18 ISPs with an average above 10 daily peers for a regional torrent
5.19 Regional torrents with 60% of all peers belonging to the same country, at least 75% of the times
5.20 Regional torrents with 30% of all peers belonging to the same ISP, at least 75% of the times
5.21 Torrent size and maximum number of seeders for swarms with a maximum number of seeders above 5000
5.22 Average number of peers obtained vs average number of peers per country for the two-hour period data aggregation
5.23 Average number of peers obtained vs average number of peers per ISP for the two-hour period data aggregation
5.24 Distribution of the percentage of the median number of peers that belong to the same country or ISP for the two-hour period data aggregation
5.25 Distribution of peers per tracker per torrent, for torrents announced in more than 8 trackers
5.26 Swarm size throughout time per tracker, for a video file torrent
5.27 Day-night behavior for a video file torrent
5.28 Number of seeders per swarm size
5.29 Swarm size per content size
5.30 Swarm size throughout time
6.1 PSM - Populating databases
6.2 PSM's workflow

Chapter 1

Introduction

1.1 The Problem

In the last few years, Peer-to-Peer (P2P) communication has increased exponentially [37] and has proven to be one of the most successful architectures for providing services such as VoIP, video streaming and, of course, file sharing. Due to this rapid growth, Internet Service Providers (ISPs) have seen their inter-ISP traffic increase, mainly because of file-sharing protocols. The main contributor is BitTorrent [9], which represents most of the P2P traffic generated worldwide [37].

P2P, and especially BitTorrent, are popular because of their unique properties. In P2P networks every peer can be both server and client and can connect to any other peer, and the network doesn't depend on any single peer, allowing peers to enter and leave the network at any time. This means that there are no infrastructure costs. The only cost is residential-grade equipment, so it is easier and faster for a user to put content online to be shared. P2P protocols are also highly scalable. Besides these P2P characteristics, the BitTorrent protocol has robustness mechanisms such as the random selection of the peers it connects to and a tit-for-tat algorithm that ensures a consistent download rate and keeps the client connected to peers with roughly the same bandwidth as itself. These and other characteristics allow BitTorrent to work well in networks with few peers and even better in networks with many peers.

There are a number of solutions for decreasing the inter-ISP traffic generated by BitTorrent, but some, like traffic shaping, are difficult to implement because BitTorrent doesn't use standard ports and can encrypt its traffic. Other solutions, like caching, are discouraged because some of the shared files infringe copyrights. One solution to the inter-ISP traffic problem is the use of locality mechanisms.
These mechanisms force peers to prefer connections to other peers in the same ISP or country rather than peers halfway around the globe, while maintaining BitTorrent's performance. The real goal is to provide a win-win solution for both user and ISP, where users get faster downloads, since they are connected to other peers in a nearby network, and ISPs decrease the amount of inter-ISP traffic by increasing their intra-ISP traffic. However, implementing a locality mechanism isn't as trivial as it may sound. In a P2P network there is no infrastructure and a peer can enter and leave the network at any time. This makes the network very unpredictable and thus makes locality mechanisms difficult to implement. BitTorrent also has other mechanisms that represent major barriers to the implementation of locality mechanisms, such as the random selection of the peers it connects to.

Another characteristic of P2P protocols is the low barrier to entry for publishing content, which has enabled many to publish the works of others. However, users tend to download content from sources they trust. These sources are usually groups of individuals (publisher teams) that have built a reputation by competing with each other to be the first group to publish a specific content. This competition often results in the creation and publishing of different torrent files that represent the same or very similar content. This is a source of redundancy that is not exploited by the conventional BitTorrent protocol. In BitTorrent, swarms are identified by a hash of information regarding the content being shared. Even if two swarms are sharing the same content, a mere difference in file name means that these swarms will be isolated from each other and won't exchange any kind of data.

To fully understand the BitTorrent protocol, the peers and the shared content, it is necessary to monitor and analyse live Internet swarms. This enables us to extract important information about the behavior of peers and swarms and the existing locality in these swarms, and to identify patterns in the network. With all this information, we can check whether the implementation and deployment of locality mechanisms, or of other mechanisms such as those that exploit content redundancy, are viable and what kind of modifications to the protocol are needed to take the most advantage of such mechanisms.

1.2 Work description

Through the monitoring of over 3200 BitTorrent swarms, we were able to gather peer data, as well as information regarding the content being shared in each swarm.
With this data, the work was divided into three studies: (1) a locality study, which presents all findings regarding locality in each swarm, (2) a content study, which focuses on the analysis of the content being shared across the different swarms, and (3) a peer-related study, which discusses peer behavior and other findings.

The locality study shows that there is a great amount of locality to be exploited in BitTorrent swarms, and analyzes all monitored swarms to quantify this locality. We show that the existing locality is correlated with the swarm size and thus with the content's popularity, since popular content is expected to have very large swarms. However, we also found swarms that represented content specific to a given country, region or language, which showed a remarkable locality potential despite their often small size. We also observed that even small, non-regional swarms showed some locality that can be exploited. This locality was analysed in periods of one day and in periods of two hours, and the results show that, for small content and fast download speeds, a locality mechanism would benefit greatly from also being implemented in the tracker, replacing the random peer selection algorithm.

Regarding the content analysis, we discuss the redundancies found in the repeated content being shared across the network. We show that the competition between publisher teams results in different swarms that share the same or similar content but are isolated from each other, and discuss how this affects the performance of BitTorrent. We demonstrate that combining these different swarms into a larger one will improve content availability and the existing locality. We propose a solution called Partial Swarm Merger (PSM) as an efficient way to exploit the content redundancy. This solution would be based on a service outside the BitTorrent network and would not require any modifications to the BitTorrent protocol. However, BitTorrent client applications would need an extension in order to use the service.

As for the peer-related study, we present some findings regarding peer behavior. We show a day-night behavior, where the swarm periodically decreases and increases in size. We also present the distribution of peers across the different trackers for torrent files announced in several trackers. We show that the current lack of load balancing mechanisms between trackers results in some trackers having much larger swarms than others, affecting availability and locality. Another behavior we found was that the percentage of seeders in the swarm tends to increase with the swarm size, making it very close to 100% for very large swarms. This behavior shows that free-riding is not a significant problem in BitTorrent, and that most peers are willing to share what they download.

The data for all these studies was obtained by downloading and analysing torrent files, and by gathering information on the associated swarms. The latter was performed using an instrumented BitTorrent client application running on several PlanetLab [8] nodes that periodically queried trackers for swarm and peer information. The peers' IP addresses were later converted to a geographic position and to the corresponding ISP. This operation was accomplished with the help of MaxMind's GeoIP databases.

1.3 Structure of this thesis

This thesis is organized as follows: Chapter 2 explains the BitTorrent protocol in detail.
In Chapter 3 we present and discuss the related work regarding locality mechanisms and studies, as well as studies and approaches for improving the availability of content. After the related work, Chapter 4 presents our methodology for gathering and analysing data and Chapter 5 presents all our findings. Then, in Chapter 6, we present our approach, PSM, for improving availability and locality through redundancy exploitation, as well as a use case scenario. Finally, Chapter 7 presents the final conclusions, future work and achievements.

1.4 Publications

So far, the work developed in this thesis has resulted in one paper. This paper was submitted for presentation at a conference and we are now waiting for its evaluation.

* António Homem Ferreira, Ricardo Lopes Pereira and Fernando M. Silva. Partial Swarm Merger: Increasing BitTorrent content availability. Submitted to INFOCOM 2012.

This article discusses the existence of isolated swarms sharing the same content in BitTorrent networks, and how the merging of these swarms can increase content availability. In the paper, we introduce PSM, explaining how it would work and how it could be implemented.


Chapter 2

BitTorrent Protocol

Before presenting the work described in this thesis, we provide a brief review of the BitTorrent protocol. BitTorrent is a P2P protocol for file sharing in which peers share files among themselves and bear the upload costs. Unlike other file-sharing P2P protocols such as eMule (footnote 1) or Kazaa (footnote 2), BitTorrent doesn't provide any mechanism for file search. Its goal is just to exchange and replicate files. This means that all file searches are done outside the network. There are two main components in the BitTorrent protocol:

1. Tracker: Provides a list of peers sharing a given file. It can also receive and log information about upload/download rates and other details for statistical purposes.
2. Peers: Share a given file among themselves. There are two types of peers: the ones that have already finished downloading the file, called seeders, and the ones still downloading the file, called leechers.

To share content, a peer needs to create a torrent file and publish it. This file contains meta-information such as: (1) file names and sizes, (2) the Uniform Resource Locator (URL) of the tracker(s), (3) the hashes of each file part (piece) and the fixed piece size, and (4) comments, creation date, encoding and other information on the content and files. After the torrent file is published, usually on a webpage, interested users can download it and open it with a BitTorrent client application. This application reads the file and queries the tracker for a list of active peers for that same file. After receiving the list, it connects to the peers and starts downloading the file. All file distribution is done between peers. Trackers don't get involved in the file sharing process. These two steps in the file download are represented in Figure 2.1 as steps 1 and 2.

Peers exchange blocks of data from a data aggregate which contains one or more concatenated files. The exchange unit is the piece, which has a fixed size.
Each piece is associated with a SHA1 hash, found in the torrent file and used to verify its integrity. After downloading a piece and verifying its hash, a peer informs every peer connected to it that it now has that piece available for upload. These hashes are the same for equal pieces of the same size, and it is very unlikely that different pieces happen to produce the same hash.

Footnote 1: http://www.emule.com/, last accessed August 2011
Footnote 2: http://www.kazaa.com/, last accessed August 2011

[Figure 2.1: (1) Obtaining peers from tracker to be able to join the swarm, (2) exchanging data with other peers and (3) obtaining more active peers on the swarm.]

Peers periodically query the tracker for other peers sharing the same content. However, peers can also query each other to obtain other active peers in their swarm. This can be done using Peer Exchange (PEX) or through DHTs [13]. This is represented as step 3 in Figure 2.1.

Each swarm is identified by the infohash generated from the torrent file. The infohash is a 20-byte SHA1 hash, URL-encoded when sent to the tracker, generated from the information in the info key of the torrent file. This information includes the piece size, the hashes of all pieces, the file name, the file size and other information. As a result, two torrent files that refer to the same file can produce two different infohashes just because the file's name differs. Despite sharing the same content, these two torrent files will be associated with two different and isolated swarms that do not share any information with each other.

As for trackers, they typically provide 50 peers chosen using a random algorithm, but they are free to use better peer selection algorithms.

When a peer starts downloading a file, it doesn't have any part of that file to share, so it is very important to obtain a complete piece as quickly as possible. As such, the first piece to download is chosen at random. The remaining pieces are chosen by a rarest-first algorithm.
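Returning to the infohash described above, the following is a minimal sketch of how two torrents with identical content but different file names end up in separate swarms. It assumes a single-file torrent whose info dictionary uses the keys name, length, piece length and pieces; the helper names (bencode, piece_hashes, infohash), the file names and the 16 KiB piece size are illustrative choices, not taken from the thesis.

```python
import hashlib


def bencode(value):
    """Minimal bencoder covering ints, strings/bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i" + str(value).encode() + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode() if isinstance(k, str) else k, v) for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def piece_hashes(data, piece_length):
    """Concatenated 20-byte SHA1 digests, one per fixed-size piece."""
    return b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )


def infohash(name, data, piece_length=16384):
    """SHA1 over the bencoded info dictionary (hypothetical single-file layout)."""
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": piece_hashes(data, piece_length),
    }
    return hashlib.sha1(bencode(info)).digest()


content = bytes(range(256)) * 200  # 51,200 bytes of example data
hash_a = infohash("release-by-team-a.bin", content)
hash_b = infohash("release-by-team-b.bin", content)
print(hash_a.hex())
print(hash_b.hex())
print(hash_a == hash_b)
```

The piece hashes are identical for both torrents because they depend only on the content, while the two infohashes differ because the name is part of the hashed dictionary. This is the situation in which peers of the two swarms hold the same pieces but never learn about each other.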
The rarest-first algorithm determines that a peer should select for download the pieces that most of the peers connected to it don't have. Popular pieces are left for later download. This way, file sharing in the network is faster, since rare parts tend to become popular over time, and the robustness of the protocol is increased, since rare parts are less likely to become unavailable when a peer leaves the network. Rarest-first also guarantees good performance even when there is only one seed sharing a given file. Peers will download different rare parts from it and eventually start sharing parts between themselves, not depending entirely on the one seed, which makes the download faster.

The piece selection algorithms have another property that helps pieces complete quickly: if a piece has already been requested, the algorithm prefers finishing that piece before requesting a new one. As for the end of the download, there is also a mechanism to prevent delays caused by requesting a piece from a peer with a low transfer rate: the last sub-pieces are requested from all peers.
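The piece selection rules described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the code of any real client: the function name choose_piece, the representation of each peer's pieces as a set of indices, and the random tie-breaking among equally rare pieces are all choices made here for the example. The end-of-download behavior is left out.

```python
import random
from collections import Counter


def choose_piece(have, peer_pieces, in_progress=(), rng=random):
    """Pick the next piece index to request, or None if nothing is needed.

    have        -- set of piece indices already completed locally
    peer_pieces -- dict mapping a peer id to the set of piece indices it has
    in_progress -- piece indices that were already partially requested
    """
    # Prefer finishing a piece that has already been started.
    for index in in_progress:
        if index not in have and any(index in p for p in peer_pieces.values()):
            return index

    # Count how many connected peers hold each piece we still need.
    availability = Counter()
    for pieces in peer_pieces.values():
        availability.update(pieces - have)
    if not availability:
        return None

    # First piece: choose at random to get something to share quickly.
    if not have:
        return rng.choice(sorted(availability))

    # Otherwise rarest first, breaking ties at random.
    rarest = min(availability.values())
    candidates = sorted(i for i, count in availability.items() if count == rarest)
    return rng.choice(candidates)


peers = {"p1": {0, 1, 2}, "p2": {0, 1}, "p3": {0}}
print(choose_piece({0}, peers))                   # piece 2 is held by one peer only
print(choose_piece({0}, peers, in_progress=[1]))  # a started piece takes priority
```

With the example swarm, piece 2 is selected because only one peer has it, unless piece 1 is already in progress, in which case finishing it comes first.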

BitTorrent also has a way of handling peer communication through its tit-for-tat algorithm, also known as the choking algorithm. Through this mechanism, peers choose to whom they upload their pieces, choking (not uploading to) the others. There is a fixed number of upload slots and, every 10 seconds, the average download rate over the last 20 seconds is calculated for each peer. After this calculation, the peers with the highest averages are unchoked: they are given a slot and uploading to them starts. No uploading is done to the rest of the peers. However, this mechanism has a problem: there may come a time when it is impossible to discover other peers that would allow higher download rates. To solve this problem, there is a single slot that uses optimistic unchoking, which unchokes a peer every 30 seconds (by default) regardless of its download rate. This is analogous to always cooperating on the first move in the Prisoner's Dilemma. This way, every peer has a chance to compete for an upload slot, and if there is a peer with a higher download rate than the ones occupying the upload slots, it will eventually get a slot for itself.

All these properties make BitTorrent very scalable, provide a very fast way to share files and discourage free-riding [19], where peers don't share what they download.
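As a rough illustration of the choking algorithm described above, the sketch below performs one re-evaluation round. It assumes the caller has already measured, for each peer, the average rate at which that peer sent data over the last 20 seconds, calls the function every 10 seconds, and sets rotate to true every 30 seconds. The function name rechoke, the default of three regular slots and the example rates are hypothetical.

```python
import random


def rechoke(rates, regular_slots=3, current_optimistic=None, rotate=False, rng=random):
    """Return (regularly unchoked peers, optimistically unchoked peer).

    rates              -- dict mapping peer id to its average download rate
                          (bytes/s received from that peer, last 20 seconds)
    regular_slots      -- number of rate-based upload slots (assumed value)
    current_optimistic -- peer currently holding the optimistic slot, if any
    rotate             -- True when the optimistic slot should be re-drawn
    """
    # Rank peers by rate, highest first; the peer id breaks ties deterministically.
    ranked = sorted(rates, key=lambda peer: (-rates[peer], peer))
    regular = ranked[:regular_slots]
    choked = ranked[regular_slots:]

    optimistic = current_optimistic
    # Re-draw when it is time to rotate, or when the current holder left
    # the swarm or earned a regular slot on its own.
    if rotate or optimistic not in choked:
        optimistic = rng.choice(choked) if choked else None
    return set(regular), optimistic


rates = {"a": 50_000, "b": 10_000, "c": 40_000, "d": 5_000, "e": 30_000}
regular, optimistic = rechoke(rates, rotate=True)
print(sorted(regular))   # the three fastest peers
print(optimistic)        # one of the remaining peers, chosen at random
```

The optimistic slot is what lets a newly arrived or previously unmeasured peer demonstrate its rate; once it uploads fast enough, the ranking step moves it into a regular slot on a later round.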


Chapter 3 State of the art

Many studies have been done on P2P network monitoring and, in particular, on BitTorrent monitoring. In this chapter, we focus on most of the work done in the fields of locality and content availability.

Regarding the locality issue, there are three main kinds of solutions: (1) solutions implemented only in the BitTorrent client, (2) solutions based on cooperation between BitTorrent and ISPs and (3) solutions implemented only in the ISP. In solutions implemented only in the BitTorrent client, a client computes the distance between itself and other peers and chooses to connect to the ones close to it. Another solution is to have BitTorrent clients and trackers cooperate with ISPs. Since ISPs have inside information about the network's topology, they know exactly where the peers in their own network are and can specify rules and paths for the communication between them. These two kinds of solutions are based on peer selection, where a peer is supposed to connect to, and download the file from, other peers that are close to it. However, there are other solutions implemented by ISPs alone, such as caching, traffic shaping or even traffic blocking. These are the most difficult to implement because of characteristics of P2P networks and of the BitTorrent protocol, such as the ability to conceal its traffic.

Regarding content availability, most of the work presented focuses on bundling and other techniques that can be used to increase the availability of unpopular content by grouping and sharing related content. The main goal of these solutions is to supply a mechanism that enables BitTorrent and other P2P file sharing protocols to maintain high availability for a given content, even after the initial flash crowd.

3.1 Locality Solutions

3.1.1 Locality through Client

A BitTorrent client uses a random algorithm to choose the peers it connects to. This is the first thing to be changed when a locality mechanism is implemented.
In a client solution, a different algorithm is used, one that makes the peer connect to other peers close to it, for example peers belonging to the same ISP.

Ono [6] is an Azureus¹ extension that uses a biased peer selection mechanism. This mechanism relies on Content Distribution Networks (CDNs). These networks have replica servers all over the globe, in many ISPs, and use dynamic DNS redirection to send a client to the nearest CDN servers. This means that two clients that are redirected to the same CDN servers are likely to be close to each other. Ono uses this information to build ratio maps that represent the proximity of a given peer to CDN servers, and a peer then uses these ratio maps to determine whether other peers are close to it. If two peers have similar ratio map values, there is a high probability that they are close to each other. This solution needs no additional information about the network topology and no new infrastructure, but gives no guarantee that it will decrease inter-ISP traffic, since not every ISP hosts a CDN server. However, the authors' evaluation shows that it does reduce inter-ISP traffic since, over 30% of the time, it finds peers that belong to the same AS as the client. The authors also show that their solution increases download and upload rates whenever bandwidth is available, and that the selected peers are located along paths with lower RTTs than those chosen randomly.

¹ Azureus is a BitTorrent client. http://azureus.sourceforge.net/, last accessed August 2011

Other solutions suggest modifications either to the tracker's peer list (the one requested by and sent to peers) and the client's peer selection algorithm, or just to P2P traffic shaping devices [3]. The peer list is built with the help of Internet topology maps or of the ISP's information about its IP address range, so that peers connect to others close to them. However, peers do not connect only to peers from the same ISP; they also establish and maintain some connections to peers outside the ISP, so that they keep a global view of the pieces existing in the network. To evaluate Biased Neighbor Selection mechanisms, a discrete-event simulator was built and used.
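A tracker-side biased peer list of the kind just described could be sketched as follows. This is a hedged Python illustration: the function name, the list size and the number of external peers are assumptions chosen for the example and are not values taken from [3].

```python
import random


def biased_peer_list(requester_isp, peers, list_size=10, external=2, rng=random):
    """Build a peer list made mostly of peers from the requester's ISP.

    requester_isp -- ISP identifier of the requesting peer
    peers         -- dict: peer id -> ISP identifier (requester excluded)
    list_size     -- total peers to return (assumed value)
    external      -- peers deliberately taken from other ISPs (assumed value)
    """
    internal = [p for p, isp in peers.items() if isp == requester_isp]
    outside = [p for p, isp in peers.items() if isp != requester_isp]
    rng.shuffle(internal)
    rng.shuffle(outside)

    # Always keep a few external peers for a global view of the pieces.
    chosen_out = outside[:external]
    chosen_in = internal[:list_size - len(chosen_out)]

    # If the ISP has too few peers, top up the list with more external ones.
    shortfall = list_size - len(chosen_in) - len(chosen_out)
    chosen_out += outside[external:external + shortfall]
    return chosen_in + chosen_out
```

When the requester's ISP holds few peers, the sketch falls back to external peers rather than returning a short list, so locality is treated as a preference and not as a hard constraint.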
The conclusion of [3] is that a biased peer selection method works well but should be combined with other mechanisms such as bandwidth throttling or caching. For the network to work as desired, a peer's list should also contain a number of peers from outside its ISP.

In [5], another approach was taken. To study how a BitTorrent locality mechanism would improve user and ISP experience, a number of solutions were implemented, analyzed and measured in a real Internet AS topology. The solutions include changes to the BitTorrent protocol at the peer selection level and the choking/unchoking level, as well as piece picker locality. The peer selection method the authors implemented is very similar to the one in [3]. The tracker keeps an association between peers and AS hop counts and uses it to select the peers to send when a peer list is requested. The modifications to the choking/unchoking algorithm consisted of choking and unchoking peers based on the network distance between them (AS hop count) instead of the peers' upload speed. However, optimistic unchoking was left untouched. As for the piece picker locality method, it replaced rarest-first so that a peer is encouraged to download first the pieces held close to it. The evaluation shows that tracker locality (the peer selection method) achieves a low AS hop count, while choker and piece picker locality decrease the download time in comparison with standard BitTorrent. Although tracker locality was able to keep traffic close by in the network, it had a problem: peer workload was not distributed evenly. In their conclusions, the authors suggest that there should be a tradeoff between locality and peer workload.

The authors of [33] also focused their work on Biased Neighbor Selection and Biased Unchoking methods. Like [5], they modified the peer selection algorithm and the choking algorithm to prefer connections to peers based on the AS hop count.
However, they also decided to change the optimistic unchoking algorithm. While in the standard optimistic unchoking method all choked peers have the same probability of being optimistically unchoked, in their approach this probability is tied to the distance to each peer: the closer a peer is, the higher the probability that it is optimistically unchoked. The authors studied and compared the use of Biased Neighbor Selection, of Biased Unchoking and of both together, and the results show that the combination of both methods achieves the best performance in both locality and download speed. Their experiment also showed that Biased Unchoking works best in high load situations.

There are also solutions that perform biased peer selection through real-time measurements such as pings, traceroutes and AS hops. An example is the full BitTorrent client application TopBT [40]. TopBT is a topology-aware client that needs neither ISPs nor services like CDNs to find peers close by. It periodically uses traceroute and ping probes to determine its proximity to a peer and to map the already established connections to the corresponding AS hops and link hops. After obtaining this information, it unchokes peers based on routing hops, download rates and reciprocal upload rates. This way, the platform does not even need to be widely deployed for results to be seen in the network's traffic. However, since it tries to connect to peers with a lower hop count, this work focuses more on decreasing download time and BitTorrent traffic in the network than on solving the inter-ISP traffic problem (although it can contribute to solving it). The AS hop count is the number of ASes that a packet has to go through from source to destination. Although a low AS hop count can help keep traffic close by in the network, there is no guarantee that it will stay inside the same ISP unless the AS hop count is equal to zero.
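To make the AS hop count concrete, the following Python sketch derives it from a traceroute path and an IP-to-AS map, the two inputs the text attributes to TopBT. It is not TopBT's implementation: the function name, the prefix table format and the choice to count AS boundaries crossed (so that zero means source and destination share an AS) are assumptions of this sketch. The prefixes and AS numbers in the usage note are reserved documentation values.

```python
import ipaddress


def as_hop_count(hop_ips, prefix_to_as):
    """Count the AS boundaries crossed along a traceroute path.

    hop_ips      -- list of IP address strings, source first
    prefix_to_as -- dict: CIDR prefix string -> AS number (hypothetical map)
    """
    networks = [(ipaddress.ip_network(prefix), asn)
                for prefix, asn in prefix_to_as.items()]
    # Most specific prefix first, so the longest match wins.
    networks.sort(key=lambda item: item[0].prefixlen, reverse=True)

    def lookup(ip):
        address = ipaddress.ip_address(ip)
        for network, asn in networks:
            if address in network:
                return asn
        return None  # hop not covered by the map

    hops = 0
    previous = None
    for ip in hop_ips:
        asn = lookup(ip)
        if asn is None:
            continue  # skip unmapped hops instead of guessing their AS
        if previous is not None and asn != previous:
            hops += 1
        previous = asn
    return hops
```

For example, with the map {"192.0.2.0/24": 64496, "198.51.100.0/24": 64497}, a path that starts in 192.0.2.0/24 and ends in 198.51.100.0/24 crosses one boundary, while a path that never leaves 192.0.2.0/24 yields zero.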
The evaluation of TopBT was done by deploying it onto 106 PlanetLab and residential hosts and downloading various torrent files on each host. The results, when compared to the ones obtained from a regular BitTorrent client, show that TopBT has a lower download time and can reduce the induced BitTorrent traffic by up to 25%.

One other work that also focuses on exploiting locality in P2P networks is Adaptive Search Radius (ASR) [35]. The authors show that both ISPs and peers could gain if the network were used more efficiently. ASR is a peer selection mechanism that defines a search radius, measured in network hops with the peer at the center, and only connects to other peers inside that radius. The mechanism dynamically changes the search radius according to the availability of file parts, so that download time is not affected and the client application still uses the network efficiently. The search radius only affects downloads, not uploads. Through simulations, the authors show that, although ASR achieved better results than Biased Neighbor Selection (BNS), the two approaches complement each other and would benefit from being combined. The remaining results show that both peers and ISPs would benefit from the use of ASR in an Internet topology network.

3.1.2 Locality through peer and ISP cooperation

There are also approaches that rely on tight cooperation between peers and ISPs, like the one presented in [1]. This work focuses on a service provided by the ISP, an Oracle service that users can query to get information on the underlying network, enabling peer selection according to ISP-defined criteria. When peers query the Oracle, they send it a list of possible P2P neighbors and the Oracle ranks each of them according to criteria defined by the ISP. Since the service is provided by the ISP, which has full access to information regarding the underlay network, the Oracle can rank P2P neighbors based on anything from topological information to link congestion or cost. This service has several advantages over solutions that lack cooperation between peers and ISPs: (1) peers do not need to waste valuable time and resources measuring path performance, (2) there is no need to infer network distances, since the ISP has real-time information about all of its links, (3) bottlenecks can be avoided, thus improving user experience, and (4) ISPs can direct traffic away from expensive or congested links. Despite these advantages, there is a problem of mutual trust between peers and ISPs. Since many files are shared illegally, users tend to avoid using these services. As for ISPs, they are very reluctant to give away information about their own network. The system was evaluated in a test lab, using a modified Gnutella¹ protocol that queries the Oracle when a peer joins. The results show that the number of Query messages flowing in the network decreases and that messages tend to stay inside the same AS. Future work for this project includes evaluating it on PlanetLab.

Another solution in this area is P4P [44]. P4P has the same objective as the Oracle, which is to supply underlay network information to the overlay P2P network, but achieves it in a different way. In this approach there is an iTracker located in each ISP that supplies network layer information about the network in which it is located.
This component has all the information needed to rank P2P neighbors by both network distance and ISP-defined criteria. There is also another component called an appTracker. When a peer wants to find other peers close to it, it queries the appTracker for a peer list. To create this list, the appTracker communicates with the iTracker of each ISP to build a list with peers mainly from the same AS. However, some peers in the list should be from ASes other than that of the requesting peer, so that the robustness of the P2P protocol is maintained. Finally, the list of peers is sent back to the requesting peer. This approach was evaluated through simulation and real Internet experiments and is already deployed. The results show that it can be a promising approach to solving the inter-ISP traffic problem while maintaining P2P performance.

The IETF has also formed a working group for ALTO (Application-Layer Traffic Optimization) [39] with the objective of defining a protocol that P2P applications can use to query a service about the underlying network topology. The group discusses the requirements for a standard ALTO protocol and also focuses on issues such as security, privacy and service discovery for ALTO usage.

¹ Gnutella is a P2P file sharing protocol. http://rfc-gnutella.sourceforge.net/, last accessed August 2011

3.1.3 Locality through ISP alone

The paper [36] discusses the major locality-aware solutions that can be implemented: (1) blocking traffic, (2) using network caching, (3) shaping traffic and (4) using stateful policy management. Each one is examined to determine whether it is able to solve the problem.

1. Blocking traffic [36]: The aim of this solution is to reduce the traffic, and consequently the bandwidth, associated with P2P communication. This is usually done by blocking the ports associated with P2P networks. However, P2P applications often use dynamic ports and allow users to choose which ports to use, making this solution very difficult to implement. In the case of BitTorrent, when P2P traffic to the outside network is blocked, the tit-for-tat algorithm might start choosing peers from the same ISP to share the contents. However, if there are not enough peers in the same ISP sharing that content, user experience will be harmed. Although this mechanism can reduce the costs of P2P traffic to zero, its disadvantages are enough to discourage its implementation.

2. Using network caching [36]: This is a workable solution in which the ISP maintains a cache of popular P2P files and redirects clients to that cache. By doing this, traffic is reduced and kept as much as possible inside the ISP's network. However, this solution has problems associated with illegal content sharing, where no copyright fees are paid. It can only cache legal content and so cannot be used in every situation, since files are shared illegally in many P2P networks.

3. Shaping traffic [36]: The objective is to analyze traffic and label it with different priority levels. This way, P2P traffic is identified and given a low priority, and thus consumes less bandwidth. With this approach, an ISP can have great control over the traffic in its network. However, P2P applications have adopted several mechanisms to avoid being labeled as such, hiding the nature of the packets being exchanged. For example, BitTorrent can encrypt the data exchanged between peers, making it very hard to examine and label. For this reason, traffic shaping is very difficult to perform accurately. There are also several problems regarding user experience.
Since all packets need to be examined, tagged and queued for transmission according to their priority, a delay is likely to be added to the communication. Despite the cost reduction obtained from this mechanism, it is not an effective way of managing all P2P traffic.

4. Stateful policy management [36]: This uses stateful deep-packet inspection to intelligently identify, label and redirect P2P traffic away from expensive links. It can manage both downstream traffic, through redirections, and upstream traffic, by controlling the number of connections to outside networks. This is achieved by facilitating connections between peers inside the same ISP. If a peer is trying to download a file from an external source, the stateful policy management mechanism can check whether another peer inside its network is sharing that same file and redirect the connection to this internal source. This solution is transparent to the subscriber and does not have the problems associated with traffic shaping, since user experience remains the same. Compared to the others, this is the best solution.

Besides these solutions, there are others, like LiteLoad [18], that follow a stateful policy management approach. LiteLoad is a system that detects and manages P2P communications going to or coming from a peer, and does so completely unaware of the content being shared. It requires no blocking, caching or shaping of traffic. It simply looks for communication patterns that identify existing P2P networks. Through filters in the ISP's network, the packets that match known messages of one or more P2P protocols (for example, session initiation messages) are sent to LiteLoad and, based on predefined rules, the system decides either to let a packet reach its original destination or to change the packet's destination IP address to another peer's IP address. The new destination can be a peer internal to the ISP or a specific external one, avoiding expensive links. LiteLoad keeps a pool of the internal and external peers it has found, so there is no problem in redirecting communications to a peer other than the original destination. Although it filters and redirects messages, it has no interest in the content and does not analyze it or build any kind of index from it. This is done for two reasons: (1) to keep the system unaware of the content being shared, so that it cannot identify illegal content, and (2) to make the system able to work with both plain and encrypted protocols, since it only needs the header to do its job. The system was evaluated through a simulation with real peers, and a proof of concept was built. It was also tested with eMule in obfuscated mode (data encrypted) and proven to work. The authors' future work is to implement the system in a large-scale ISP and to make it more flexible so that it can deal with more user behavioral patterns.

3.1.4 Comparison between solutions

To summarize, Table 3.1 shows a simple comparison between the different implementations of locality mechanisms mentioned in the previous sections. For each solution, it lists the locality mechanism used, where it is implemented and what dependencies it has. The table shows that the locality mechanisms used by the different solutions have different dependencies, depending on where they are implemented.
Implementations that rely only on modifications to the BitTorrent client application tend to depend on other services already deployed, like CDNs, or on Internet topology maps, to retrieve information on the AS hop count between each peer and the client itself. ASR and TopBT are the exceptions. TopBT only needs an IP-to-AS map and its own traceroute results to calculate the AS hop count between itself and other peers, while ASR has no dependencies. However, ASR's results depend on the percentage of peers in the network that also use ASR. Solutions like the Oracle or P4P give ISPs a method to redirect P2P traffic as they wish and depend only on the ISP for network topology information. Solutions implemented by the ISP alone also aim to redirect P2P traffic away from undesirable links, but they depend on machines and methods capable of identifying this traffic.

Solution | Locality Mechanism | Implementation | Dependencies
Ono | Proximity to a CDN server | BitTorrent client application | CDN network
TopBT | Proximity to other peers, both AS hop count and delay | BitTorrent client application | IP to AS maps (for calculating AS hops)
ASR | Proximity to other peers based on network hops | P2P client application | User experience depends on the percentage of peers in the network running ASR
Biased Peer Selection | Proximity to other peers based on AS hop count | BitTorrent client application and/or Tracker | Internet Topology Maps and/or ISP (for IP address information)
Biased Unchoking | Proximity to other peers based on AS hop count | BitTorrent client application | Internet Topology Maps. Should be combined with Biased Peer Selection
Oracle/P4P | Proximity to other peers based on rules defined by ISPs | ISP or independent service (small modifications to the BitTorrent client might also be needed) | ISP (for network topology information)
Traffic shaping/caching/etc. | Redirecting traffic away from expensive links | ISP | Capability for identifying/caching/shaping traffic
LiteLoad | Redirecting traffic away from expensive links | ISP | Capability of identifying patterns of user communication in P2P networks

Table 3.1: Comparison between some solutions already implemented.

3.2 Locality studies

3.2.1 Studies of BitTorrent's locality

Despite an increasing number of solutions, there is still a lot of work to be done in studying the viability of locality mechanisms. The main objective of these approaches is to decrease inter-ISP traffic in a way that does not affect the user experience. Since P2P networks are very dynamic and thus very unpredictable, one must first understand the protocol, how it works and how users take advantage of it.

One of the first studies of the inter-ISP traffic caused by P2P communications, and of how the costs of content distribution are shifting from CDNs to ISPs, was [21]. This study analyzed BitTorrent tracker logs and payload packet traces collected at the edge of a 20,000-user access network. After analyzing this data, the conclusion was that there was enough locality to be exploited. Finally, the performance of some solutions was evaluated, such as having proxy-trackers at the edge of a network, or solutions based on domain names, matching rules or network-aware clustering [22].

Another study [30] discusses the difficulties related to locality mechanisms in BitTorrent and compares mechanisms such as Ono, latency, shortest path and same AS.
The study's conclusion is that locality exists but is not enough to be exploited without degrading the network's robustness.

Although the previous solutions and studies are related to the work explained in this document, they focus on obtaining results through the implementation of solutions. This work aims instead to monitor several real-life swarms in real time and in a real network topology.

One of the first approaches to monitoring real-life swarms was [23]. In this paper, the lifetime of a torrent was followed over 5 months. The torrent observed was a Red Hat Linux 9 distribution, and 180 thousand clients downloaded it during the 5 months. The purpose of this monitoring was not to study locality issues but rather BitTorrent itself. Still, geographic information associated with each peer was logged. Since only one torrent was monitored (one known from the start to be popular), this paper can show every step in a torrent's life, especially the flash crowd.

In another work [42], the authors, claiming that studies until then seldom discussed content and peer diversity, monitored BitTorrent swarms for a few thousand video and non-video files.

They automatically downloaded every torrent they could from a specific web page that advertises torrent files, and had an application running on PlanetLab to gather as much information as possible about each file and about the peers downloading and sharing it. The results of their real-world measurements show that a global locality approach is not the best choice, since most AS clusters have no potential for locality.

In [11], the authors' main goal was to answer questions such as "what are the win-win boundaries for ISPs and their users?" or "what is the maximum amount of transit traffic that can be localized without requiring fine-grained control of inter-AS overlay connections?", among others. They collected 100k torrent files and then constantly queried the tracker to gather all the information associated with each torrent. Each peer was associated with its ISP and download speed. This speed was not measured in real time but taken from available speed-test services. With this data, they studied the performance of several mechanisms as means for locality. They conclude that, in most cases, locality mechanisms lead to a win-win situation.

Although the last two papers obtained real results from real-world measurements, they only studied public trackers, where there are no incentives for prolonged sharing of a file. For this reason, the authors of [43] took a different approach. They followed torrents in both private and public trackers and compared the results. Since in private trackers peers have a reputation that consists of an upload/download ratio, peers need to share what they download for much longer than in public trackers. This is the so-called Share Ratio Enforcement [43]. Also, private trackers are usually not open to everyone as public ones are, and they often ban users that do not share enough. The need to upload as much as a peer can is called, in the paper, uploading starvation.
Monitoring both torrent and user activity in private trackers and comparing it to public ones shows that the two behave very differently and make very different contributions to BitTorrent traffic in the network.

A more recent study [17] follows the previous one, but also puts a lot of focus on the popularity of the torrent file. As expected, and as confirmed by measurements, different torrents and their associated swarms have different locality awareness. For example, English video files tend to be popular worldwide, while Spanish or German video files tend to be popular only in some regions. Another aspect observed in this work was a clear statistical indication of day-night behavior. Their conclusions show that the bigger the swarm, the more potential it has for locality, and that the variation in the number of peers per AS is quite high, which can make locality mechanisms harder to implement.

3.2.2 Comparison between related studies

Table 3.2 shows a comparison between some of the previously mentioned studies of the BitTorrent network, showing the different aspects monitored in each of them. Among the compared studies, [23] was the one that collected the most information, especially regarding swarm and individual peer lifetime. However, it only followed one torrent file, so only one swarm was observed. As for [17], it measured swarm and individual peer lifetime on selected swarms and only over the course of several days. For this reason, its measurement was considered limited. As for peer