1 Uncovering the Big Players of the Web. 3rd TMA Workshop, Vienna, March 12. Vinicius Gehlen, Alessandro Finamore, Marco Mellia, Maurizio M. Munafò. TMA COST Action.
2 Introduction: Nowadays Internet traffic volume is mainly HTTP + P2P (figure: breakdown of downstream traffic of residential customers). The rest is mainly SSH, VoIP, DNS, etc. On top of that, a plethora of services!
3 Methodology: Focus only on HTTP traffic. Rely on Tstat to generate flow-level HTTP logs: L4 (#bytes, #pkts, RTT, etc.) and L7 (service type and meta-data, e.g. video). Rely on an organization database: each server IP is associated with its owner (e.g. server IP → AKAMAI TECHNOLOGIES).
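A minimal sketch of the mapping-and-aggregation step, under stated assumptions: the prefix table `ORG_DB`, the record fields (`server_ip`, `bytes`) and the organization names are hypothetical stand-ins for the organization database and the flow-level logs, not their real formats.

```python
import ipaddress
from collections import defaultdict

# Hypothetical organization database: network prefix -> owner name.
# Documentation prefixes are used as placeholders.
ORG_DB = {
    "192.0.2.0/24": "ORG-A",
    "198.51.100.0/24": "ORG-B",
}
NETWORKS = [(ipaddress.ip_network(p), org) for p, org in ORG_DB.items()]


def owner_of(server_ip: str) -> str:
    """Return the organization owning server_ip, or 'UNKNOWN'."""
    addr = ipaddress.ip_address(server_ip)
    for net, org in NETWORKS:
        if addr in net:
            return org
    return "UNKNOWN"


def shares_by_org(flows):
    """Return {org: (% of bytes, % of flows)} over all flow records."""
    byte_count = defaultdict(int)
    flow_count = defaultdict(int)
    for f in flows:
        org = owner_of(f["server_ip"])
        byte_count[org] += f["bytes"]
        flow_count[org] += 1
    total_bytes = sum(byte_count.values())
    total_flows = sum(flow_count.values())
    return {
        org: (100.0 * byte_count[org] / total_bytes,
              100.0 * flow_count[org] / total_flows)
        for org in byte_count
    }


# Made-up flow records, for illustration only.
flows = [
    {"server_ip": "192.0.2.10", "bytes": 6000},
    {"server_ip": "192.0.2.11", "bytes": 1000},
    {"server_ip": "198.51.100.5", "bytes": 2000},
    {"server_ip": "203.0.113.9", "bytes": 1000},
]
shares = shares_by_org(flows)
```

Sorting `shares` by the byte percentage would give a ranking of organizations by volume.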
4 Dataset: 3 vantage points (VPx) of an ISP in Italy. Residential customers: ADSL (VP2, VP3) + Fiber-To-The-Home (VP1). 1 week of traffic (20-24 June 2011).
5 OVERVIEW: Which organizations? Volumes? Popularity?
6 Top10 (+ 1) organizations
[Table: Rank, Org. Name, % Bytes, % Flows. Organizations in listed order: Google, Akamai, Leaseweb, Megaupload, Level3, Limelight, PSINet, Webzilla, Choopa, OVH, Facebook; plus a Total row.]
Google handles 2x the Akamai volume. Besides Google and Akamai, there are many others: well known (Level3, Limelight, Leaseweb, Megaupload) and less known (PSINet, Webzilla, Choopa). There are >10k organizations, but 65% of the volume is due to only 11 big players.
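7 Organizations popularity: % of client IPs that have contacted the organization at least once
Google: 97.1% of clients; video: YouTube; SW update: -; adv. & others: Google services
Akamai: 97.2% of clients; video: Vimeo; SW update: Microsoft, Apple; adv. & others: Facebook static content, eBay
Leaseweb: 64.3% of clients; video: Megavideo; SW update: -; adv. & others: publicbt.com
Megaupload: 15.6% of clients; video: Megavideo; SW update: -; adv. & others: FileHosting
Level3: video: YouPorn; SW update: -; adv. & others: quantserve, tinypic, Photobucket
Limelight: 72.5% of clients; video: Pornhub, Veoh; SW update: Avast; adv. & others: betclick, wdig, trafficjunky
PSINet: 44.6% of clients; video: Megavideo; SW update: Kaspersky; adv. & others: Imageshack
Webzilla: 13.2% of clients; video: Adult Video; SW update: -; adv. & others: Filesonic, Depositfiles
Choopa: zshare
OVH: 63.1% of clients; video: Auditude; SW update: -; adv. & others: Telaxo, m2cai
Facebook: 90.6% of clients; video: Facebook; SW update: -; adv. & others: Facebook dynamic content
97% of clients contact Google and Akamai. 63% of clients contact OVH (advertisement). 90.6% of clients contact Facebook!?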
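The popularity metric (share of client IPs that contacted an organization at least once) could be computed along these lines. This is only a sketch: the `(client_ip, organization)` pairs are made-up sample data, and the input format is an assumption.

```python
from collections import defaultdict


def popularity(flows):
    """Return {org: % of distinct client IPs that contacted it at least once}."""
    clients_by_org = defaultdict(set)
    all_clients = set()
    for client_ip, org in flows:
        all_clients.add(client_ip)
        clients_by_org[org].add(client_ip)
    return {
        org: 100.0 * len(clients) / len(all_clients)
        for org, clients in clients_by_org.items()
    }


# Hypothetical (client_ip, organization) pairs, one per observed flow.
sample = [
    ("10.0.0.1", "ORG-A"),
    ("10.0.0.1", "ORG-A"),  # repeated contacts count once
    ("10.0.0.2", "ORG-A"),
    ("10.0.0.2", "ORG-B"),
    ("10.0.0.3", "ORG-B"),
    ("10.0.0.4", "ORG-A"),
]
pop = popularity(sample)
```

Using sets makes the metric insensitive to how many flows a client opens, which is why it can differ sharply from the byte ranking.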
8 Why does FB see 90% of clients? You visit nutella.com (slurp!). There is an embedded object pointing to the FB fan page. This generates a connection to FB, so FB knows that you like Nutella. Privacy, anyone?!
9 Why does FB see 90% of clients?
GET /plugins/like.php?href=http%3a%2f%2fwww.facebook.com%2FNutella.Italy&layout=box_count&show_faces=false&width=120&action=like&colorscheme=light&height=65 HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/ (KHTML, like Gecko) Version/5.1.3 Safari/
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer:
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Cookie: presence=em euserfa a2estatefdsb2f0et2f_5b_5delm2fnulleuct2f BEtrFnullEtwF G H0EblcF0EsndF1CEchFDsubF_5b0_5dEp_5f F2CC; p=6; c_user= ; datr=fokptf9njodb9oshmlkp5e_ong..; lu=ggojzj70pxz6grarmpcnirxw; xs=1%3a889419cc0700aee88497c71b1015fa45%3a0%3a ; locale=en_us
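To make concrete what the request target alone reveals, here is a small sketch that decodes the `href` parameter of the like-button request shown on the slide. The function name `like_button_target` is hypothetical; only standard `urllib.parse` calls are used.

```python
from urllib.parse import urlsplit, parse_qs

# Request target copied from the slide (line-wrap spaces removed).
REQUEST_TARGET = (
    "/plugins/like.php?href=http%3a%2f%2fwww.facebook.com%2FNutella.Italy"
    "&layout=box_count&show_faces=false&width=120&action=like"
    "&colorscheme=light&height=65"
)


def like_button_target(request_target):
    """Return (path, decoded href) for a like-button request target."""
    parts = urlsplit(request_target)
    params = parse_qs(parts.query)  # percent-decoding happens here
    href = params.get("href", [None])[0]
    return parts.path, href


path, liked_page = like_button_target(REQUEST_TARGET)
```

The decoded `href` names the fan page, and the request also carries the user's Facebook cookies, so the visit can be tied to an account.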
10 CONTENT SERVED: VIDEO? SSL/TLS?
11 Which content is served? Video content is highly predominant.
12 Which type of content is served? Video content is highly predominant. 90% of Google is YouTube video download.
13 Which type of content is served? Video content is highly predominant. 90% of Google is YouTube video download. The majority of FileHosting is actually video content.
14 Which type of content is served? 20% of Akamai is Facebook.
15 Which type of content is served? 20% of Akamai is Facebook. Google, Akamai and Facebook have some HTTPS traffic.
16 Evolution of HTTPS: Compare HTTPS traffic from June 2011 and October 2011.
[Table: % Bytes and % Flows, Jun vs. Oct, for Google, Akamai, Leaseweb, Level3, Limelight, Megaupload, Facebook.]
The number of HTTPS connections is increasing for all the organizations. 25% of Facebook connections were HTTPS in October 2011. +7% of HTTPS volume for Akamai and Google in < 6 months.
18 Flow size: The majority of the flows are small: >50% of connections carry < 10 kB. File-hosting organizations are serving a lot of short flows.
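The ">50% of connections carry < 10 kB" statement is a point on the flow-size distribution. A sketch of that computation follows, assuming 10 kB means 10,000 bytes and using made-up flow sizes.

```python
def fraction_below(flow_sizes_bytes, threshold_bytes=10_000):
    """Fraction of flows strictly smaller than threshold_bytes.

    threshold_bytes=10_000 assumes "10 kB" means 10,000 bytes.
    """
    if not flow_sizes_bytes:
        return 0.0
    small = sum(1 for size in flow_sizes_bytes if size < threshold_bytes)
    return small / len(flow_sizes_bytes)


# Hypothetical flow sizes in bytes.
sizes = [500, 2_000, 9_999, 10_000, 5_000_000]
small_share = fraction_below(sizes)
```

Evaluating `fraction_below` over a range of thresholds would give the empirical CDF of flow size for one organization.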
19 Bulk download rate (1/2): Focusing on connections with > 1 MB, what is the download rate?
20 ORGANIZATIONS' INFRASTRUCTURE: BEHIND THE SCENES
21 RTT: Latency towards the Internet. Minimum RTT is measured by Tstat on a per-flow basis. Facebook has 2 locations (100 ms and 170 ms). Akamai and Limelight are the closest to the ISP (5 ms). 3 Google datacenters are preferred; only <30% of requests are served by the closest one (12 ms).
22 Number of IPs
[Table: Organization, No., % No., Top5 %bytes, for Google, Akamai, Leaseweb, Level3, Limelight, Megaupload, Facebook; plus a Total row.]
[count missing] IP addresses serve 65% of HTTP traffic. Google handles 2x the Akamai volume with 1/3 of the IPs. Most of the traffic is served by a few preferred IPs within an organization.
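The "Top5 %bytes" column can be read as the share of an organization's bytes served by its five busiest server IPs. A sketch under that reading follows; the per-IP byte counts are invented for illustration.

```python
from collections import Counter


def top_k_share(bytes_by_ip, k=5):
    """Percentage of total bytes served by the k busiest server IPs."""
    total = sum(bytes_by_ip.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in Counter(bytes_by_ip).most_common(k))
    return 100.0 * top / total


# Hypothetical per-IP byte counts for a single organization.
org_bytes = {
    "192.0.2.1": 400,
    "192.0.2.2": 300,
    "192.0.2.3": 100,
    "192.0.2.4": 50,
    "192.0.2.5": 50,
    "192.0.2.6": 60,
    "192.0.2.7": 40,
}
top5 = top_k_share(org_bytes)
```

A value close to 100 means the traffic is concentrated on a handful of preferred IPs, however many addresses the organization exposes.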
23 Volume served by % of IPs
24 Bulk download rate (2/2): Considering connections with > 1 MB, all organizations but Akamai (and Facebook) have >90% of connections at >500 kb/s. Caching policies may have an impact: (1) content already available → high bitrate; (2) content retrieved from backend → low bitrate.
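A sketch of how the bulk download rate could be derived from flow records. The assumptions are that 1 MB means 1,000,000 bytes, that kb/s means kilobits per second, and that each record has hypothetical `bytes` and `duration_s` fields.

```python
def bulk_rates_kbps(flows, min_bytes=1_000_000):
    """Download rate in kb/s for every flow larger than min_bytes."""
    return [
        f["bytes"] * 8 / 1000 / f["duration_s"]
        for f in flows
        if f["bytes"] > min_bytes and f["duration_s"] > 0
    ]


def share_above(rates, threshold_kbps=500):
    """Fraction of rates strictly above threshold_kbps."""
    if not rates:
        return 0.0
    return sum(1 for r in rates if r > threshold_kbps) / len(rates)


# Made-up flow records.
sample_flows = [
    {"bytes": 2_000_000, "duration_s": 8.0},    # 2000 kb/s
    {"bytes": 5_000_000, "duration_s": 100.0},  # 400 kb/s
    {"bytes": 50_000, "duration_s": 0.5},       # too small, ignored
    {"bytes": 4_000_000, "duration_s": 16.0},   # 2000 kb/s
]
rates = bulk_rates_kbps(sample_flows)
fast_share = share_above(rates)
```

Restricting to large flows keeps short transfers, which spend most of their life in TCP slow start, from dragging the rate down.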
25 Conclusions: We investigated what the web looks like these days. Some clear trends are visible: the majority of traffic is handled by a few big players; HTTPS is becoming very popular; many datacenters are used to manage demand. It is very difficult to understand who owns and who serves the content, which policies are used, and how much data leaks to these players. This makes a tangled web that is very hard to discern.
26 RTT variation during the day
27 Breakdown of HTTP traffic. HTTP traffic on PDF: the fraction of traffic handled by each organization is constant over time.
28 Comparing days & locations: Short-term stability, with marginal differences with respect to days of the week and locations of the users.