Online Storage and Content Distribution System at a Large-scale: Peer-assistance and Beyond




Bo Li
Email: bli@cse.ust.hk
Department of Computer Science and Engineering, Hong Kong University of Science & Technology
IEEE CCGrid @ Shanghai, May 20, 2009

Outline
Online Storage and Content Distribution
Objectives & Challenges
Conceptual Design Space
FS2You: Architecture & Mechanisms
Measurement Results & Discussion
Conclusion & Future Work

Online Storage and Content Distribution
Online hosting services allow users to upload files, both small and large, onto dedicated servers, to be shared among a potentially large group of interested users

Online Storage and Content Distribution
A new type of content distribution service: online storage and file sharing are becoming increasingly popular
1-click hosting: ranked 17th in the world by Alexa; accounts daily for 3.18% of global Internet users

Online Storage and Content Distribution
Features compared with conventional P2P file sharing such as BitTorrent:
  Better reliability and service guarantee
  Ease of use: a simple URL shared with others, one-click service
  Little or no software download and configuration

Online Storage and Content Distribution
Files are hosted in either CDNs or dedicated large data centers
  Rapidshare: 1500 TB of storage in its data centers, 110 Gb/s
Skyrocketing server bandwidth costs: 15~20 million USD yearly
  Providers impose usage restrictions and/or paid service


Peer-Assisted Online Storage and Distribution
Peer assistance is natural but non-trivial in design
Balance two extremes: a cost-performance tradeoff
  Server-based distribution: guarantees file availability at the prohibitive cost of server bandwidth & storage
  P2P file sharing: good scalability, but no guarantees on file availability
A seamless integration: peer-assisted online storage and distribution

Design Objectives
Couple peer upload contribution & strategic server provisioning in a complementary manner
Improve file availability & users' downloading performance, while conserving substantial server costs
Conceptual design space
Practical implementation: FS2You

Challenges
Large number of files with highly diverse popularity and different sizes
Performance (availability) and user experience
Few or no restrictions on user access
Uploading (bandwidth) and downloading (storage and availability)
Peer-assistance integration
Limited or restricted server storage and bandwidth
Semi-persistent file availability
Peer assistance to conserve server bandwidth costs
Maintain adequate levels of service quality & user experience


General Model & Performance Metrics
Important performance metrics to characterize good online storage and distribution systems from different perspectives
Model:
  Multiple files of diverse popularity & sizes
  Limited server storage
  Limited server bandwidth
  Peer assistance effectiveness
  Peer upload/download capacity
File availability: attract & serve as many users as possible
System throughput: maintain as high downloading performance as possible

Design Space: Storage & Replacement
Given a constrained server storage capacity, a server storage & replacement strategy determines which set of files is stored on the server
Problem abstraction: a classical 0-1 knapsack problem with respect to different objective functions

Design Space: Storage & Replacement
Objective functions:
  To attract & serve as many users as possible
  To achieve the maximum system-wide throughput
NP-complete; can be solved using a dynamic programming algorithm with a complexity of …
The static nature of this formulation makes it inefficient for practical systems
Not suitable for the eviction or replacement operation, which must handle not only the dynamic evolution of user interests in currently stored files but also a continuous flow of newly uploaded files from users
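The dynamic programming approach mentioned on this slide can be illustrated with the textbook 0-1 knapsack recurrence. This is a minimal sketch, not the presenter's formulation: the function name `select_files`, the integer size units, and the generic per-file `profits` (standing in for either objective function) are assumptions made for illustration. The sketch runs in time proportional to the number of files times the capacity.

```python
from typing import List, Tuple


def select_files(sizes: List[int], profits: List[float],
                 capacity: int) -> Tuple[float, List[int]]:
    """Textbook 0-1 knapsack: pick files to store within `capacity`.

    sizes are positive integers (e.g. in GB); profits are a hypothetical
    per-file benefit such as expected number of users served.
    """
    n = len(sizes)
    # best[c] = highest total profit using the files seen so far and capacity c
    best = [0.0] * (capacity + 1)
    # keep[i][c] = True if file i is part of the optimum for capacity c
    keep = [[False] * (capacity + 1) for _ in range(n)]
    for i in range(n):
        # iterate capacities downwards so each file is used at most once
        for c in range(capacity, sizes[i] - 1, -1):
            candidate = best[c - sizes[i]] + profits[i]
            if candidate > best[c]:
                best[c] = candidate
                keep[i][c] = True
    # walk back through the decisions to recover the chosen set
    chosen, c = [], capacity
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(i)
            c -= sizes[i]
    chosen.reverse()
    return best[capacity], chosen
```

Because the table is rebuilt from scratch whenever popularity changes or a new file arrives, the sketch also shows why the slide calls the approach static.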

Design Space: Storage & Replacement
Simplicity & efficiency are more of a concern in practical system implementation and operation, at the cost of an acceptable sub-optimal solution
This provides a simple framework for a server storage & replacement strategy:
  Each file has a profit-to-weight index H_i
  Files are ranked in descending order of their indices
  Following a greedy algorithm, files with relatively high ranks are preferentially stored
  Alternatively, files with lower ranks can be identified simply & efficiently, and evicted or replaced whenever necessary

Design Space: Storage & Replacement
Unify important aspects with tunable design knobs
  K = 0: unpopular-first-eviction strategy; maximizes system throughput
  K = 1: balanced consideration between file popularity & size; maximizes system-wide file availability
  K in (0, 1): various degrees of throughput & availability
Flexibly applied in practical systems
  H_i is dynamically updated to adapt to the evolution of user interests
  Files are ranked periodically, in either a fine- or coarse-grained manner
  For eviction/replacement, either start from the lowest-ranked files until a certain volume of files is evicted, or set a threshold on H_i below which files are candidates for eviction
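The greedy ranking and eviction procedure can be sketched as follows. The formula for the index did not survive in this transcript, so the form used here, popularity divided by size raised to the power k, is an assumption chosen only because it reproduces the two cases named on the slide (k = 0 ranks by popularity alone, k = 1 balances popularity against size). The function names and the input layout are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def rank_index(popularity: float, size: float, k: float) -> float:
    # Assumed form, not taken from the slides: H_i = popularity_i / size_i**k
    # (size must be positive)
    return popularity / (size ** k)


def pick_evictions(files: Dict[str, Tuple[float, float]], k: float,
                   volume_to_free: float) -> List[str]:
    """files maps name -> (popularity, size).

    Evicts the lowest-ranked files first until at least `volume_to_free`
    units of storage have been released.
    """
    ranked = sorted(files, key=lambda name: rank_index(*files[name], k))
    evicted, freed = [], 0.0
    for name in ranked:
        if freed >= volume_to_free:
            break
        evicted.append(name)
        freed += files[name][1]
    return evicted
```

With k = 0 a small, rarely requested file is evicted before a large, moderately popular one; with k = 1 the large file goes first because it serves fewer requests per unit of storage.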

Illustration: Applicability & Flexibility
[Figure: file availability and system throughput (GB/second) versus k, with the unpopular-first eviction point marked]
Opportunity to achieve both high availability & throughput
[Figure: file availability versus server storage capacity (GB) for k = 0, 0.4, 1, 2]
With more emphasis on file availability, a real-world system with this customization will be demonstrated later

Design Space: Bandwidth Allocation
What is the optimal server bandwidth allocation across files to achieve the upper bound of the system-wide average downloading rate?
Objective: to maximize the system-wide average downloading rate
Problem abstraction: a classical continuous knapsack problem with bounded variables
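A continuous knapsack problem with bounded variables is solved exactly by a greedy fill in decreasing order of per-unit gain. The sketch below shows that generic technique only; the objective function on the slide was lost in extraction, so `gains` (benefit per unit of server bandwidth given to a file) and `upper_bounds` (the most bandwidth a file can usefully absorb) are hypothetical inputs, not the presenter's definitions.

```python
from typing import List


def allocate_fractional(gains: List[float], upper_bounds: List[float],
                        budget: float) -> List[float]:
    """Greedy optimum of: max sum(g_i * x_i)
    subject to sum(x_i) <= budget and 0 <= x_i <= u_i."""
    alloc = [0.0] * len(gains)
    remaining = budget
    # serve files in decreasing order of per-unit gain
    for i in sorted(range(len(gains)), key=lambda j: gains[j], reverse=True):
        if remaining <= 0 or gains[i] <= 0:
            break
        alloc[i] = min(upper_bounds[i], remaining)
        remaining -= alloc[i]
    return alloc
```

If the per-unit gain is larger for files with low peer assistance effectiveness, this greedy order gives them server bandwidth first, which matches the guideline of favoring less popular files.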

Design Space: Bandwidth Allocation How to design a near-optimal allocation strategy, that is simple enough to be implemented in practical systems? Follow the guideline conveyed by the optimal strategy Allocate more server bandwidth to less popular files with lower peer assistance effectiveness, while allowing popular ones to largely rely on peer assistance rather than server

Design Space: Bandwidth Allocation
A simple framework of server bandwidth allocation
  Each file has a priority index P_i that is inversely proportional to file popularity, implying peer assistance awareness
  Server bandwidth is allocated across files according to their relative weighting

Server-side design: bandwidth allocation Tunable design knobs wide design spectrum l=-1 request-driven strategy popular files are provisioned with more server bandwidth typically used in traditional server-based systems without peer assistance l=0 water-leveling strategy can practically work well in peer-assisted online storage and distribution systems l mimic the optimal strategy Easily applied in practical systems P i periodically updated adapt to the evolution of user interests file popularity simply captured by recording the file request count over a certain period various degree of file popularity awareness implying peer assistance awareness
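The relative-weighting framework with the knob l can be sketched as below. The exact expression for P_i is not present in this transcript; the form assumed here, request count raised to the power -l, is a guess that happens to reproduce the named cases (l = -1 gives shares proportional to requests, l = 0 gives equal shares, positive l favors unpopular files). The function name and the input dictionary are hypothetical.

```python
from typing import Dict


def allocate_bandwidth(request_counts: Dict[str, int], total_bandwidth: float,
                       l: float) -> Dict[str, float]:
    """Split total_bandwidth across files by relative priority weight.

    Assumed form, not taken from the slides: P_i = requests_i ** (-l),
    share_i = P_i / sum_j P_j. Files with no requests are skipped.
    """
    priority = {f: float(n) ** (-l) for f, n in request_counts.items() if n > 0}
    total = sum(priority.values())
    return {f: total_bandwidth * p / total for f, p in priority.items()}
```

For two files with 90 and 10 requests and 100 units of bandwidth, l = -1 yields 90/10, l = 0 yields 50/50, and l = 1 yields 10/90 under this assumed form.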


Roxbeam Corp.
Peer-to-peer live streaming experiment, 2004: Coolstreaming (Google 1,000,000 entries in 2008)
  Xinyan Zhang, Jiangchuan Liu, Bo Li, and Peter Yum, Coolstreaming/DONet: a data-driven overlay network for peer-to-peer live media streaming, Proc. of IEEE Infocom 2005
  Credited as the first large-scale Internet P2P live streaming system
Roxbeam Inc., 2005 onward
  Softbank (Japan), VC; Xinyan Zhang, co-founder
  Wall Street Journal incident, Oct 2005, and PPLive Inc
  Legal content P2P streaming: Japan Yahoo BB (2006), Phoenix TV
Online hosting service: FS2You system (2006-2007), Google 800,000 entries in 2009

FS2You: Architecture
A real-world large-scale peer-assisted online storage system
One of the most popular online hosting services in China
Tracking server: channel (file) info & MD5; bootstrapping; list of peers in channels
Hosting servers (60 servers): upload/hosting; download
Peers: upload; download

Peer Partnership & Content Delivery
Combine coarse-grained tracking servers & a decentralized gossip protocol
  Periodic partnership update, resilient to peer dynamics
  Periodic status report (peer ID & IP)
Content
  Periodic exchange of Block Maps (BMs) among peers enables them to locate the needed blocks
  Retrieve distinct blocks from multiple partners
Request-from-server conditions
  No partners (unpopular file or connection failure)
  None of the partners holds the desired block
  Aggregate downloading rate from partners < 10 KB/second (empirically determined, to prevent peers from aggressively consuming server bandwidth)
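The three request-from-server conditions can be expressed as a single predicate. This is a sketch of the rule as stated on the slide, not FS2You's client code: the function name, the representation of a block map as a set of block numbers per partner, and the way the aggregate rate is passed in are all assumptions.

```python
from typing import Dict, Set

# Empirical threshold quoted on the slide (KB/second)
SERVER_RATE_THRESHOLD_KBPS = 10.0


def should_request_from_server(block: int,
                               partner_block_maps: Dict[str, Set[int]],
                               aggregate_rate_kbps: float) -> bool:
    """Return True if the peer should fall back to a hosting server.

    partner_block_maps maps a (hypothetical) partner ID to the set of
    block numbers that partner advertised in its latest block map.
    """
    # Condition 1: no partners at all
    if not partner_block_maps:
        return True
    # Condition 2: no partner holds the desired block
    if not any(block in blocks for blocks in partner_block_maps.values()):
        return True
    # Condition 3: partners are too slow in aggregate
    return aggregate_rate_kbps < SERVER_RATE_THRESHOLD_KBPS
```

A peer with a healthy partner set that holds the block and delivers above the threshold never touches the server, which is how the rule limits server bandwidth consumption.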

Server-side Strategies: Uploading Service Not only provide online storage, but also cooperate with content distribution Uploading and storage services No size/format limitations attract millions of users 500GB~1TB content routinely uploaded per day One single copy of a file stored in one of the servers

Server-side Strategies: Downloading Service
Complement peers to supply file blocks, especially to those suffering poor downloading rates
How to properly satisfy a potentially large number of requests without incurring prohibitively high bandwidth costs?
  1st block: user experience
  Probabilistic service based on file popularity
  File popularity index inversely proportional to the number of requests
  Peers in popular channels largely rely on peer assistance rather than servers
  Allocate more server resources to unpopular files
A specific design instance with the knob l >= 0
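A probabilistic serving rule of this kind might look like the sketch below. The slide gives only the qualitative behavior (service probability falls as the request count rises, with knob l >= 0), so the exact formula, the `scale` parameter, and both function names are assumptions made for illustration.

```python
import random


def serve_probability(request_count: int, l: float = 1.0,
                      scale: float = 1.0) -> float:
    """Probability that the server answers a block request.

    Assumed form, not taken from the slides: min(1, scale / requests**l).
    """
    if request_count <= 0:
        return 1.0  # nothing known about the file: always serve
    return min(1.0, scale / (request_count ** l))


def server_accepts(request_count: int, rng: random.Random,
                   l: float = 1.0, scale: float = 1.0) -> bool:
    # Draw once per request; popular files are rejected more often
    return rng.random() < serve_probability(request_count, l, scale)
```

With l = 0 every request is served, which corresponds to no popularity awareness; larger l pushes peers in popular channels toward their partners.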


Trace Collection To evaluate the performance of FS2You, we have implemented a detailed logging mechanism 350 GB traces from 3.3 million users, from June 21 to July 18, 2008 Each peer reports activities & status to the log servers Download Event Summary (event-driven) Peer ID, Channel IDs, File size Time of open/close/completion Total downloaded volume, downloaded volume from servers File Source Snapshot (periodically, overhead/accuracy)
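The two volume fields in the download event summary are enough to estimate how much of each channel's traffic came from peers. The sketch below assumes that peer assistance effectiveness means the fraction of downloaded volume not supplied by servers; that definition, the tuple layout, and the function name are assumptions, since the transcript does not define the metric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def peer_assistance_by_channel(
        events: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """events: (channel_id, total_downloaded, downloaded_from_servers).

    Returns, per channel, the fraction of volume that came from peers.
    """
    total = defaultdict(float)
    from_server = defaultdict(float)
    for channel, downloaded, server_part in events:
        total[channel] += downloaded
        from_server[channel] += server_part
    # channels with no downloaded volume are omitted to avoid dividing by zero
    return {ch: 1.0 - from_server[ch] / total[ch]
            for ch in total if total[ch] > 0}
```

Aggregating volumes before dividing weights each download by its size, so a few large server-fed downloads are not hidden by many small peer-fed ones.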

Measurement Study Peer assistance What are the typical peer dynamics & behaviors of both short & long period, and the implications on peer resource utilization? Which set of peers contributes most to the system? Peer dynamics and behavior Reflect user demand & Fine tune server strategies File Characteristics Service Quality User Experience File availability & downloading rate File size & type preferences File popularity & request/replica distribution Correlation with peer assistance effectiveness

Overall Scale & Performance: A large number of users
[Figure annotations: weekend pattern; crash failures of log server]

Overall Scale & Performance: Huge traffic volumes
Up to 80% of traffic is contributed by P2P, alleviating server load
Even during calm periods, more than 70% of server bandwidth is conserved
The architectural & protocol designs in FS2You can scale to a large number of peers, and can withstand the test of a tremendous volume of traffic (in the order of terabytes per day) over a long period of time
The cost of server capacity has been substantially reduced by peer assistance

File Characteristics: Popularity
47% compressed archives (e.g., zip/rar), most multimedia content; 30% videos, 12% audio, 11% others
Flatter than the Zipf prediction
  Immutability of files, and the fetch-at-most-once behavior
Well fitted by the stretched exponential distribution
  Useful for workload synthesis
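For workload synthesis, a rank-popularity curve can be generated from the rank form of the stretched exponential that is commonly used in media-access studies, y_i^c = -a log(i) + b, where y_i is the request count of the i-th ranked file. The sketch assumes that form; the slides do not give the fitted equation, and the parameter values in any call are hypothetical rather than fitted to the FS2You traces.

```python
import math
from typing import List


def stretched_exponential_counts(n_files: int, c: float, a: float) -> List[float]:
    """Synthetic request counts by rank under y_i**c = -a*log(i) + b.

    b is fixed so that the least popular file (rank n_files) gets about
    one request: b = 1 + a*log(n_files). c and a are free parameters.
    """
    b = 1.0 + a * math.log(n_files)
    return [(b - a * math.log(rank)) ** (1.0 / c)
            for rank in range(1, n_files + 1)]
```

On a log-log plot this curve bends downward at the head instead of following a straight Zipf line, which is the flattening the slide attributes to immutable files and fetch-at-most-once behavior.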

Correlation: File Popularity & Peer Assistance Effectiveness Adjacent-averaging smoothing In general, popular files enjoy higher peer assistance effectiveness Highly popular ones 80%~90% peer assistance effectiveness encouraging! Increasing noise variations in peer assistance as popularity decreases Interestingly, some less popular ones can also enjoy high peer assistance effectiveness, as some used to be popular with sufficient replicas among peers

Server Involvement & Service Quality
Valley: potential negative effects of the interaction between the current design of the request-from-server threshold & the server-side probabilistic serving strategy
Both files that are completely supplied by servers and those that are mainly supported by P2P enjoy high average downloading performance
Most experienced favorable downloading performance: avg. 66 KB/s; lowest > 40 KB/s

Further Exploration beyond FS2You's Customization
Experiment using real-world data sets
[Figure: average downloading rate (KB/second) versus server bandwidth (GB/second); curves for Optimal, l = 1, l = 0.5, l = 0, l = 4]
There is still room for improvement with respect to downloading performance


Conclusion
Within the design space, the FS2You system works well in practice at a large scale, as evidenced by an extensive measurement study
  Stress-testing: the architectural & protocol designs in FS2You can scale to a large number of peers, and withstand a tremendous volume of traffic over a long period of time
  The cost of server capacity can be substantially reduced by peer assistance
  The system provides high file availability & a satisfactory download experience to a large number of users with cost-effective server involvement

Conclusion
Significant cost savings
  Server storage capacity vs. daily volumes (50-60 TB)
  Aggregate server-side bandwidth
While FS2You represents a practical instance, the design space is not restricted to it
  Conceptual design (guideline) vs. practical implementation
  Simplicity and engineering issues
  Tunable knobs offer the flexibility to adapt to a wide range of design preferences for service providers & designers

Future Work (1)
Unveil inefficiencies & explore the causes
  As the system scales up to a large population, there still exist channels (files), times & scenarios where the download experience is unsatisfactory
  Discover issues that are counter-intuitive or hidden
  Apply statistical tools, data analysis and mining techniques to the traces
Different designs, with both theoretical analysis & practical implementation
  Rate or throughput optimization vs. file completion per unit time

Future Work (2)
Empirically determined parameters need to be fine-tuned based on both theoretical optimization & real-world experience
Interaction
  Peer-side request-from-server threshold
    A high threshold improves peer downloading rates
    But may incur excessive load on the servers
  Server-side probabilistic supply strategy
    Helps to reduce such server load
    But may sometimes leave out in the cold some peers that do need help from servers
How to find an optimal strategy to balance both sides in large-scale systems?

Thanks! Q&A