Wikimedia architecture. Mark Bergsma <mark@wikimedia.org> Wikimedia Foundation Inc.



Overview
- Intro
- Global architecture
- Content Delivery Network (CDN)
- Application servers
- Persistent storage
Focus on architecture, not so much on operations or non-technical stuff

Some figures
The Wikimedia Foundation:
- Was founded in June 2003, in Florida
- Currently has 9 employees; the rest is done by volunteers
- Yearly budget of around $2M, supported mostly through donations
- Supports the popular Wikipedia project, but also 8 others: Wiktionary, Wikinews, Wikibooks, Wikiquote, Wikisource, Wikiversity, Wikispecies, Wikimedia Commons

Some figures
Wikipedia:
- 8 million articles spread over hundreds of language projects (English, Dutch, ...)
- 110 million revisions
- 10th busiest site in the world (source: Alexa)
- Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers

Some technical figures
- 30 000 HTTP requests/s during peak time
- 3 Gbit/s of data traffic
- 3 data centers: Tampa, Amsterdam, Seoul
- 350 servers, ranging from 1x P4 to 2x quad-core Xeon, with 0.5-16 GB of memory
- ...managed by ~6 people

Pretty graphs!

Architecture: LAMP...
(Diagram: Users -> Apache web server + PHP -> MySQL, all running on Linux)

...on steroids.
(Architecture diagram.) Users reach the site through GeoDNS, LVS load balancers and the Squid caching layers, then another LVS layer in front of the MediaWiki application servers (Apache + PHP). Behind those sit the core databases (MySQL), external storage, the distributed object cache (Memcached), the image server (Lighttpd) and search (Lucene). The links in the diagram are labelled HTTP, MySQL, NFS, Memcached, DNS, HTCP invalidation notification, profiling and logging.

Content Distribution Network (CDN)
- 3 clusters on 3 different continents:
  - Primary cluster in Tampa, Florida
  - Secondary caching-only clusters in Amsterdam, the Netherlands and Seoul, South Korea
- Geographic load balancing, based on source IP of the client resolver, directs clients to the nearest server cluster
- Works by statically mapping IP addresses to countries to clusters
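The country-to-cluster step can be sketched in a few lines; everything below (the lookup table, the cluster names and the documentation-range IP addresses) is a hypothetical illustration of the idea, not the actual PowerDNS setup.

```php
<?php
// Hypothetical sketch of geographic load balancing: the country derived from
// the client resolver's IP selects which cluster's address the DNS answer uses.
$clusterByCountry = [
    'US' => 'tampa',     // primary cluster, Florida
    'NL' => 'amsterdam', // caching-only cluster, Europe
    'DE' => 'amsterdam',
    'KR' => 'seoul',     // caching-only cluster, Asia
    'JP' => 'seoul',
];

// RFC 5737 documentation addresses as placeholders for the cluster addresses.
$clusterAddress = [
    'tampa'     => '192.0.2.10',
    'amsterdam' => '198.51.100.10',
    'seoul'     => '203.0.113.10',
];

function pickCluster(string $countryCode, array $map): string
{
    // Unknown countries fall back to the primary cluster.
    return $map[$countryCode] ?? 'tampa';
}

$cluster = pickCluster('NL', $clusterByCountry);
echo "A record answer: {$clusterAddress[$cluster]}\n"; // served from Amsterdam
```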

Squid caching
- HTTP reverse proxy caching implemented using Squid
- Split into two groups with different characteristics:
  - Text, for wiki content (mostly compressed HTML pages)
  - Media, for images and other forms of relatively large static files

Squid caching
- 55 Squid servers currently, plus 20 waiting for setup
- ~1 000 HTTP requests/s per server, up to 2 500 under stress
- ~100-250 Mbit/s per server
- ~14 000-32 000 open connections per server

Squid caching
- Up to 40 GB of disk caches per Squid server
- Disk seek I/O limited: the more disk spindles, the better!
- 8 GB of memory, half of that used by Squid
- Up to 4 disks per server (1U rack servers)
- Hit rates: 85% for Text, 98% for Media, since the use of CARP

(Request flow diagram.) People & browsers are directed by PowerDNS (geo-distribution) to either the primary datacenter or a regional datacenter. Each has a text cache cluster and a media cache cluster, with LVS in front of CARP Squids in front of cache Squids; the regional caches forward misses to the primary datacenter, which also holds the application servers and media storage. Purges are distributed via multicast.

(CARP routing diagram.) Clients send a destination URL to the hash routing layer of CARP Squids, which routes it to one Squid in the caching layer; both the Florida primary cluster (sq1, sq2, sq3, sq4, ...) and the Amsterdam cluster (knsq1, knsq2, knsq3, ...) have such a routing and caching layer in front of the origin server(s).
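The hash routing layer can be illustrated with a simplified rendezvous-style hash, which is the idea CARP builds on: every frontend scores each (URL, peer) pair and forwards to the highest-scoring peer, so all frontends independently agree on where a given URL is cached. This is a sketch of the concept only, not Squid's actual CARP algorithm with its membership table and weights.

```php
<?php
// Simplified CARP-like routing: deterministically map a destination URL to one
// cache peer, so every frontend Squid picks the same peer for the same URL.
function routeUrlToPeer(string $url, array $peers): string
{
    $best = '';
    $bestScore = -1;
    foreach ($peers as $peer) {
        // Combined hash of peer name and URL; crc32 keeps the example short.
        $score = crc32($peer . '|' . $url);
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $peer;
        }
    }
    return $best;
}

$peers = ['sq1', 'sq2', 'sq3', 'sq4'];
echo routeUrlToPeer('http://en.wikipedia.org/wiki/Squid_(software)', $peers), "\n";
// The same URL always lands on the same peer; different URLs spread across peers.
```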

Squid cache invalidation
- Wiki pages are edited at an unpredictable rate
- Only the latest revision of a page should be served at all times, in order not to hinder collaboration
- Invalidation through expiry times is not acceptable; explicit cache purging needs to be done
- Implemented using the UDP-based HTCP protocol: on edit, application servers send out a single message containing the URL to be invalidated, which is delivered over multicast to all subscribed Squid caches
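The purge mechanism can be sketched as a single UDP datagram sent to a multicast group that all Squids have joined; the real messages are binary HTCP CLR packets, so the plain-text payload, the group address and the port below are placeholders for illustration only.

```php
<?php
// Illustration of multicast purging: on edit, send one small UDP datagram
// naming the URL; every cache subscribed to the group drops its copy.
// Not the real HTCP CLR wire format - the payload here is just the plain URL.
function sendPurge(string $url, string $group = '239.128.0.112', int $port = 4827): void
{
    $sock = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
    // One sendto() is enough: multicast fans the packet out to all subscribers,
    // so the sender never needs to know how many caches are listening.
    socket_sendto($sock, $url, strlen($url), 0, $group, $port);
    socket_close($sock);
}

sendPurge('http://en.wikipedia.org/wiki/Main_Page');
```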

The Wiki software
- All Wikimedia projects run on the MediaWiki platform
- Open source software (GPL)
- Designed primarily for use by Wikipedia/Wikimedia, but also used by many outside parties
- Arguably the most popular wiki engine out there
- Written in PHP
- Very scalable, very good localization
- Storage primarily in MySQL; other DBMSes supported

MediaWiki
MediaWiki in our application server platform:
- ~125 servers, 40 waiting to be set up
- MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
- One centrally managed & synchronized software installation for hundreds of wikis
- Hardware shared with External Storage and Memcached tasks

MediaWiki caching
- Caches everywhere (tell about Memcached, no dedicated slide)
- Sometimes caching costs more than recalculating or looking it up at the data source... profiling!
- Most of this data is cached in Memcached, a distributed object cache
(Diagram of cache layers around MediaWiki: content acceleration & distribution network, image metadata, users & sessions, parser cache, primary and secondary interface language caches, difference cache, revision text cache)
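Most of the cache types listed above follow the same cache-aside pattern against Memcached. Below is a minimal sketch using the PHP Memcached extension; the server address, key layout, TTL and recompute callback are made up for illustration.

```php
<?php
// Cache-aside with Memcached: try the cache first, recompute and store on a miss.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211); // placeholder; production uses many servers

function getWithCache(Memcached $mc, string $key, int $ttl, callable $compute)
{
    $value = $mc->get($key);
    if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
        return $value; // cache hit, skip the expensive work
    }
    // Cache miss: do the expensive work once, then share the result.
    $value = $compute();
    $mc->set($key, $value, $ttl);
    return $value;
}

// Hypothetical parser-cache style usage:
$html = getWithCache($mc, 'parsercache:enwiki:Main_Page', 3600, function () {
    return '<p>expensive parse result</p>'; // stand-in for parsing wikitext
});
```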

MediaWiki dependencies
- A lot of dependencies in our setup
(Dependency diagram around MediaWiki: Apache, PHP, APC, FastStringSearch, Proctitle, ImageMagick, TeX, DjVu, rsvg, ploticus)

MediaWiki optimization
We try to optimize by...
- not doing anything stupid
- avoiding expensive algorithms, database queries, etc.
- caching every result that is expensive and has temporal locality of reference
- focusing on the hot spots in the code (profiling!)
If a MediaWiki feature is too expensive, it doesn't get enabled on Wikipedia
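Finding the hot spots mostly means timing named sections of code and looking at the totals. MediaWiki has its own profiler (see the report URL below); this is just a generic microtime-based sketch of the idea, with invented section names.

```php
<?php
// Minimal wall-clock profiler sketch: time named sections and report totals,
// so the most expensive code paths stand out.
$profile = [];

function profileSection(array &$profile, string $name, callable $fn)
{
    $start = microtime(true);
    try {
        return $fn();
    } finally {
        $profile[$name] = ($profile[$name] ?? 0.0) + (microtime(true) - $start);
    }
}

profileSection($profile, 'parse', function () { usleep(20000); });   // pretend work
profileSection($profile, 'db-query', function () { usleep(5000); }); // pretend work

arsort($profile); // most expensive section first
foreach ($profile as $name => $seconds) {
    printf("%-10s %.4f s\n", $name, $seconds);
}
```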

http://noc.wikimedia.org/cgi-bin/report.py

MediaWiki profiling

Persistent data
Persistent data is stored in the following ways:
- Metadata, such as article revision history, article relations (links, categories, etc.), user accounts and settings, is stored in the core databases
- Actual revision text is stored as blobs in External storage
- Static (uploaded) files, such as images, are stored separately on the image server; metadata (size, type, etc.) is cached in the core database and object caches

Core databases
- Separate database per wiki (not separate server!)
- One master, many replicated slaves
- Read operations are load balanced over the slaves, write operations go to the master
- The master is used for some read operations in case the slaves are not yet up to date (lagged)
- Runs on ~15 DB servers with 4-16 GB of memory, 6x 73-146 GB disks and 2 CPUs each
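A sketch of the read/write split just described: writes always go to the master, reads are spread over the slaves, and a read falls back to the master when a slave is lagging too far behind. The DSNs, credentials, lag threshold and queries below are hypothetical.

```php
<?php
// Hypothetical master/slave split: writes to the master, reads load balanced
// over the replicated slaves, with a fallback when replication lag is too high.
$master = new PDO('mysql:host=db-master;dbname=enwiki', 'user', 'pass');
$slaves = [
    new PDO('mysql:host=db-slave1;dbname=enwiki', 'user', 'pass'),
    new PDO('mysql:host=db-slave2;dbname=enwiki', 'user', 'pass'),
];

function connectionForRead(PDO $master, array $slaves, int $maxLagSeconds = 30): PDO
{
    $slave  = $slaves[array_rand($slaves)];  // naive load balancing
    $status = $slave->query('SHOW SLAVE STATUS')->fetch(PDO::FETCH_ASSOC) ?: [];
    $lag    = $status['Seconds_Behind_Master'] ?? null;
    // A lagged (or broken) slave would serve stale data: use the master instead.
    return ($lag === null || $lag > $maxLagSeconds) ? $master : $slave;
}

// Writes always hit the master.
$master->prepare('UPDATE page SET page_touched = NOW() WHERE page_id = ?')
       ->execute([42]);

// Reads go to a sufficiently fresh slave.
$db  = connectionForRead($master, $slaves);
$row = $db->query('SELECT page_title FROM page WHERE page_id = 42')->fetch();
```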

Core database scaling
Scaling by:
- Separating read and write operations (master/slave)
- Separating some expensive operations from cheap and more frequent operations (query groups)
- Separating big, popular wikis from smaller wikis
This improves caching (temporal and spatial locality of reference) and reduces the data set size per server

Core database schema
(Schema diagram of tables: Page, Revision, Text, Pagelinks, Templatelinks, Categorylinks, Imagelinks, Externallinks, Langlinks, Recent changes, Watchlist, Image, Oldimage, File archive, User, User groups, Blocks, Logging, Jobs, Site stats, Restrictions, Profiling, ...)

External Storage
- Article text is stored on separate data storage clusters
- Simple append-only blob storage
- Saves space on expensive and busy core databases for largely unused data
- Allows use of spare resources on application servers (2x 250-500 GB per server)
- Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability

Text compression
- All revisions of all articles are stored: every Wikipedia article version since day 1 is available
- Many articles have hundreds, thousands or tens of thousands of revisions
- Most revisions differ only slightly from previous revisions
- Therefore subsequent revisions of an article are concatenated and then compressed, achieving very high compression ratios of up to 100x
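The effect of concatenating near-identical revisions before compressing can be demonstrated with plain gzip; the sample revisions below are invented, and this is not the actual external-storage blob format.

```php
<?php
// Illustration: many near-identical revisions compress dramatically better
// together than one by one, because each revision differs only slightly.
$base = str_repeat('Some article paragraph that barely changes between edits. ', 40);
$revisions = [];
for ($i = 0; $i < 100; $i++) {
    $revisions[] = $base . "Small edit number $i.";
}

$perRevision = 0;
foreach ($revisions as $rev) {
    $perRevision += strlen(gzcompress($rev, 9)); // compress each revision alone
}
$blob = implode("\n", $revisions);
$concatenated = strlen(gzcompress($blob, 9));    // compress the concatenation

printf("raw: %d bytes, per-revision gzip: %d bytes, concatenated gzip: %d bytes\n",
       strlen($blob), $perRevision, $concatenated);
// The concatenated blob ends up a small fraction of the per-revision total.
```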

Media storage
Currently:
- 1 image server containing 1.3 TB of data, 4 million files
- Overloaded serving just 100 requests/s
- Media uploads and deletions over NFS, mounted on all application servers
- Not scalable
- Backups take ~2 weeks

Media storage
- New API between media storage server and application servers, based on HTTP
  - Methods: store, publish, delete and generate thumbnail
- New file / directory layout structure, using content hashes for file names
  - Files with the same name/URL will have the same content, so no invalidation is necessary
- Migration to some distributed, replicated setup
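What content-hash naming buys is sketched below: the storage path is derived from a hash of the file's bytes, so a given path can never silently change content and cached copies never need invalidation. The hash choice, directory depth and helper name are illustrative, not the actual layout.

```php
<?php
// Illustrative content-addressed layout: derive the path from the SHA-1 of the
// file contents (e.g. 3/3d/3da54155....jpg), so identical content always maps
// to the same path and an existing URL never changes meaning.
function contentHashPath(string $fileContents, string $extension): string
{
    $hash = sha1($fileContents);
    return sprintf('%s/%s/%s.%s', $hash[0], substr($hash, 0, 2), $hash, $extension);
}

$bytes = 'placeholder for the uploaded file bytes';
$path  = contentHashPath($bytes, 'jpg');
echo $path, "\n";
// A store/publish call over the new HTTP API would then write to this path.
```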

Thumbnail generation
- 404? Ask application servers to generate the thumbnail
- stat() on each request is too expensive, so assume every file exists
- If a thumbnail doesn't exist, ask the application servers to render it
(Diagram: Clients -> Squid caches -> Image server -> Application servers)
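A minimal sketch of the 404-handler idea: the image server just tries to serve the file, and only when a thumbnail request misses does it ask an application server to render it, store the result, and send it back. The render endpoint, paths and parameters here are hypothetical.

```php
<?php
// Hypothetical image-server 404 handler: a missing thumbnail is rendered on
// demand by an application server, saved locally, then served to the client.
$requestedPath = $_SERVER['REQUEST_URI'] ?? '/thumb/Example.jpg/120px-Example.jpg';

// Placeholder internal endpoint; the real API exposes store/publish/delete/thumbnail.
$renderUrl = 'http://appservers.internal/render-thumb?path=' . urlencode($requestedPath);
$thumbnail = file_get_contents($renderUrl);

if ($thumbnail === false) {
    http_response_code(500); // rendering failed; report an error instead
    exit;
}

// Save it so the next request is a plain static file hit, then serve it.
$localPath = '/srv/thumbs' . $requestedPath;
@mkdir(dirname($localPath), 0755, true);
file_put_contents($localPath, $thumbnail);

header('Content-Type: image/jpeg');
echo $thumbnail;
```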