Real-time reporting at 10,000 inserts per second. Wesley Biggs CTO 25 October 2011 Percona Live

Agenda
1. Who we are, what we do, and (maybe) why we do it
2. Solution architecture and evolution
3. Top 5 tips and tricks
4. The next 10,000 and beyond

What we do
- Adfonic is a London-based startup launched in 2009.
- We power display advertising on all types of mobile devices.
- More than 8,000 mobile web and app publishers
- Over 3,000 ad campaigns each month
- > 15 billion user interactions per month
- Peak load, October 2011: > 10,000 event insertions per second
- ACID requirements: events must all be counted toward revenue (CPC and CPM ad campaigns)
- Self-service management and reporting tools with real-time data

Who am I
- Co-founder and CTO
- Software Architect
- Systems Architect
- Java Developer
- IT procurement
- Tech hiring and HR
- R&D lead
- Makes a mean cup of tea
- (ahem) DBA?

Our toolbox
We made a decision early on to use best-of-breed free and open source technologies where possible.
- CentOS Linux
- Java 6
- MySQL (currently 5.1)
- Apache ActiveMQ
- Pentaho Mondrian ROLAP
Follow the KISS principle.
- Remember, "for every problem there exists a solution which is both simple and wrong."
- Avoid technology sprawl: 5 tools that each produce a 10% gain are rarely worth putting together.

The vision
Record events from the adserver (requests, impressions, clicks) and make them available for real-time viewing via the web site, broken down by device, location, mobile operator, etc.
Real-time reporting means data is available to be queried via the web-based self-service interface within seconds.
Adserver application -> Database -> Web application
If only it were that simple!

Hidden requirements
- Auditability: keep 3 months of the full data set around (already in the 6+ TB range)
- Required aggregation: it quickly became apparent that we couldn't query the whole raw data set
- Sharding: separate reporting data for internal vs. external use
- Metadata replication: must be able to join against names, etc. at the reporting tier
- 24/7 availability, performance and all that rot

The reality
- Lots of endpoint nodes (in multiple geographic locations)
- Post-processing of events to help latency on the endpoints
- MySQL Cluster for write scalability and H/A
- Query performance via aggregate tables and ROLAP
Adserver application -> Message queue -> Event postprocessor -> MySQL Cluster -> Aggregation tier -> Reporting database -> Web application

Relational OLAP and Mondrian
- Open source tool
- Uses the MDX language
- No ETL required for data warehouse (vs. BusinessObjects)
- Looks for very particularly named aggregate tables
Drawbacks:
- Slow startup: performs count(*) on each aggregate table
  Workaround: hacked Mondrian to call an approx_count() function that uses a lookaside table in MySQL
- Issues complex queries that can tax the query planner
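
The lookaside workaround might look roughly like the sketch below. The talk does not show the function, so the table name AGG_ROW_ESTIMATE, its columns, the approx_count() signature, and the idea that the aggregation job refreshes the estimate are all assumptions made for illustration.

```sql
-- Hypothetical lookaside table: one row per aggregate table.
CREATE TABLE AGG_ROW_ESTIMATE (
  TABLE_NAME   VARCHAR(64) NOT NULL,
  ROW_ESTIMATE BIGINT      NOT NULL,
  PRIMARY KEY (TABLE_NAME)
) ENGINE=InnoDB;

DELIMITER //
-- Hypothetical signature. Returns NULL when no estimate has been recorded.
CREATE FUNCTION approx_count(p_table_name VARCHAR(64))
RETURNS BIGINT
READS SQL DATA
BEGIN
  RETURN (SELECT ROW_ESTIMATE
            FROM AGG_ROW_ESTIMATE
           WHERE TABLE_NAME = p_table_name);
END//
DELIMITER ;

-- Assumption: whatever writes the aggregate table also refreshes its estimate.
INSERT INTO AGG_ROW_ESTIMATE (TABLE_NAME, ROW_ESTIMATE)
VALUES ('agg_example', 1000)
ON DUPLICATE KEY UPDATE ROW_ESTIMATE = VALUES(ROW_ESTIMATE);

-- A primary-key lookup replaces a full count(*) at startup.
SELECT approx_count('agg_example');
```

The point of the design is that the startup cost becomes one primary-key read per aggregate table instead of one scan per table. An estimate is enough if the count is only used for choosing between aggregates rather than for reporting.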

Tip 0: Use replication. Also, use replication.
- It's seriously fast, even as a data copy mechanism
- Use row-based replication (5.1+)
- The replication chain can span multiple engine types and performance configurations:
  Metadata tier (InnoDB, SSD)
  Event tier (NDB)
  Aggregation tier (InnoDB, SSD)
  Reporting tier (InnoDB, fixed disk)

Tip 1: Use hash indexes on NDB
- Factor of 10x or more write performance on cluster

    CREATE TABLE AD_EVENT_LOG (
      ID BIGINT NOT NULL,
      ...,
      PRIMARY KEY (ID) USING HASH
    ) ENGINE=ndbcluster;

- You can still use autoincrement (MySQL Cluster handles this well and lets you control contiguous ID blocks)
- Replication slaves using InnoDB can revert to BTree
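
To make the last two points concrete, here is a minimal sketch of the same table defined differently on two tiers. Every column other than ID is hypothetical, and tying the "contiguous ID blocks" remark to the ndb_autoincrement_prefetch_sz variable is an assumption about what the talk meant.

```sql
-- On the event tier (MySQL Cluster). A hash-only primary key skips the
-- ordered index, so lookups by ID work but range scans on ID do not.
CREATE TABLE AD_EVENT_LOG (
  ID         BIGINT   NOT NULL AUTO_INCREMENT,
  EVENT_TIME DATETIME NOT NULL,              -- hypothetical column
  PRIMARY KEY (ID) USING HASH
) ENGINE=ndbcluster;

-- Assumption: each SQL node reserves auto-increment values in blocks of this size.
SET SESSION ndb_autoincrement_prefetch_sz = 256;

-- On an aggregation or reporting slave (a different server, same table name).
-- Row-based replication applies rows by column, so the engine and indexes may differ.
CREATE TABLE AD_EVENT_LOG (
  ID         BIGINT   NOT NULL,
  EVENT_TIME DATETIME NOT NULL,              -- hypothetical column
  PRIMARY KEY (ID),                          -- InnoDB primary keys are B-tree
  KEY IX_EVENT_TIME (EVENT_TIME)             -- read-side index, slave only
) ENGINE=InnoDB;
```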

Tip 2: Use events, not cron jobs
- Lifecycle tied to the MySQL server
- Works just like a stored procedure
- Easy management (ALTER EVENT foo [DISABLE | ENABLE])
- Write events with global mutexes and a 1s repeat to accomplish constantly running processes, e.g. aggregation

    CREATE EVENT E_RUN_AGGREGATES
    ON SCHEDULE EVERY 1 SECOND
    DO BEGIN
      IF GET_LOCK('ADFONIC_AGG_LOCK', 1) THEN
        -- do some work and know we're alone
        SELECT RELEASE_LOCK('ADFONIC_AGG_LOCK');
      END IF;
    END
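
A few companion statements for running such an event day to day, as a sketch. The event and lock names are the ones from the slide, and the rest is standard MySQL 5.1 syntax.

```sql
-- Events only fire while the scheduler thread is running.
SET GLOBAL event_scheduler = ON;

-- Pause and resume the aggregation loop without dropping it.
ALTER EVENT E_RUN_AGGREGATES DISABLE;
ALTER EVENT E_RUN_AGGREGATES ENABLE;

-- Inspect the schedule, status, and last execution time.
SELECT EVENT_NAME, STATUS, INTERVAL_VALUE, INTERVAL_FIELD, LAST_EXECUTED
  FROM INFORMATION_SCHEMA.EVENTS
 WHERE EVENT_SCHEMA = DATABASE();

-- From another session: 1 if the mutex is free, 0 if a run currently holds it.
SELECT IS_FREE_LOCK('ADFONIC_AGG_LOCK');
```

The named lock matters because the scheduler starts a new invocation every second regardless of whether the previous one has finished. Without it, a slow aggregation pass would overlap with the next one.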

Tip 3: Use the ENUM datatype
- Enormous data savings over CHAR or VARCHAR
- Stored as an integer, so easily indexed
- No impact on client libraries (these still assume VARCHAR)
- Don't try this at home: InnoDB DDL files can be edited if you ever need to associate existing ENUM values with different names

    AD_ACTION ENUM('UNFILLED_REQUEST','IMPRESSION','CLICK','INSTALL','AD_SERVED','CONVERSION','BID_FAILED') NOT NULL
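
A small sketch of what "stored as an integer" means in practice. An ENUM with up to 255 members occupies 1 byte per row, whereas 'UNFILLED_REQUEST' in a single-byte VARCHAR needs 16 bytes plus a length byte. The table name ENUM_DEMO and the added value 'VIDEO_VIEW' are hypothetical.

```sql
CREATE TABLE ENUM_DEMO (
  AD_ACTION ENUM('UNFILLED_REQUEST','IMPRESSION','CLICK','INSTALL',
                 'AD_SERVED','CONVERSION','BID_FAILED') NOT NULL,
  KEY IX_AD_ACTION (AD_ACTION)
) ENGINE=InnoDB;

-- Clients read and write the names as ordinary strings.
INSERT INTO ENUM_DEMO (AD_ACTION) VALUES ('IMPRESSION'), ('CLICK');

-- Adding 0 exposes the stored index (members are numbered from 1 in list order).
SELECT AD_ACTION, AD_ACTION + 0 AS STORED_INDEX FROM ENUM_DEMO;

-- Appending a member at the end leaves the existing indexes unchanged.
ALTER TABLE ENUM_DEMO MODIFY AD_ACTION
  ENUM('UNFILLED_REQUEST','IMPRESSION','CLICK','INSTALL',
       'AD_SERVED','CONVERSION','BID_FAILED','VIDEO_VIEW') NOT NULL;
```

Inserting a member anywhere but the end renumbers the ones after it, so new values are best appended.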

Tip 4: Use sql_log_bin to diversify
Turning off binary logging at the right time allows the preservation of different configurations at different stages of replication.
- Separate data replication from access patterns
- Control partitioning on a per-tier basis
- Manage indexes for different performance criteria

    CREATE PROCEDURE update_daily_partitions()
    BEGIN
      SET sql_log_bin=0;
      ALTER TABLE foo DROP PARTITION yesterday;
      -- partition bound elided on the slide; RANGE partitioning assumed
      ALTER TABLE foo ADD PARTITION (PARTITION tomorrow VALUES LESS THAN (...));
      SET sql_log_bin=1;
    END
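
The same switch covers the index case. The sketch below assumes a reporting-tier server that writes its own binary log for further slaves; the table REPORT_EVENT and its columns are hypothetical.

```sql
-- Run on the reporting-tier server only. Requires the SUPER privilege.
-- The setting is per session, so other connections keep logging normally.
SET sql_log_bin = 0;

-- This index exists here and nowhere downstream, because the DDL is never logged.
ALTER TABLE REPORT_EVENT
  ADD INDEX IX_PUBLISHER_DAY (PUBLISHER_ID, EVENT_DAY);

SET sql_log_bin = 1;
```

Data changes still flow down the chain as usual. Only the statements issued between the two SET calls stay local, which is what lets each tier keep its own partitions and indexes.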

Tip 5: Avoid INSERT ... SELECT with replication
- Aggregation use case: read from replicated event table, write to aggregate tables

    INSERT INTO foo SELECT ... FROM bar [ON DUPLICATE KEY UPDATE ...]

- Great for compact SQL, but it locks bar as well as foo unless you can live with READ COMMITTED isolation.
- In our case we can't, so replication is blocked during the SELECT.
- Instead, replace with cursors over the selects (work in progress).
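
The talk describes the cursor replacement as work in progress and does not show it, so the following is only one possible shape. It assumes bar has ID, EVENT_TIME and AD_ACTION columns, that foo has a unique key on (EVENT_HOUR, AD_ACTION), and that the caller tracks which ID window has already been aggregated. All of these names are hypothetical.

```sql
DELIMITER //
CREATE PROCEDURE aggregate_events_by_cursor(IN p_from_id BIGINT, IN p_to_id BIGINT)
BEGIN
  DECLARE v_done   INT DEFAULT 0;
  DECLARE v_hour   DATETIME;
  DECLARE v_action VARCHAR(32);
  DECLARE v_count  BIGINT;

  -- A plain SELECT is a consistent non-locking read in InnoDB, so the
  -- replication thread can keep inserting into bar while this runs.
  DECLARE cur CURSOR FOR
    SELECT DATE_FORMAT(EVENT_TIME, '%Y-%m-%d %H:00:00') AS EVENT_HOUR,
           AD_ACTION,
           COUNT(*)
      FROM bar
     WHERE ID > p_from_id AND ID <= p_to_id
     GROUP BY EVENT_HOUR, AD_ACTION;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done = 1;

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_hour, v_action, v_count;
    IF v_done THEN
      LEAVE read_loop;
    END IF;
    -- Only foo is locked here, one aggregate row at a time.
    INSERT INTO foo (EVENT_HOUR, AD_ACTION, EVENT_COUNT)
    VALUES (v_hour, v_action, v_count)
    ON DUPLICATE KEY UPDATE EVENT_COUNT = EVENT_COUNT + v_count;
  END LOOP;
  CLOSE cur;
END//
DELIMITER ;

-- Example call for one window of event IDs.
CALL aggregate_events_by_cursor(0, 100000);
```

The trade-off is more round trips inside the server in exchange for never holding shared locks on the replicated table.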

Tip 6: Normalize event-stream data
- Normalized data keeps the event-level data set small
- BUT read-compare-update-insert semantics are slow
- Example: user-agent strings are large and unwieldy, but constantly changing
- Solution: treat MySQL Cluster as "write only"; the only reads are via replication
- Implication: the post-processor must retain the foreign key database in memory and know when to insert. Use the message queue to synchronize multiple post-processors.
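
As a rough illustration of the write-only pattern, the statements below show what a post-processor might send to the cluster. The schema, the table names, the sample user-agent string, and the idea that the post-processor assigns lookup IDs itself are all assumptions; the talk only states that the key mapping lives in the post-processor's memory.

```sql
-- Hypothetical lookup table. IDs come from the post-processor's in-memory map.
CREATE TABLE USER_AGENT (
  ID        BIGINT       NOT NULL,
  UA_STRING VARCHAR(512) NOT NULL,
  PRIMARY KEY (ID) USING HASH
) ENGINE=ndbcluster;

-- Hypothetical normalized event table: 8 bytes per event instead of the full string.
CREATE TABLE AD_EVENT_NORMALIZED (
  ID            BIGINT NOT NULL AUTO_INCREMENT,
  USER_AGENT_ID BIGINT NOT NULL,
  AD_ACTION     ENUM('IMPRESSION','CLICK') NOT NULL,
  PRIMARY KEY (ID) USING HASH
) ENGINE=ndbcluster;

-- First sighting: the in-memory map has no entry, so write the lookup row, then the event.
INSERT INTO USER_AGENT (ID, UA_STRING)
VALUES (1001, 'ExampleBrowser/1.0 (hypothetical)');
INSERT INTO AD_EVENT_NORMALIZED (USER_AGENT_ID, AD_ACTION)
VALUES (1001, 'IMPRESSION');

-- Later sightings: map hit, one insert, and no SELECT against the cluster.
INSERT INTO AD_EVENT_NORMALIZED (USER_AGENT_ID, AD_ACTION)
VALUES (1001, 'CLICK');
```

There is deliberately no unique index on UA_STRING here. Deduplication is the post-processor's job, which is why multiple post-processors have to agree on new IDs through the message queue.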

Wish list/challenges/needs
- Canonical (& fast) estimated row count for InnoDB
- Improve DDL replication between different storage engines
- Transactional TRUNCATE TABLE for use with aggregation
- Easier multithreaded transactions (aggregation again)
- HOWTO for HA replication to cluster
- Smarter locking on INSERT ... SELECT
- Pinnable query plans (or at least more transparency)
- Easy offlining of date-based partitions (hot backup may address)

The next 10,000 (and beyond)
- Sharding and horizontal scaling are ultimately the only tools at your disposal
- Scale the cluster horizontally: just add hardware
- Parallelize aggregation: multiple concurrent insert/selects for performance
- Sharded replication to customer-centric reporting databases: requires some duplication of data (advertiser x publisher)
- Federation/spider topology
- Early vs. late aggregation (not all needs to be real-time)
- More hacking on Mondrian to make it MySQL-friendly: forced indexes, etc.

Thank you. Questions?
Contact me: Wes Biggs
Email: wes.biggs@adfonic.com
Twitter: @wbiggs
We're hiring! London-based DBA position available immediately. See me for details or visit http://adfonic.com
Email: jobs@adfonic.com