The Game of Big Data! Analytics Infrastructure at KIXEYE

Similar documents
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Data Pipeline with Kafka

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Best Practices for Hadoop Data Analysis with Tableau

Big Data Analytics Nokia

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Big Data Analytics - Accelerated. stream-horizon.com

Replicating to everything

Putting Apache Kafka to Use!

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

Big Data Analytics Roadmap Energy Industry

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Luncheon Webinar Series May 13, 2013

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Why Big Data in the Cloud?

Real Time Big Data Processing

Oracle Big Data SQL Technical Update

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Big Data Patterns. Ron Bodkin Founder and President, Think Big

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Unlock your data for fast insights: dimensionless modeling with in-memory column store. By Vadim Orlov

Real-Time Data Access Using Restful Framework for Multi-Platform Data Warehouse Environment

Sentimental Analysis using Hadoop Phase 2: Week 2

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

KNIME & Avira, or how I ve learned to love Big Data

INTRODUCING DRUID: FAST AD-HOC QUERIES ON BIG DATA MICHAEL DRISCOLL - CEO ERIC TSCHETTER - LEAD METAMARKETS

How To Use Data For Abstract Reasoning

Analytics on Spark &

Big Data? Definition # 1: Big Data Definition Forrester Research

Crack Open Your Operational Database. Jamie Martin September 24th, 2013

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Real-Time Data Analytics and Visualization

Making Sense of Big Data in Insurance

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Apache Kylin Introduction Dec 8,

VIEWPOINT. High Performance Analytics. Industry Context and Trends

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Data Integration Checklist

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

Big Data With Hadoop

Hadoop & Spark Using Amazon EMR

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

Understanding the Value of In-Memory in the IT Landscape

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

STREAM ANALYTIX. Industry s only Multi-Engine Streaming Analytics Platform

AllegroGraph. a graph database. Gary King gwking@franz.com

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

the missing log collector Treasure Data, Inc. Muga Nishizawa

REAL-TIME BIG DATA ANALYTICS

Using distributed technologies to analyze Big Data

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Next-Generation Cloud Analytics with Amazon Redshift

Cloud Big Data Architectures

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

Big Data & Cloud Computing. Faysal Shaarani

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Integrating VoltDB with Hadoop

Evaluation of NoSQL databases for large-scale decentralized microblogging

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Big Data Pipeline and Analytics Platform

Massive Cloud Auditing using Data Mining on Hadoop

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Data Warehouse 2.0 How Hive & the Emerging Interactive Query Engines Change the Game Forever. David P. Mariani AtScale, Inc. September 16, 2013

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Optimized for the Industrial Internet: GE s Industrial Data Lake Platform

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Moving From Hadoop to Spark

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Chase Wu New Jersey Ins0tute of Technology

Hadoop & its Usage at Facebook

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Big Data & Netflix. Paul Ellwood February 9th, 2015

Traditional BI vs. Business Data Lake A comparison

Trafodion Operational SQL-on-Hadoop

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Transcription:

The Game of Big Data! Analytics Infrastructure at KIXEYE Randy Shoup @randyshoup linkedin.com/in/randyshoup QCon New York, June 13 2014

Free-to-Play Real-time Strategy Games Web and mobile Strategy and tactics Really real-time J Deep relationships with players Constantly evolving gameplay, feature set, economy, balance >500 employees worldwide

Intro: Analytics at KIXEYE User Acquisition Game Analytics Retention and Monetization Analytic Requirements

User Acquisition Goal: ELTV > acquisition cost User s estimated lifetime value is more than it costs to acquire that user Mechanisms Publisher Campaigns On-Platform Recommendations

Game Analytics Goal: Measure and Optimize Fun Difficult to define Includes gameplay, feature set, performance, bugs All metrics are just proxies for fun (!) Mechanisms Game balance Match balance Economy management Player typology

Retention and Monetization Goal: Sustainable Business Monetization drivers Revenue recognition Mechanisms Pricing and Bundling Tournament ( Event ) Design Recommendations

Analytic Requirements Data Integrity and Availability Cohorting Controlled experiments Deep ad-hoc analysis

Deep Thought V1 Analytic System Goals Core Capabilities Implementation

Deep Thought V1 Analytic System Goals Core Capabilities Implementation

V1 Analytic System Grew Organically Built originally for user acquisition Progressively grown to much more Idiosyncratic mix of languages, systems, tools Log files -> Chukwa -> Hadoop -> Hive -> MySQL PHP for reports and ETL Single massive table with everything

V1 Analytic System Many Issues Very slow to query No data standardization or validation Very difficult to add a new game, report, ETL Extremely difficult to backfill on error or outage Difficult for analysts to use; impossible for PMs, designers, etc. but we survived (!)

Deep Thought V1 Analytic System Goals Core Capabilities Implementation

Goals of Deep Thought Independent Scalability Logically separate, independently scalable tiers Stability and Outage Recovery Tiers can completely fail with no data loss Every step idempotent and replayable Standardization Standardized event types, fields, queries, reports

Goals of Deep Thought In-Stream Event Processing Sessionalization, Dimensionalization, Cohorting Queryability Structures are simple to reason about Simple things are simple Analysts, Data Scientists, PMs, Game Designers, etc. Extensibility Easy to add new games, events, fields, reports

Deep Thought V1 Analytic System Goals Core Capabilities Implementation

Core Capabilities Sessionalization Dimensionalization Cohorting

Sessionalization All events are part of a session Explicit start event, optional stop event Game-defined semantics Event Batching Events arrive in batch, associated with session Pipeline computes batch-level metrics, disaggregates events Can optionally attach batch-level metrics to each event

Sessionalization Time-Series Aggregations Configurable metrics 1-day X, 7-day X, lifetime X Total attacks, total time played Accumulated in-stream V1 aggregate + batch delta Faster to calculate in-stream vs. Map-Reduce

Dimensionalization Pipeline assigns unique numeric id to string enums E.g., twigs resource id 1234 Automatic mapping and assignment Games log strings Pipeline generates and maps ids No configuration necessary Fast dimensional queries Join on integers, not strings

Dimensionalization Metadata enumeration and manipulation Easily enumerate all values for a field Merge multiple values TWIGS == Twigs == twigs Metadata tagging Can assign arbitrary tags to metadata E.g., Panzer 05 is {tank, mechanized infantry, event prize} Enables custom views

Cohorting Group players along any dimension / metric Well beyond classic age-based cohorts Core analytical building block Experiment groups User acquisition campaign tracking Prospective modeling Retrospective analysis

Cohorting Set-based Overlapping groups: >100, >200, etc. Exclusive groups: (100-200), (200-500), etc. Time-based E.g., people who played in last 3 days E.g., whale == ($$ > X) in last N days Autoexpire from a group without explicit intervention

Deep Thought V1 Analytic System Goals Core Capabilities Implementation

Implementation of Pipeline Ingestion Event Log Transformation Data Storage Analysis and Visualization Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Ingestion: Logging Service HTTP / JSON Endpoint Play framework Non-blocking, event-driven Responsibilities Message integrity via checksums Durability via local disk persistence Async batch writes to Kafka topics {valid, invalid, unauth} Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Event Log: Kafka Persistent, replayable pipe of events Events stored for 7 days Responsibilities Durability via replication and local disk streaming Replayability via commit log Scalability via partitioned brokers Segment data for different types of processing Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Transformation: Importer Consume Kafka topics, rebroadcast E.g., consume batches, rebroadcast events Responsibilities Batch validation against JSON schema Syntactic validation Semantic validation (is this event possible?) Batches -> events Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Transformation: Importer Responsibilities (cont.) Sessionalization Assign event to session Calculate time-series aggregates Dimensionalization String enum -> numeric id Merge / coalesce different string representations into single id Player metadata Join player metadata from session store Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Transformation: Importer Responsibilities (cont.) Cohorting Process enter-cohort, exit-cohort events Process A / B testing events Evaluate cohort rules (e.g., spend thresholds) Decorate events with cohort tags Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Transformation: Session Store Key-value store (Couchbase) Fast, constant-time access to sessions, players Responsibilities Store Sessions, Players, Dimensions, Config Lookup Idempotent update Store accumulated session-level metrics Store player history Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Storage: Hadoop 2 Camus MR Kafka -> HDFS every 3 minutes append_events table Append-only log of events Each event has session-version for deduplication Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Storage: Hadoop 2 append_events -> base_events MR Logical update of base_events Update events with new metadata Swap old partition for new partition Replayable from beginning without duplication Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Storage: Hadoop 2 base_events table Denormalized table of all events Stores original JSON + decoration Custom Serdes to query / extract JSON fields without materializing entire rows Standardized event types lots of functionality for free Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Analysis and Visualization Hive Warehouse Normalized event-specific, gamespecific stores Aggregate metric data for reporting, analysis Maintained through custom ETL MR Hive queries Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Analysis and Visualization Amazon Redshift Fast ad-hoc querying Tableau Simple, powerful reporting Logging Service Ka.a Importer / Session Store Hadoop 2 Hive / Redshi:

Come Join Us! KIXEYE is hiring in SF, Seattle, Victoria, Brisbane, Amsterdam rshoup@kixeye.com @randyshoup Deep Thought Team: Mark Weaver Josh McDonald Ben Speakmon Snehal Nagmote Mark Roberts Kevin Lee Woo Chan Kim Tay Carpenter Tim Ellis Kazue Watanabe Erica Chan Jessica Cox Casey DeWitt Steve Morin Lih Chen Neha Kumari