Schema Design Patterns for a Peta-Scale World
Aaron Kimball, Chief Architect, WibiData
About me
Big Data Applications
- Applications layer: Mobile, Customer Relations, Web Serving, Analytics
- Middle layer: data management, ML, and analytics
- Storage layer: Hadoop & HBase
Major application targets: MapReduce, Cascading, Crunch, Pig
Big Data Apps are hard to build
- Schemas, serialization & versioning
- Distributed deployment
- Communication between teams
- Batch + real-time capacity planning
Kiji is designed to help you build real-time Big Data applications on Apache HBase. 100% Apache 2 licensed.
Kiji architecture
Leading design decisions
- Store your data in HBase
- Encode it using Avro
- An entity-centric table design
- Manage a data dictionary around tables
- Distribute writes across the cluster
Key features
- Work with big data in rich types, with schema evolution
- Guides users to successful schema design
- Deployment of real-time model scoring
Processing big data in real time
Real-time freshening
Producer functions
(Diagram: a producer function operating on one row of the users table)
Producers provide custom per-entity analytics operating on fine-grained data, as sketched below.
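The sketch below illustrates the producer concept only; the Row interface, the Recommender model, and the placement of purchases in the info family are hypothetical stand-ins, not the actual KijiMR producer API (see the Kiji docs for the real one).

  import java.util.List;

  // Hypothetical per-row view of a Kiji table (illustration only).
  interface Row {
    <T> List<T> get(String family, String qualifier);        // all cell versions
    <T> void put(String family, String qualifier, T value);  // write newest version
  }

  // Hypothetical model; stands in for any per-entity scoring function.
  interface Recommender {
    List<String> forPurchases(List<String> purchases);
  }

  // A producer reads fine-grained data from one entity's row and writes
  // a derived column back to that same row.
  class RecommendationsProducer {
    private final Recommender model;
    RecommendationsProducer(Recommender model) { this.model = model; }

    public void produce(Row userRow) {
      List<String> purchases = userRow.get("info", "purchases");
      userRow.put("derived", "recommendations", model.forPurchases(purchases));
    }
  }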
The Kiji data model
Table: users
  row key: User ID
  column families: info, derived
  columns: email, zip, purchases, recommendations
- Kiji tables have a layout (schema) describing their columns and data types; each column can hold integers or strings as well as complex Avro records
- Rows are indexed by a unique key; in this entity-centric model, all data for one entity (e.g., a user) sits in one row
- Columns are organized in a two-level hierarchy (info:email vs. derived:recommendations); data is grouped together logically in column families
- Each cell holds a timeseries; historical values for a given (row, column) can be easily aggregated, iterated, or filtered (see the sketch below)
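Underneath, each Kiji cell maps to versioned HBase cells. A minimal sketch of reading the full timeseries for one (row, column) with the plain HBase client API; the row key and the info:purchases placement are assumptions for illustration, and Kiji's own reader API layers Avro decoding on top of this:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.Cell;
  import org.apache.hadoop.hbase.CellUtil;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimeseriesRead {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table users = conn.getTable(TableName.valueOf("users"))) {
        Get get = new Get(Bytes.toBytes("some-user-id"));
        get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("purchases"));
        get.setMaxVersions();  // fetch every historical version, newest first
        Result result = users.get(get);
        for (Cell cell : result.getColumnCells(
            Bytes.toBytes("info"), Bytes.toBytes("purchases"))) {
          // Each version of the cell is one point in the timeseries.
          long timestamp = cell.getTimestamp();
          byte[] value = CellUtil.cloneValue(cell);
          System.out.println(timestamp + " -> " + value.length + " bytes");
        }
      }
    }
  }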
Why Avro?
- Schema evolution
- Compact representation
- Rich set of types
- Generic API is attractive at the framework level
- Plays nicely with the Hadoop ecosystem (Input/OutputFormats, etc.)
Schema evolution
Initial schema for info:location:
  record Point { float x; float y; }
A later schema adds a field:
  record Point { float x; float y; float z = 0.0; }
Applications expecting either the old or the new data structure can read all records from disk, regardless of which schema was used when the data was written.
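A minimal, self-contained sketch of this resolution using Avro's generic Java API; the Point schemas mirror the slide, while the harness around them is illustrative:

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.Decoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.Encoder;
  import org.apache.avro.io.EncoderFactory;

  public class PointEvolution {
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Point\",\"fields\":["
        + "{\"name\":\"x\",\"type\":\"float\"},{\"name\":\"y\",\"type\":\"float\"}]}");
    static final Schema V2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Point\",\"fields\":["
        + "{\"name\":\"x\",\"type\":\"float\"},{\"name\":\"y\",\"type\":\"float\"},"
        + "{\"name\":\"z\",\"type\":\"float\",\"default\":0.0}]}");

    public static void main(String[] args) throws Exception {
      // Write a Point with the old (v1) writer schema.
      GenericRecord p = new GenericData.Record(V1);
      p.put("x", 1.0f);
      p.put("y", 2.0f);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(V1).write(p, encoder);
      encoder.flush();

      // Read it back with the new (v2) reader schema: Avro's schema
      // resolution fills in the missing field from its default, z = 0.0.
      Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
      GenericRecord q = new GenericDatumReader<GenericRecord>(V1, V2).read(null, decoder);
      System.out.println(q);  // {"x": 1.0, "y": 2.0, "z": 0.0}
    }
  }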
Efficient I/O with Avro
- Each row (entity) stores many complex-valued cells
- Data logically read and written together is physically stored in the same cell
- Sparse column storage allows fine-grained access to disparate elements of an entity for complex model building
Schema upgrade concerns
- Big data repositories are accessed by many versions of different apps concurrently
- Apps each evolve their local schema definitions at different rates
- "Flag day" approaches, where every app switches to a new schema at once, do not work in practice
- Avro itself admits incompatible schema migrations
Incompatible migration example
  #1: record R { int x; string y; }
  #2: record R { int x; }
  #3: record R { int x; int y = 0; }
Schema #3 can read data written with #2, but not data written with #1: the field y changed type from string to int. Detecting this incompatibility requires a history of all schemas previously used, not just the latest one.
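Avro ships a SchemaCompatibility utility that can mechanize this check. A sketch of validating a proposed reader schema against every schema ever used to write a column; the helper itself is illustrative, not Kiji's actual validation code:

  import java.util.List;
  import org.apache.avro.Schema;
  import org.apache.avro.SchemaCompatibility;
  import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

  public final class CompatCheck {
    // A new reader schema is safe only if it can decode data written with
    // *every* schema ever registered for the column, not just the latest.
    public static boolean safeToAddReader(Schema newReader, List<Schema> allWriterSchemas) {
      for (Schema writer : allWriterSchemas) {
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(newReader, writer);
        if (result.getType() != SchemaCompatibilityType.COMPATIBLE) {
          return false;  // e.g., schema #3 cannot read #1's string y as an int
        }
      }
      return true;
    }
  }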
Poly-schema key-value storage
- Each column records lists of all current and prior schemas
- Separate lists for reader and writer schemas
- DDL for registering and unregistering them:
    ALTER TABLE t ADD READER SCHEMA s...
    ALTER TABLE t DROP READER SCHEMA s2...
- ZooKeeper is used to check for active clients
Conclusions
- Store data in an entity-centric fashion
- Hash-prefix row keys to avoid hot spots (see the sketch below)
- Use complex records in each column
- Granularity of records is an app-specific tradeoff
- Each column is poly-schema: several potential views at a time!
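A sketch of the hash-prefixing idea; the exact byte layout (MD5, 2-byte prefix) is an assumption for illustration, and Kiji can manage hash-prefixed keys for you:

  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  public final class RowKeys {
    // Prepend two hash bytes so lexicographically adjacent entity IDs land
    // on different regions, spreading writes across the cluster while the
    // original ID is preserved at the end of the key for point lookups.
    public static byte[] hashPrefixed(String entityId) throws NoSuchAlgorithmException {
      byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
      byte[] hash = MessageDigest.getInstance("MD5").digest(id);
      ByteBuffer key = ByteBuffer.allocate(2 + id.length);
      key.put(hash, 0, 2);  // 2-byte hash prefix (prefix size is a tunable choice)
      key.put(id);
      return key.array();
    }
  }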
Try Kiji today!
- Go to kiji.org and download the BentoBox: a zero-config Hadoop + HBase + Kiji instance, batteries included
- 15-minute quickstart guide and a tutorial with full source code
aaron@wibidata.com