Crack Open Your Operational Database
Jamie Martin, jameison.martin@salesforce.com
September 24th, 2013




Analytics on Operational Data
- Most analytics are derived from operational data
- Two canonical approaches:
  - in-situ: run analytics on the operational store
  - ex-situ: move the data (ETL) to an optimized store
- Crack open the operational database: direct external access into the live OLTP database

In-Situ Analytics on Operational Data is Limited
- OLTP databases are not built for analytics
  - built for short transactions and simple queries over modest data sets
  - limited query expressiveness
  - storage impedance mismatch (e.g. row vs. column)
  - hybrids exist, but cannot bridge the gap across all dimensions
- Operational databases are usually resource constrained
  - limited CPU, cache, and IOPS for analytics
  - long-running queries cause lock conflicts or MVCC inefficiencies
- Run analytics on an optimized analytics engine instead:
  - optimized columnar stores
  - massively scalable compute engines
  - fast aggregation engines (OLAP)

Ex-Situ Analytics is an ETL Nightmare
- Get a snapshot of the operational store
- Run analytics somewhere else
  - use some other compute, perhaps with more optimal storage
- Capture ongoing changes from the OLTP engine
  - must be non-disruptive
  - usually needs to be transactionally consistent
  - often need to keep the analytics live
- In practice this is really, really painful
  - ETL nightmare: expensive, rigid, slow, fragile
  - data governance and provenance problems

Hadoop: Unconstrained Data Access
- Open data ecosystem
  - distributed storage: HDFS
  - distributed compute: MapReduce (YARN)
  - really interesting data access possibilities
- Unconstrained access to data
  - data files are all out there in the wild on HDFS
  - storage formats are typically public (implement an M/R InputFormat/OutputFormat; see the sketch below)
  - the ecosystem encourages integration (can run M/R directly on HBase HFiles)
  - this is very different than your typical DBMS
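The "public storage formats" point is easy to show in code. Below is a minimal sketch assuming the standard Hadoop MapReduce API; the job name, input/output paths, and the record-counting mapper are illustrative placeholders rather than anything from the talk. The idea is simply that any compute job can read files straight off HDFS through whatever public InputFormat describes their layout, with no database engine mediating access.

// Minimal sketch: any data whose layout is described by a public InputFormat
// can be read directly from HDFS by an arbitrary MapReduce job.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DirectHdfsScan {

  // Mapper that just counts records; in practice this is where the analytics
  // logic over the raw files would go.
  public static class CountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text KEY = new Text("records");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(KEY, ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "direct-hdfs-scan");
    job.setJarByClass(DirectHdfsScan.class);
    job.setMapperClass(CountMapper.class);
    // Any public InputFormat works here (e.g. one that understands the
    // store's file layout); TextInputFormat is just the simplest example.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // data "in the wild" on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}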

OLTP-Enabled Analytics: Snapshots
- Goal: take a snapshot directly against an active database
- Getting to the bytes is complicated:
  - understand semantics of data - columns, datatypes for table T
  - logical-to-physical mapping - where is the data
  - physical consistency - coordination with writers
  - transactional consistency
  - persistence formats - understanding data layout (rows, columns, etc.)
- Traditional database is a black box (see the JDBC sketch below):
  - contents of table: system catalogs, JDBC metadata, etc.
  - where is the data: table spaces -> dbs -> partitions -> ... -> extents -> pages
  - physical consistency: in-memory latches, pinning
  - transactional consistency: in-memory lock tables, MVCC information
  - persistence formats: proprietary
  (slide diagram labels: data, queries, results)
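For contrast, here is a minimal sketch of how far the standard interfaces get you with a traditional DBMS; the JDBC URL, credentials, and table name are placeholders. JDBC surfaces the logical catalog for table T, but none of the physical facts listed above.

// Minimal sketch: JDBC exposes the *logical* schema of table T, but nothing
// about where the bytes live, how they are laid out, or how to read them
// consistently. Connection details below are placeholders.
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class BlackBoxMetadata {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pw")) {
      DatabaseMetaData md = conn.getMetaData();
      // Logical semantics: column names and datatypes for table T.
      try (ResultSet cols = md.getColumns(null, null, "T", "%")) {
        while (cols.next()) {
          System.out.println(cols.getString("COLUMN_NAME")
              + " : " + cols.getString("TYPE_NAME"));
        }
      }
      // There is no equivalent call for the physical side: tablespaces, pages,
      // latches, lock tables, and on-disk formats stay inside the engine.
    }
  }
}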

Direct Snapshots
- An approach to direct external access (see the interface sketch below)
  - logical-to-physical data mapping externalized through a public catalog service
    - find the specific persistent artifacts that contain the desired data
    - DBMS abdicates space management
  - physical consistency without latching
    - immutable storage (not ARIES): anyone can read the persisted data w/o coordination
  - transactional consistency through MVCC
    - records contain transaction information
    - consistent point in time via filters on data (not PITR)
  - published persistence formats
- These are the same techniques needed to scale up and out
  - MVCC & immutable data to scale up
  - cross-node catalog describing persistence to scale out
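A hypothetical interface sketch of the pieces just described; none of these type names come from the talk or from an actual Salesforce API. The catalog externalizes the logical-to-physical mapping as a list of immutable artifacts in published formats, and MVCC metadata carried in the records yields the filter for a consistent point in time.

// Hypothetical sketch of the externalized catalog and MVCC filter.
import java.net.URI;
import java.util.List;

public interface SnapshotCatalog {

  /** Immutable persistent artifact (e.g. a file on HDFS) in a published format. */
  interface Artifact {
    URI location();          // where the bytes live; readable without coordination
    String format();         // published persistence format identifier
  }

  /** Transaction-visibility filter derived from MVCC information in the records. */
  interface VisibilityFilter {
    boolean isVisible(long recordTxnId);   // committed before the snapshot point?
  }

  /** Artifacts that may contain data for the given table. */
  List<Artifact> artifactsFor(String tableName);

  /** MVCC filter pinning a consistent point in time over those artifacts. */
  VisibilityFilter asOf(long snapshotTxnId);
}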

Taking a Snapshot
- Snapshot acquisition (see the code sketch below)
  - obtain a snapshot for table T
    - locate immutable artifacts that may have data for table T
    - register interest in them as of a point in time (MVCC)
    - get a consistent snapshot
  - access the data directly, with impunity
    - direct analytics, e.g. M/R on OLTP data
    - dump into a secondary system for subsequent analytics
  - release the snapshot
- Consistency without fine-grained coordination
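Building on the hypothetical SnapshotCatalog sketch above, the acquisition protocol reads roughly as follows. The registerInterest/releaseInterest and runAnalytics helpers stand in for whatever mechanism keeps old versions from being reclaimed while the snapshot is in use and for the downstream compute; they are assumptions for illustration, not a documented API.

// Hypothetical sketch of snapshot acquisition against the catalog above.
import java.util.List;

public class SnapshotClient {
  private final SnapshotCatalog catalog;

  public SnapshotClient(SnapshotCatalog catalog) {
    this.catalog = catalog;
  }

  public void analyze(String table, long snapshotTxnId) {
    // 1. Locate the immutable artifacts that may hold data for the table.
    List<SnapshotCatalog.Artifact> artifacts = catalog.artifactsFor(table);

    // 2. Register interest in them as of a point in time (MVCC).
    registerInterest(artifacts, snapshotTxnId);
    try {
      // 3. Read the bytes directly (e.g. hand the locations to an M/R job),
      //    filtering records with the MVCC visibility filter.
      SnapshotCatalog.VisibilityFilter filter = catalog.asOf(snapshotTxnId);
      runAnalytics(artifacts, filter);
    } finally {
      // 4. Release the snapshot so the engine can reclaim old versions.
      releaseInterest(artifacts, snapshotTxnId);
    }
  }

  private void registerInterest(List<SnapshotCatalog.Artifact> a, long txn) { /* placeholder */ }
  private void releaseInterest(List<SnapshotCatalog.Artifact> a, long txn) { /* placeholder */ }
  private void runAnalytics(List<SnapshotCatalog.Artifact> a,
                            SnapshotCatalog.VisibilityFilter f) { /* placeholder */ }
}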

Change Detection
- Allow direct external access to the OLTP transaction log
- Externalized access
  - transaction log as an externally meaningful data stream
  - track transaction logs in the external catalog
  - physical consistency - logs are already append-only/immutable
  - transactional consistency - tie data MVCC to log records
  - published log formats
- Models
  - pull: read log chunks as needed and apply them to snapshots (see the sketch below)
  - push: publish log records on a data bus; enables streaming analytics
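A hypothetical sketch of the pull model: log chunks are just more immutable, catalog-tracked artifacts, so a consumer can poll for new ones and apply the committed records to its copy of the data (or republish them on a data bus for the push model). The LogRecord/LogChunk/LogCatalog/CommitFilter types are illustrative assumptions, not an actual API.

// Hypothetical sketch of pulling externalized transaction-log chunks.
import java.util.List;

public class LogTailer {

  /** A single logged change, in a published log format. */
  interface LogRecord {
    long txnId();
    byte[] payload();
  }

  /** An append-only/immutable chunk of the transaction log, tracked in the catalog. */
  interface LogChunk {
    long endPosition();
    List<LogRecord> records();
  }

  /** Catalog view of the externalized transaction log. */
  interface LogCatalog {
    List<LogChunk> chunksAfter(long position);
  }

  /** Ties data MVCC to log records: has this transaction committed? */
  interface CommitFilter {
    boolean isCommitted(long txnId);
  }

  private final LogCatalog catalog;
  private long position;   // how far this consumer has read

  public LogTailer(LogCatalog catalog, long startPosition) {
    this.catalog = catalog;
    this.position = startPosition;
  }

  /** One polling pass: pull new chunks and apply committed records downstream. */
  public void poll(CommitFilter committed) {
    for (LogChunk chunk : catalog.chunksAfter(position)) {
      for (LogRecord rec : chunk.records()) {
        if (committed.isCommitted(rec.txnId())) {
          apply(rec);      // e.g. update a local snapshot, or publish to a data bus
        }
      }
      position = chunk.endPosition();
    }
  }

  private void apply(LogRecord rec) { /* placeholder: consumer-specific */ }
}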

Challenges
- Schema evolution
  - a snapshot cannot require DDL coordination
  - hard to receive schema changes from the firehose of changes
- Externalizing persistence formats is easier said than done