Big Data Patterns. Ron Bodkin Founder and President, Think Big



About Me
Ron Bodkin, Founder and President, Think Big
- I have 9 years of experience working with Big Data and Hadoop.
- In 2010, I founded Think Big to help companies realize measurable value from Big Data.
- Our expertise spans all facets of data science and data engineering and helps our customers drive maximum value from their Big Data initiatives.
- The patterns in this talk come from large-scale deployments in high tech manufacturing & digital marketing.
- Follow me at @ronbodkin

Agenda
- Context
- Patterns
- Conclusions

Big Data: The Key is Variety
- Definition: datasets so complex and large that they are awkward to work with using standard tools and techniques
- Kinds of data: location, social, images, weblogs, videos, text, audio, sensor
- Size is not what is most important: it's variety

How is Information Management Changing?
- Schema on Read? Yes, as step one
  - But data still has underlying structure
  - It's more like agile modeling: reflect as much structure as needed
  - Loosely coupled schemas come without platform guarantees but enable more application flexibility
- Data Modeling isn't dead!
- Metadata is more important than ever
- Data Warehouses are embracing Big Data principles (e.g., elasticity, JSON)

Changes in the Platform
- Entry Level Hadoop cluster circa 2015 (20 nodes)
  - 240 cores
  - 1 PB spinning disk
  - 10 TB RAM
  - 10-40 GbE
  - Low software cost
- Disk transfer times increasing => many disks => DAS (2005-2020)
- Distributed RAM increasingly important to expedite computation, although data volumes are increasing faster
- The network will be the computer (really!) => you can distribute disks separately across high bandwidth fabrics (2020+)
- Changes many assumptions in traditional physical modeling

Changes in Logical Modeling
- JSON-like structures
  - Complex collections of relations, arrays, maps of items
- Graphs
  - Storing complex, dynamically changing (not static) relationships
- Binary/CLOB/specialized data
  - Ability to execute specialized programs to interpret and process

Changes in Physical Modeling
- Big Data unpacks the database metaphor
- Data distribution: key design, sharding/distribution, file formats
- Multiple computational algorithms, e.g., MapReduce, Computational Graph (Spark, Tez), data flow, streaming, graph engines
- Integrity is an application concern
- Storage is cheap
  - Denormalization and materialized views common
  - Yet compression is popular, often for IO savings
- Summarization is orders of magnitude more powerful
- Index lookups are increasingly costly
- Distributed systems impose eventual consistency, reconciliation demands

Leading Financial Asset Manager (Financial Services)
Challenge
- Siloed consumer analytics
- Lack of agility in analysis
- Slow ETL
Solution
- Scalable ETL
- Discovery analytics tech & process
- Cross-channel data science models
- Cloudera Enterprise, HBase, Greenplum
Results
- Scalable processing
- Extracted customer behavior signals from raw data for existing and new behavior models
- Faster time to insight

Leading Enterprise Tech Component Vendor (High Tech Manufacturing)
Challenge
- Data search parties waste engineers' time
- Excess scrap waste, slow time to market
- Reactive analytics model
Solution
- Scalable data lake
- Search and deep analytic queries
- Integrated assembly insights for data science models
- Hive, Impala, Redshift, Elasticsearch
- Big data training and hackathons
Results
- Supply chain line of sight from R&D, manufacturing, to servicing at customer sites
- End-to-end proactive analytics: reduced development time, improved manufacturing yield, increased customer satisfaction
- Proactive analytics at scale led to better engineering theory

Patterns

Important New Patterns
- Denormalized Fact
- Profile
- Event History
- Timeline
- Assembly
- Distributed Sources
- Late Data
- Deep Aggregates
- Recovery
- Multiple Active Clusters

Event History
- Fact table about common events, to allow e.g. cross-channel analytics in context
- E.g., clickstream, posts, purchases, content consumption, device activity
- Stored in columnar format (e.g., Parquet, ORCFile)
- Join the as-was value of slowly changing dimensions
- Often has an extension column of unparsed/not modeled JSON-like data
- Partitioned by event time buckets, perhaps also by other dimension(s)

Event id | Actor id | Time            | Event cols | Dim ids | Dim cols | Ext. data
123      | uid1     | 1/1/15 13:16:11 |            |         |          | { "TstA": 1 }
456      | uid2     | 1/1/15 13:16:14 |            |         |          | { "TstB": 1 }
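As a rough illustration of two points on this slide (partitioning by event-time bucket, and joining the as-was value of a slowly changing dimension), here is a minimal Python sketch. The hourly bucket format, the `tier` dimension history, and the third event are assumptions made up for the example; they are not from the talk.

```python
import bisect
from collections import defaultdict
from datetime import datetime

def partition_key(event, bucket_format="%Y-%m-%d-%H"):
    """Return the event-time bucket (hourly by default) for an event."""
    return event["time"].strftime(bucket_format)

def partition_events(events):
    """Group events into partitions keyed by event-time bucket."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event)].append(event)
    return dict(partitions)

def as_was_value(history, event_time):
    """Look up the dimension value that was in effect at event_time.

    history is a list of (valid_from, value) pairs sorted by valid_from.
    Returns None if the event predates the first version.
    """
    starts = [valid_from for valid_from, _ in history]
    idx = bisect.bisect_right(starts, event_time) - 1
    return history[idx][1] if idx >= 0 else None

# Two rows from the slide plus one hypothetical later event.
events = [
    {"event_id": "123", "actor_id": "uid1",
     "time": datetime(2015, 1, 1, 13, 16, 11), "ext": {"TstA": 1}},
    {"event_id": "456", "actor_id": "uid2",
     "time": datetime(2015, 1, 1, 13, 16, 14), "ext": {"TstB": 1}},
    {"event_id": "789", "actor_id": "uid1",
     "time": datetime(2015, 1, 1, 14, 2, 0), "ext": {}},
]
partitions = partition_events(events)

# Hypothetical slowly changing dimension: a customer tier that changed
# between the first two events.
tier_history = [
    (datetime(2014, 6, 1), "silver"),
    (datetime(2015, 1, 1, 13, 16, 12), "gold"),
]
tiers = [as_was_value(tier_history, e["time"]) for e in events]
```

The as-was lookup keeps each event tied to the dimension value that applied when the event happened, rather than the current value, which is what makes historical analysis reproducible.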

Timeline
- Pivot on event history: table of actors with events over time
- Customer journey, device history
- Enables support/analysis on specific items, long-lived analysis
- May have a hierarchy of actors (e.g., household, individual, device)
- May be an array of events, many columns, or subsorted (cluster key)
- Also stored in columnar format, may be partitioned
- May be updated in near real-time AND batch
- Often holds cached algorithm values (combined Profile)

Actor id | Segments  | Ev1:id | Ev1:fact        | Ev1:dims | Ev1:ext       | Ev2:id
uid1     | [1, 3, 7] | 123    | 1/1/15 13:16:11 |          | { "TstA": 1 } | 789
uid2     | [2, 3]    | 456    | 1/1/15 13:16:14 |          | { "TstB": 1 } | 0ab
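The pivot from Event History to Timeline can be sketched in a few lines of Python. This is only an in-memory illustration under assumed field names (`actor_id`, `event_id`, `time`); the timestamps for events 789 and 0ab are hypothetical, and a real deployment would run the pivot on a distributed engine.

```python
from collections import defaultdict
from datetime import datetime

def build_timeline(events):
    """Pivot a flat event history into one time-ordered event list per actor."""
    timeline = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        timeline[event["actor_id"]].append(event)
    return dict(timeline)

# Event ids follow the slide; times for 789 and 0ab are made up.
event_history = [
    {"event_id": "789", "actor_id": "uid1", "time": datetime(2015, 1, 2, 9, 0, 0)},
    {"event_id": "456", "actor_id": "uid2", "time": datetime(2015, 1, 1, 13, 16, 14)},
    {"event_id": "123", "actor_id": "uid1", "time": datetime(2015, 1, 1, 13, 16, 11)},
    {"event_id": "0ab", "actor_id": "uid2", "time": datetime(2015, 1, 3, 8, 30, 0)},
]
timeline = build_timeline(event_history)
journey_uid1 = [e["event_id"] for e in timeline["uid1"]]
journey_uid2 = [e["event_id"] for e in timeline["uid2"]]
```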

Event Analytics
- Propensity/segmentation
  - May be scored in real-time using Timeline/Profile
  - May be hybrid scored batch using Event History
  - Trained from timeline
- Attribution
  - Score impact of past events on new event (e.g., purchase, churn)
  - Algorithms range from simple rules to Shapley value
  - Natural in timeline
- Reporting, exploration
  - Often via Deep Aggregates, using HyperLogLog
- Discovery
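To make the Shapley end of the attribution range concrete, here is a small exact computation in Python. It averages each channel's marginal contribution over every ordering, which is only practical for a handful of channels. The channel names and conversion rates are invented for illustration.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    value maps a frozenset of players to the worth of that coalition.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for player in order:
            with_player = seen | {player}
            totals[player] += value(with_player) - value(seen)
            seen = with_player
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical conversion rate observed for each set of channels touched.
CONVERSION = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.10,
    frozenset({"search"}): 0.20,
    frozenset({"email", "search"}): 0.40,
}
attribution = shapley_values(["email", "search"], lambda s: CONVERSION[s])
```

With these numbers email is credited 0.15 and search 0.25, which together account for the full 0.40 conversion rate of the combined journey.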

Event Data Management
- Identity merge
  - Discovery of new identities (e.g., cookie logs in, Facebook connect)
  - Indirection or rewrites
  - Requires rescoring
- Expiration/archival
  - Efficiency, policy requirements
- Governance
  - Lineage & security
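One way to implement the "indirection" option for identity merge is a union-find (disjoint-set) map from every known identifier to a canonical one, so existing events need no rewrite. The sketch below is an assumption about how this could be done, not the approach described in the talk; the identifier strings are hypothetical.

```python
class IdentityMap:
    """Union-find map from any known identifier to its canonical identity."""

    def __init__(self):
        self.parent = {}

    def find(self, ident):
        """Return the canonical identity for ident, registering it if new."""
        self.parent.setdefault(ident, ident)
        root = ident
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[ident] != root:
            next_ident = self.parent[ident]
            self.parent[ident] = root
            ident = next_ident
        return root

    def merge(self, a, b):
        """Record that a and b are the same actor; return the canonical id."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller id so results are deterministic.
            keep, drop = sorted((root_a, root_b))
            self.parent[drop] = keep
        return self.find(a)

ids = IdentityMap()
ids.merge("cookie:abc", "user:42")     # an anonymous cookie logs in
ids.merge("device:xyz", "cookie:abc")  # a device is later linked to the cookie
canonical = ids.find("device:xyz")
```

After a merge, any cached scores keyed by the dropped identity are stale, which is why the slide notes that merging requires rescoring.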

Network
- Ongoing status of configuration
  - Parts in assembly
  - Related items (versions)
  - Social groups
- Can be people, devices, etc.
- Maintain links in graph structure
  - May be current or historical
- Use links to pull full context from Event History or Timeline
- Search -> simple query -> complex analytics
  - E.g., transitive closure, impact analysis
- Technologies
  - Giraph, GraphX
  - TitanDB, Neo4j
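Impact analysis over such links amounts to a transitive closure from one node. A minimal breadth-first sketch in Python follows; the parts-in-assembly graph is hypothetical, and at scale this would run on a graph engine such as those listed above.

```python
from collections import deque

def impacted(links, start):
    """Return every node reachable from start by following links transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen

# Hypothetical "is part of" links: component -> assemblies that contain it.
part_of = {
    "capacitor-7": ["board-A", "board-B"],
    "board-A": ["drive-1"],
    "board-B": ["drive-2"],
    "drive-1": ["rack-9"],
    "chip-3": ["board-C"],
}
affected = impacted(part_of, "capacitor-7")
```

The resulting set of ids can then be used to pull full context for each affected item from Event History or Timeline.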

Distributed Sources
- Root cause of complexity of ingestion
- Unlike simple all-or-nothing feeds
  - May have many distributed sources feeding data
  - It's critical to know whether all (or enough) data has arrived
- Goals
  - Only produce analytic results when sufficient data has arrived
  - Provide provenance: timeliness & completeness statistics
- Need SLAs about timeliness and required fraction of data
  - Control totals
  - Metadata about process (expected lineage)
  - Heartbeats/configuration
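A completeness gate based on control totals might look like the following Python sketch. The source names, record counts, and the 95% required fraction are assumptions chosen for the example, standing in for whatever the SLA specifies.

```python
def completeness(expected, received, required_fraction=0.95):
    """Compare received record counts against per-source control totals.

    expected and received map source name -> record count. Counts above the
    control total are capped so one source cannot mask another's shortfall.
    """
    total_expected = sum(expected.values())
    total_received = sum(
        min(received.get(source, 0), count) for source, count in expected.items()
    )
    fraction = total_received / total_expected if total_expected else 1.0
    missing = sorted(
        source for source, count in expected.items()
        if received.get(source, 0) < count
    )
    return {
        "fraction": fraction,
        "sufficient": fraction >= required_fraction,
        "incomplete_sources": missing,
    }

# Hypothetical control totals reported by three plants for one time bucket.
control_totals = {"plant_a": 1000, "plant_b": 500, "plant_c": 500}
early = completeness(control_totals, {"plant_a": 1000, "plant_b": 450})
later = completeness(
    control_totals, {"plant_a": 1000, "plant_b": 450, "plant_c": 500}
)
```

In the early check only 72.5% of expected records have arrived, so downstream analytics would wait; the later check reaches 97.5% and passes, while still reporting `plant_b` as incomplete for provenance.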

Late Data
- Data may be delayed due to
  - Upstream system failures (server down, esp. with unreliable delivery; network outage)
  - Offline/disconnected devices (endemic with mobile & IoT)
- Metadata to track lineage is critical
- Define a delay time after which, with high confidence, sufficient data has arrived
  - Process authoritative derived data after that time
  - May process incremental/incomplete data earlier (a la economic statistics)
  - May re-process in emergency (restatement)
  - May include changed data in later period
- Report on how much data has arrived late
- Implementation: bucket on event time, secondary on delay epoch (partitions for late data)
- Zipfian distribution
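The implementation note (bucket on event time, secondary on delay epoch) can be sketched as a two-part partition key. The one-hour bucket, the six-hour delay epoch, and the arrival times below are assumptions for illustration only.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def late_partition(event_time, arrival_time,
                   bucket=timedelta(hours=1), delay_epoch=timedelta(hours=6)):
    """Return (event_bucket_start, delay_epoch_index) for one record.

    Index 0 means the record arrived before its event-time bucket closed;
    index n >= 1 means it arrived during the n-th delay epoch after closing.
    """
    bucket_start = EPOCH + ((event_time - EPOCH) // bucket) * bucket
    bucket_end = bucket_start + bucket
    delay = max(arrival_time - bucket_end, timedelta(0))
    index = 0 if delay == timedelta(0) else delay // delay_epoch + 1
    return bucket_start, index

def late_fraction(records):
    """Fraction of (event_time, arrival_time) records that arrived late."""
    late = sum(1 for ev, arr in records if late_partition(ev, arr)[1] > 0)
    return late / len(records) if records else 0.0

records = [
    # Arrives while its 13:00 bucket is still open.
    (datetime(2015, 1, 1, 13, 16, 11), datetime(2015, 1, 1, 13, 20, 0)),
    # Arrives 7.5 hours after its 13:00 bucket closed at 14:00.
    (datetime(2015, 1, 1, 13, 16, 14), datetime(2015, 1, 1, 21, 30, 0)),
]
on_time_key = late_partition(*records[0])
late_key = late_partition(*records[1])
share_late = late_fraction(records)
```

Writing late records to their own secondary partition means the on-time partition never has to be rewritten, and the late share per bucket falls out of the partition sizes.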

Conclusions

Probabilistic Data Structures
- Increasingly valuable as an optimization technique, e.g.,
- Bloom filters
  - Hashed key values for array
  - Check key to see if it may be present: indexing/filtering sparse reads
- HyperLogLog, sketch sets
  - Multiple hashes used to estimate count of unique items
  - Far more space compact (KBs to count billions of items +/- 2%)
  - Can be composed (unlike exact unique counts), e.g., across time, categories
- MinHash
  - Least hashed value in common between two sets
  - Used to identify duplicates, estimate overlap in arbitrary sets
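To show why these structures are compact and composable, here are simplified Python sketches of a Bloom filter and a HyperLogLog. Both are teaching versions under assumed parameters (SHA-1 as the hash, 1024 registers, 3 hash functions), not production implementations; real systems use tuned libraries.

```python
import hashlib
import math

def _hash64(text):
    """64-bit hash derived from SHA-1 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.num_hashes, self.bits = size_bits, num_hashes, 0

    def _positions(self, item):
        return [_hash64(f"{i}:{item}") % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

class HyperLogLog:
    """Approximate distinct count using 2**p small registers."""

    def __init__(self, p=10):
        self.p, self.m = p, 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = _hash64(str(item))
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Sketch of the union: element-wise max of the registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

bloom = BloomFilter()
for key in ("uid1", "uid2"):
    bloom.add(key)

monday, tuesday, both_days = HyperLogLog(), HyperLogLog(), HyperLogLog()
for user in range(0, 6000):
    monday.add(user)
    both_days.add(user)
for user in range(4000, 10000):
    tuesday.add(user)
    both_days.add(user)
combined = monday.merge(tuesday)
```

Merging the Monday and Tuesday sketches yields exactly the registers of a sketch built over both days, which is the composability the slide contrasts with exact unique counts: overlapping users are not double-counted, and no raw data needs to be rescanned.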

Anti-Patterns
- 3rd Normal Form, Star Schema, Snowflake Schema
  - Index lookups are slow in general
  - Focus on partitioned reads, not disk seeks
  - Poor results in practice
  - Not natural representations for repeating events, nested structure
- Use of SSD, maturing optimizers, and platform updates (Kudu?) are slowly improving this; the industry would love this to happen
- Expect data marts to work in Big Data before data warehouses do

Conclusions
- Much of Big Data today is trade-craft
  - Learned lore & derived from first principles
- As we scale data lakes & analytics, it is critical to have a common vocabulary and shared understandings
- I'd love your input on common patterns & practices
- Look for blogs with more depth on each pattern at http://thinkbig.teradata.com/author/rbodkin/
- Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com