Synthetic Data Generation for Realistic Analytics Examples and Testing

Similar documents

HiBench Introduction. Carson Wang Software & Services Group

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Apache Flink Next-gen data analysis. Kostas

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Scaling Out With Apache Spark. DTL Meeting Slides based on

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Ali Ghodsi Head of PM and Engineering Databricks

Unified Big Data Analytics Pipeline. 连城

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Hadoop Ecosystem B Y R A H I M A.

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop IST 734 SS CHUNG

Deploying Hadoop with Manager

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Accelerating and Simplifying Apache

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Workshop on Hadoop with Big Data

Big Data Analytics. Lucas Rego Drumond

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Peers Techno log ies Pv t. L td. HADOOP

Understanding Hadoop Performance on Lustre

Big Data Training - Hackveda

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Extending Hadoop beyond MapReduce

Hadoop & Spark Using Amazon EMR

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Apache Hadoop: Past, Present, and Future

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

MapReduce and Hadoop Distributed File System

Big Data on Microsoft Platform

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

MapReduce and Hadoop Distributed File System V I J A Y R A O

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

Big Data Analytics OverOnline Transactional Data Set

Ubuntu and Hadoop: the perfect match

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

How Companies are! Using Spark

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

HadoopRDF : A Scalable RDF Data Analysis System

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

How To Create A Data Visualization With Apache Spark And Zeppelin

Consulting and Systems Integration (1) Networks & Cloud Integration Engineer

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop and Map-Reduce. Swati Gore

BIG DATA What it is and how to use?

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Dell In-Memory Appliance for Cloudera Enterprise

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

DduP Towards a Deduplication Framework utilising Apache Spark

HYBRID CLOUD SUPPORT FOR LARGE SCALE ANALYTICS AND WEB PROCESSING. Navraj Chohan, Anand Gupta, Chris Bunch, Kowshik Prakasam, and Chandra Krintz

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

The Greenplum Analytics Workbench

Large scale processing using Hadoop. Ján Vaňo

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Big Business, Big Data, Industrialized Workload

Manifest for Big Data Pig, Hive & Jaql

BIG DATA APPLICATIONS

Integrating a Big Data Platform into Government:

A Brief Introduction to Apache Tez

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

Cost-Effective Business Intelligence with Red Hat and Open Source

CSE-E5430 Scalable Cloud Computing Lecture 11

Implement Hadoop jobs to extract business value from large and varied data sets

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

ANALYTICS CENTER LEARNING PROGRAM

How To Scale Out Of A Nosql Database

CSE-E5430 Scalable Cloud Computing. Lecture 4

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Testing Big data is one of the biggest

Real Time Big Data Processing

Scalable Architecture on Amazon AWS Cloud

What's New in SAS Data Management

A Brief Outline on Bigdata Hadoop

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

HPCHadoop: MapReduce on Cray X-series

Transcription:

Synthetic Data Generation for Realistic Analytics Examples and Testing Ronald J. Nowling Red Hat, Inc. rnowling@redhat.com http://rnowling.github.io/

Who Am I? Software Engineer at Red Hat Data Science Team, Emerging Technologies Evaluate open-source Big Data space Ensure software works for Red Hat customers Promote data science internally through consulting projects Apache BigTop PMC 2

Synthetic Data No licensing, privacy, or intellectual property concerns Scalable: Laptops to Clusters! More reliable than external data sets Enable more realistic example applications Enable more comprehensive testing than wordcount and TeraSort 3

Data Transformation and Summarization Pipeline Accounts Transform Raw Text Parse Clean & Validate Summarize Transform Raw Text Parse Clean & Validate Summarize Aggregate Cumulative Transform Raw Text Parse Clean & Validate Summarize Daily 4

Data Transformation and Summarization Pipeline Accounts Transform Raw Text Parse Clean & Validate Summarize Transform Raw Text Parse Clean & Validate Summarize Aggregate Cumulative Transform Raw Text Parse Clean & Validate Summarize Daily 5

Data Transformation and Summarization Pipeline Accounts Transform Raw Text Parse Clean & Validate Summarize Transform Raw Text Parse Clean & Validate Summarize Aggregate Cumulative Transform Raw Text Parse Clean & Validate Summarize Daily 6

Data Transformation and Summarization Pipeline Accounts Transform Raw Text Parse Clean & Validate Summarize Transform Raw Text Parse Clean & Validate Summarize Aggregate Cumulative Transform Raw Text Parse Clean & Validate Summarize Daily 7

Data Transformation and Summarization Pipeline Accounts Transform Raw Text Parse Clean & Validate Summarize Transform Raw Text Parse Clean & Validate Summarize Aggregate Cumulative Transform Raw Text Parse Clean & Validate Summarize Daily 8

Data Transformation and Summarization Pipeline Accounts Transform Raw Text Parse Clean & Validate Summarize Transform Raw Text Parse Clean & Validate Summarize Aggregate Cumulative Transform Raw Text Parse Clean & Validate Summarize Daily 9

Synthetic Data Sensitive Data Real data on cluster for scalability testing and validation Synthetic data for local development and testing Needed smaller data sets for checking calculations Total aggregation results requires re-running old pipeline Extra burden on operations team Delay for development team 10

Validation with Synthetic Data Data Generator Accounts Transformation and Summarization Pipeline Expected Daily Daily Cumulative Expected Cumulative Validation Script 11

Validation with Synthetic Data Data Generator Accounts Transformation and Summarization Pipeline Expected Daily Daily Cumulative Expected Cumulative Validation Script 12

Validation with Synthetic Data Data Generator Accounts Transformation and Summarization Pipeline Expected Daily Daily Cumulative Expected Cumulative Validation Script 13

Validation with Synthetic Data Data Generator Accounts Transformation and Summarization Pipeline Expected Daily Daily Cumulative Expected Cumulative Validation Script 14

Validation with Synthetic Data Data Generator Accounts Transformation and Summarization Pipeline Expected Daily Daily Cumulative Expected Cumulative Validation Script 15

Validation with Synthetic Data Data Generator Accounts Transformation and Summarization Pipeline Expected Daily Daily Cumulative Expected Cumulative Validation Script 16

Issues Tackled Error in account validation introduced while refactoring code Usage of the correct join types Validation of date-time operations Correct Output Formats 17

Apache BigTop BigPetStore Blueprints Problem domain: Transactions for a fictional chain of pet stores BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data Blueprints for big data ecosystem Hadoop: MapReduce / Pig / Hive / Mahout Spark Flink (in progress) 18

BigPetStore 19

BigPetStore HCFS 20

BigPetStore HCFS Core (RDDs) 21

BigPetStore HCFS Core (RDDs) Spark SQL 22

BigPetStore HCFS Core (RDDs) Spark SQL MLLib 23

Team Cluster ~10 nodes 40 cores, 400GB RAM per node 24

Potential Issues Infrastructure Storage Software Installation Software Upgrades Spark Configuration Tuning User Management 25

Real Stories Creating a new user User Gluster permissions incorrect Cluster upgrade Spark upgrade didn t take because of issue with Ansible role configuration Wiped out our spark.conf master / mesos settings wrong Gluster moint points disappeared on reboot Not set in fstab 26

k8petstore Users BPS Data Generator Public IP Proxy Web Application BPS Data Generator BPS Data Generator Redis Master Redis Slave Redis Slave Redis Slave 27

k8petstore Users BPS Data Generator Public IP Proxy Web Application BPS Data Generator BPS Data Generator Redis Master Redis Slave Redis Slave Redis Slave 28

k8petstore Users BPS Data Generator Public IP Proxy Web Application BPS Data Generator BPS Data Generator Redis Master Redis Slave Redis Slave Redis Slave 29

k8petstore Users BPS Data Generator Public IP Proxy Web Application BPS Data Generator BPS Data Generator Redis Master Redis Slave Redis Slave Redis Slave 30

k8petstore 31

Use Cases Configuration Scalability Fault Tolerance 32

k8petstore OpenContrail networking solution demo 1 Kubernetes JuJu Charm documentation example 2 Kubernetes v1.0 launch talk at OSCON 3 [1] - https://pedrormarques.wordpress.com/2015/04/24/kubernetes-andopencontrail/ [2] - http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html [3] - http://www.oscon.com/open-source-2015/public/schedule/detail/45281 33

APACHE BIGTOP DATA GENERATORS 34

BigPetStore 35

BigTop Weatherman 36

BigTop Bazaar 37

Vision Encourage synthetic data generation for testing and realistic examples Serve as a resource for the larger Apache and open source communities Emphasis on Flexibility Scalability Realism We look forward to collaborating and getting folks involved! 38

Conclusion Synthetic data generators and blueprints are useful! Case studies: Data Processing Pipelines Cluster Deployment Kubernetes BigPetStore and BigTop Data Generators efforts in Apache BigTop Open invitation to get involved and collaborate 39

Resources http://bigtop.apache.org/ http://github.com/apache/bigtop http://rnowling.github.io/ 40

QUESTIONS 41