Big Data Provenance - Challenges and Implications for Benchmarking. Boris Glavic. IIT DBGroup. December 17, 2012



Similar documents
Big Data Provenance: Challenges and Implications for Benchmarking

Big Data Provenance: Challenges and Implications for Benchmarking

Provenance for Data Mining

Reversing Statistics for Scalable Test Databases Generation

LDV: Light-weight Database Virtualization

bigdata Managing Scale in Ontological Systems

Characterizing Task Usage Shapes in Google s Compute Clusters

Unified Big Data Analytics Pipeline. 连 城

Scaling Out With Apache Spark. DTL Meeting Slides based on

Fast Innovation requires Fast IT

<Insert Picture Here> Extending Hyperion BI with the Oracle BI Server

BSC vision on Big Data and extreme scale computing

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data and Big Analytics

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Inferring Fine-Grained Data Provenance in Stream Data Processing: Reduced Storage Cost, High Accuracy

How To Win At A Game Of Monopoly On The Moon

Service-Oriented Architectures

NoSQL for SQL Professionals William McKnight

Hazeline U. Asuncion. SourceTrac: Tracing Data Sources within Spreadsheets. IPAW 2012, Springer Verlag, June 2012 (to appear).

Amazon Web Services. Elastic Compute Cloud (EC2) and more...

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Research Problems in Data Provenance

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Crack Open Your Operational Database. Jamie Martin September 24th, 2013

Patterns of Information Management

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data and Analytics: Challenges and Opportunities

Supporting Database Provenance under Schema Evolution

SAP Data Services 4.X. An Enterprise Information management Solution

Big Data Analytics. Chances and Challenges. Volker Markl

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Telemetry Database Query Performance Review

Network Infrastructure Services CS848 Project

Challenges for Data Driven Systems

The basic data mining algorithms introduced may be enhanced in a number of ways.

Big Data Analytics Nokia

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Information Integration for Improved City Construction Supervision

CS Hot topics in database systems: Data Provenance

ICT Perspectives on Big Data: Well Sorted Materials

Duke University

What s New with Informatica Data Services & PowerCenter Data Virtualization Edition

Provenance for the Cloud

Reusable Data Access Patterns

and NoSQL Data Governance for Regulated Industries Using Hadoop Justin Makeig, Director Product Management, MarkLogic October 2013

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

A Generic Auto-Provisioning Framework for Cloud Databases

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

Database Design Patterns. Winter Lecture 24

Data sharing in the Big Data era

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Eucalyptus-Based. GSAW 2010 Working Group Session 11D. Nehal Desai

Report 02 Data analytics workbench for educational data. Palak Agrawal

Amr El Abbadi. Computer Science, UC Santa Barbara

Big Data, Deep Learning and Other Allegories: Scalability and Fault- tolerance of Parallel and Distributed Infrastructures.

Big Data looks Tiny from the Stratosphere

Research of Service Granularity Base on SOA in Railway Information Sharing Platform

Foundations and Applications of Data Provenance

Challenges for Provenance in Cloud Computing

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Dynamic Querying In NoSQL System

Data Management in the Cloud

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

Next-Generation Cloud Analytics with Amazon Redshift

Hybrid Solutions Combining In-Memory & SSD

ERM Technology for Insurance Companies

Talend Metadata Manager. Reduce Risk and Friction in your Information Supply Chain

High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo

Lecture Data Warehouse Systems

Preparing Your Data For Cloud

Provenance as First Class Cloud Data

LDIF - Linked Data Integration Framework

ProTrack: A Simple Provenance-tracking Filesystem

Moving From Hadoop to Spark

VStore++: Virtual Storage Services for Mobile Devices

Energy Efficient MapReduce

Enterprise Resource Planning Analysis of Business Intelligence & Emergence of Mining Objects

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Knowledge Discovery from patents using KMX Text Analytics

Bringing Big Data Modelling into the Hands of Domain Experts

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

Smarter Planet evolution

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Trustworthiness of Big Data

Giving life to today s media distribution services

Data Modeling for Big Data

Improv: Flexible Data Provenance for Relational Databases

Rackspace Cloud Databases and Container-based Virtualization

Using Data Mining and Machine Learning in Retail

Towards Analytical Data Management for Numerical Simulations

Ganzheitliches Datenmanagement

The BigData Top100 List Initiative. Chaitan Baru San Diego Supercomputer Center

MDM and Data Warehousing Complement Each Other

Building the Internet of Things Jim Green - CTO, Data & Analytics Business Group, Cisco Systems

Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques

Transcription:

Big Data Provenance - Challenges and Implications for Benchmarking Boris Glavic IIT DBGroup December 17, 2012

Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions

Disclaimer My previous benchmark experience Limited to being a user My main background is in provenance... mostly for RDBMS The main point of this talk (visions) Provenance as a benchmark use-case Provenance as a supporting technology for benchmarking I will make some outrageous claims to get my point across Slide 1 of 23 Boris Glavic Big Data Provenance - Provenance

What is Provenance? Information about the creation process and origin of data Data items and collections Data item = atomic unit of data Transformation Abstraction for processing Inputs and outputs are data items (collections) Example Slide 2 of 23 Boris Glavic Big Data Provenance - Provenance

Types of Provenance Given a data item d Data Provenance Which data items where used in the generation of d Representation: e.g., a set of data items Transformation Provenance Which transformations generated d transitively Additional Information Execution Environment User Time... Slide 3 of 23 Boris Glavic Big Data Provenance - Provenance

Data and Transformation Provenance Example (Data Provenance) Data Provenance Provenance for Slide 4 of 23 Boris Glavic Big Data Provenance - Provenance

Data and Transformation Provenance Example (Transformation Provenance) Transformation Provenance Provenance for Slide 4 of 23 Boris Glavic Big Data Provenance - Provenance

Coarse-grained vs. Fine-grained Data Provenance Coarse-grained Provenance Transformations are handled as black-boxes Each output of transformation... depends on all inputs Fine-grained Provenance Consider processing logic of transformation to determine data-flow All data items that contributed to a result Sufficiency: sufficient to derive d through transformation Necessity: necessary in deriving d through transformation Slide 5 of 23 Boris Glavic Big Data Provenance - Provenance

Granularity Example (Coarse-grained) coarse grained Provenance for Slide 6 of 23 Boris Glavic Big Data Provenance - Provenance

Granularity Example (Fine-grained) fine grained Provenance for Slide 6 of 23 Boris Glavic Big Data Provenance - Provenance

Use-cases Data debugging E.g., tracing an erroneous data item back to erroneous inputs Trust and Quality E.g., computing trust based on trust in data items in provenance Probabilistic data Computing probability of result based on probabilities in provenance Security Enforce access-control on query results based on provenance Understanding misbehaviour of systems E.g., detecting security breaches Slide 7 of 23 Boris Glavic Big Data Provenance - Provenance

Annotations Provenance as Annotations Model provenance as annotations on the data Provenance management = Efficient storage of annotations Propagating annotations through transformations Example (Systems applying this approach) Perm Orchestra DBNotes Trio... and many more Slide 8 of 23 Boris Glavic Big Data Provenance - Provenance

Annotations References Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373 396, 2005. Jennifer Widom. Trio: A System for Managing Data, Uncertainty, and Lineage. Managing and Mining Uncertain Data, 2008. Boris Glavic and Gustavo Alonso. Perm: Processing Provenance and Data on the same Data Model through Query Rewriting. ICDE, 174-185, 2009. T.J. Green, G. Karvounarakis, and Z.G.I.V. Tannen. Provenance in ORCHESTRA. IEEE Data Eng. Bull., 33(3):9-16, 2010. Slide 8 of 23 Boris Glavic Big Data Provenance - Provenance

Size Considerations Provenance models dependencies between inputs and outputs of a transformation A subset of Inputs Outputs Can be quadratic in size DAG of transformations multiply by maximal path length Provenance of two data items often overlaps Reuse of intermediate results Coarse-grained provenance Slide 9 of 23 Boris Glavic Big Data Provenance - Provenance

Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions

Big Provenance Challenges Big data analytics is all about agility un/semi-structured data and no extensive meta-data available hard to define provenance model Transparency of distribution Provenance use-case may require location information Analytics that cross system boundaries Inter-operability of storage formats and systems Slide 10 of 23 Boris Glavic Big Data Provenance - Big Data Provenance - Big Provenance

Example - Provenance for Map-Reduce Workflows Fine-grained provenance for workflows that are DAGs of map and reduce functions Add provenance to values: (key, value) (key, (value, p)) Provenance determined by map/reduce function semantics Wrap map and reduce functions to handle provenance Strip of and cache provenance Call original function Re-attach provenance according to function semantics References R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. CIDR, 273-283, 2011. Slide 11 of 23 Boris Glavic Big Data Provenance - Big Data Provenance - Big Provenance

Example - OS Provenance for the Cloud PASS with Cloud Storage Backend File and process level provenance Each node runs a modified linux kernel (PASS) Intercept system-calls for file and process creation Store provenance in Amazon S3, SimpleDB, and/or SQS Instrumenting the Xen Hypervisor File and process level provenance for virtual machines Intercept hyper-calls for file and process creation Store provenance in DB Slide 12 of 23 Boris Glavic Big Data Provenance - Big Data Provenance - Big Provenance

Example - OS Provenance for the Cloud References M.I. Seltzer, P. Macko, and M.A. Chiarini. Collecting provenance via the xen hypervisor. TaPP, 2011. K.K. Muniswamy-Reddy, P. Macko, and M. Seltzer. Provenance for the cloud. FAST, 15 14, 2010. Slide 12 of 23 Boris Glavic Big Data Provenance - Big Data Provenance - Big Provenance

Example - Sketching Distributed Provenance File + process level provenance modelled as DAG intra- and inter-host dependencies inter = socket communication Intercept system calls Distributed provenance graph storage Each node stores the part of the provenance graph corresponding to its local processing Links to other host for inter-host dependencies Query provenance Nodes exchange summaries of provenance graphs using bloom filters References T. Malik, L. Nistor, and A. Gehani. Tracking and sketching distributed data provenance. escience, 190 197, 2010. Slide 13 of 23 Boris Glavic Big Data Provenance - Big Data Provenance - Big Provenance

Take-away Message Challenging problems Approaches that address distribution directly Approaches that map relational techniques to Big Data more work needed! Slide 14 of 23 Boris Glavic Big Data Provenance - Big Data Provenance - Big Provenance

Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking Provenance as Data and Workloads Provenance-based Performance Metrics Profiling and Monitoring with Provenance 4 Conclusions

Why Benchmark? Impress the Customer - Competition Competition should be fair Stable benchmark Similar to real world workloads Precise definitions probably several iterations of design + test Understand (and Improve) System Performance Stable benchmark Similar to real world workloads Profiling and monitoring Slide 15 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Benchmarking 101 A Comprehensive Benchmark... Define Parameters Data-set specifications Structure and interrelationships Value ranges and distributions Data type definitions Provide data generator and/or validator Workload specification Specify jobs Restrictions for running the jobs Provide workload generator and/or validator Benchmark metrics specification What to measure How to measure Slide 16 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Benchmarking 101 A Comprehensive Benchmark... TPC-H Define Parameters e.g., SF Data-set specifications Standard Specification Structure and interrelationships Schema provided Value ranges and distributions e.g., L DISCOUNT column values between 0.00 and 1.00 Data type definitions e.g., date is YYYY-MM-DD at least 14 years range Provide data generator and/or validator dbgen Workload specification Standard Specification Specify jobs Query and update workloads Restrictions for running the jobs Provide workload generator and/or validator qgen Benchmark metrics specification Standard Specification What to measure e.g., Composite Query-per-Hour Metric How to measure e.g., first query char to last output char Slide 16 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Big Data Benchmarking Data generation is a necessity Shipping pre-generated datasets of Big Data dimensions is unfeasible Complex and mixed workloads better match real-world use of Big Data analytics Hard to understand why a system performs bad/well on a complex workload Robustness of performance and scalability important Measure it! What is a good metric for robustness and how to compute it? Slide 17 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Generating Large Datasets and Workloads Provenance data is large Compute provenance to generate large datasets Run provenance-aware system to collect provenance for a small workload The resulting data-set can be used in the benchmark Generate large data-sets from simple generators Example Simple task: sharpen an image Apply approach that instruments the program to detect data-dependencies as provenance Huge and very fine-granular provenance Slide 18 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Stress-Testing Exploitation of Data Commonalities Provenance data large, but large overlap Use provenance to test how well a system is able to exploit data commonalities Example Streaming data with multi-step windowed aggregation Can predict overlap in provenance Generate provenance Queries over provenance as workload Slide 19 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Data-centric Performance Information as Provenance Assume the hypothetical existence of god Big provenance system Efficiently computes all types of provenance For any Big Data system Also evaluates any types of queries Data-centric performance information as provenance For each data item use god to record execution times of each transformation in provenance history of data movements for each data item... Example (Measure Robustness) Benchmark that requires several runs of mixed job workloads Workloads between runs overlap Performance metric: Resource consumption variation per job Slide 20 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Using Provenance in Profiling Profiling and Monitoring Identify causes for (benchmark) results Even more important for Big Data Diverse workloads Distribution... Example (How can Provenance help?) How do I profile a program on one node e.g., instrument and collect statistics about execution Provenance provides data-centric view Identify repeated computations Overlaps in provenance (same data, different nodes/transformations) Identify unnecessary computations Computations that are not in the fine-grained provenance of anything Slide 21 of 23 Boris Glavic Big Data Provenance - Implications for Benchmarking

Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions

Conclusions Big Provenance: Many interesting research problems Provenance as benchmark use-case Provenance for benchmark data generation Provenance-based performance metrics Supporting profiling and monitoring Slide 22 of 23 Boris Glavic Big Data Provenance - Conclusions

Questions? Info Boris Glavic: http://www.cs.iit.edu/~glavic/ IIT DBGroup: http://www.cs.iit.edu/~dbgroup/ Perm: http://permdbms.sourceforge.net/ Slide 23 of 23 Boris Glavic Big Data Provenance - Conclusions