Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany

Similar documents

Efficiently Identifying Inclusion Dependencies in RDBMS

Parallels Plesk Automation

Application Tool for Experiments on SQL Server 2005 Transactions

Vectorwise 3.0 Fast Answers from Hadoop. Technical white paper

Advanced SQL Query To Flink Translator

Ignify ecommerce. Item Requirements Notes

Use of Hadoop File System for Nuclear Physics Analyses in STAR

A Performance Analysis of Distributed Indexing using Terrier

LDIF - Linked Data Integration Framework

BIG DATA What it is and how to use?

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

CIS 631 Database Management Systems Sample Final Exam

Beyond Sampling: Fast, Whole- Dataset Analytics for Big Data on Hadoop

Pepper: An Elastic Web Server Farm for Cloud based on Hadoop. Subramaniam Krishnan, Jean Christophe Counio Yahoo! Inc. MAPRED 1 st December 2010

Unique column combinations

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Very Large Enterprise Network, Deployment, Users

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

GraySort on Apache Spark by Databricks

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Covering or Complete? Discovering Conditional Inclusion Dependencies

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

How To Create A Large Data Storage System

Benchmarking Cassandra on Violin

Flexible Reliable Affordable Web-scale computing.

How To Scale Out Of A Nosql Database

Very Large Enterprise Network Deployment, 25,000+ Users

IncidentMonitor Server Specification Datasheet

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

Hadoop Architecture. Part 1

Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library

Oracle Platform as a Service and Infrastructure as a Service Public Cloud Service Descriptions-Metered & Non-Metered.

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Data Management in SAP Environments

Adonis Technical Requirements

SOLUTION BRIEF: SLCM R12.8 PERFORMANCE TEST RESULTS JANUARY, Submit and Approval Phase Results

Performance test report

A stream computing approach towards scalable NLP

Understanding Hadoop Performance on Lustre

Data Domain Profiling and Data Masking for Hadoop

Productivity frameworks in big data image processing computations - creating photographic mosaics with Hadoop and Scalding

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Oracle Big Data SQL Technical Update

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

MapReduce Job Processing

MapReduce and Hadoop Distributed File System

Similarity Search in a Very Large Scale Using Hadoop and HBase

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Apache Flink. Fast and Reliable Large-Scale Data Processing

Integrating Apache Spark with an Enterprise Data Warehouse

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Ahsay Online Backup Suite v5.0. Whitepaper Backup speed

Business Intelligence Extensions for SPARQL

Benchmarking the Availability and Fault Tolerance of Cassandra

SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG

Parquet. Columnar storage for the people

efiletexas.gov Infrastructure Guidelines

Legal Notices Introduction... 3

The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Federated Query Processing over Linked Data

MapReduce for Data Warehouses

Certified Big Data and Apache Hadoop Developer VS-1221

BIG DATA: A CASE STUDY ON DATA FROM THE BRAZILIAN MINISTRY OF PLANNING, BUDGETING AND MANAGEMENT

Pricing Guide. Overview FD Enterprise License SaaS Packages Dedicated SaaS Shared SaaS. Page 2 Page 3 Page 4 Page 5 Page 8

A. Aiken & K. Olukotun PA3

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

MapReduce and Hadoop Distributed File System V I J A Y R A O

Promise Pegasus R6 Thunderbolt RAID Performance

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Cost-Effective Business Intelligence with Red Hat and Open Source

Xpresstransfer Online Backup Suite v5.0 Whitepaper Backup speed analysis

Chapter 7. Using Hadoop Cluster and MapReduce

Business Intelligence and Column-Oriented Databases

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Evaluation of Dell PowerEdge VRTX Shared PERC8 in Failover Scenario

AUTOMATIC DATABASE CONSTRUCTION FROM NATURAL LANGUAGE REQUIREMENTS SPECIFICATION TEXT

Synergis Software 18 South 5 TH Street, Suite 100 Quakertown, PA , version

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Proposal of Dynamic Load Balancing Algorithm in Grid System

Data Mining with Hadoop at TACC

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin

CA Nimsoft Monitor. Probe Guide for iseries System Statistics Monitoring. sysstat v1.1 series

Comparison of Different Implementation of Inverted Indexes in Hadoop

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

IBM Text Analytics on Apache Spark

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

Linked Open Data Infrastructure for Public Sector Information: Example from Serbia

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Big Data for Investment Research Management

Redundant Data Removal Technique for Efficient Big Data Search Processing

SNOW LICENSE MANAGER (7.X)... 3

Big Fast Data Hadoop acceleration with Flash. June 2013

Transcription:

Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany

Inclusion Dependencies Examples Customers ID Name Address 1 Tanja Jager Marseiller Str. 12 2 Sandra Möller Flughafenstr. 63 3 Dennis Eberhart Sonnenallee 19 4 Barbara Pabst Ziegelstr. 76 5 Thorsten Mauer Güntzelstr. 90 Customer Quantity ID Orders Customer Quantity 3 CK-242-1 1 3 DF-098-7 1 3 KE-883-6 1 1 LM-437-2 2 5 PE-383-5 1 2

Inclusion Dependencies Examples http://geneontology.org/sites/default/files/public/diag-godb-er.jpg 3

Inclusion Dependencies Example http://www.ibm.com/developerworks/data/library/techarticle/dm-1109proteindatadb2purexml/pdb_scheme_large.jpg 4

Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

Related Work MIND Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53 73, 2009. ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr. 63 3 CK-242-1 1 3 Dennis Eberhart Sonnenallee 19 3 DF-098-7 1 4 Barbara Pabst Ziegelstr. 76 3 KE-883-6 1 5 Thorsten Mauer Güntzelstr. 90 1 LM-437-2 2 5 PE-383-5 1 7

Related Work MIND Value Attributes 1 ID, Customer, Quantity Tanja Jager Name Quantity ID Quantity? Quantity Quantity Marseiller Str. 12 Address 2 ID, Quantity Sandra Möller Flughafenstr. 63 Intersection Name Address ID, Quantity 8

Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006. ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr. 63 3 CK-242-1 1 3 Dennis Eberhart Sonnenallee 19 3 DF-098-7 1 4 Barbara Pabst Ziegelstr. 76 3 KE-883-6 1 5 Thorsten Mauer Güntzelstr. 90 1 LM-437-2 2 5 PE-383-5 1 9

Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006. ID Name Customer Quantity 1 Barbara Pabst 1 CK-242-1 1 2 Dennis Eberhart 3 DF-098-7 2 3 Sandra Möller 5 KE-883-6 4 5 Tanja Jager Thorsten Mauer LM-437-2 PE-383-5 10

Related Work Common proceeding Input Data Full Outer Join Inclusion Dependencies ID Name Addr ID Name Addr Cus Qty 1 T.J. M.12 2 S.M. F.63 Cus Qty 3 D.E. S.19 3 CK 1 4 B.P. Z.76 3 DF 1 5 T.M. G.90 3 KE 1 1 LM 2 5 PE 1 Step 1: 1 1 1 2 2 3 3 4 5 5 T.J. S.M. Calculate full outer join of all attributes Step 2: Quantity ID Customer ID Extract inclusion dependencies 11

Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

SINDY: A distributed discovery algorithm Distributed setting ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus Qty 1 LM 2 5 PE 1 Cus Qty 3 CK 1 3 DF 1 3 KE 1 13

SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 2 ID T.J. Name S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 3 Cus 3 Cus CK DF KE 1 Qty 1 Qty 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 5 ID B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 1 Cus 5 Cus LM PE 2 Qty 1 Qty 5 Cus, ID 14

SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 1 ID, Cus, T.J. Qty Name 2 ID S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 2 ID, Qty CK DF KE 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 3 ID, Cus B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 LM PE 2 Qty 5 Cus, ID 15

SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID, Cus, Qty ID Name Name Addr Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 ID, Qty Cus, ID Name ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus, ID Name Addr Addr Addr Cus Qty 1 LM 2 5 PE 1 Name 16

SINDY: A distributed discovery algorithm Distributed join product ID, Cus, Qty Addr Name ID ID, Cus Name Addr ID, Cus ID, Qty Name 17

SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Addr Name ID ID Cus, Qty Cus ID, Qty Qty ID, Cus Addr Ø Name Ø ID Ø Ø ID, Cus ID, Qty ID Qty ID Ø ID Cus Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 18

SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Cus ID, Qty Qty ID, Cus Addr Addr Ø Name Name Ø ID ID Ø Ø ID, Cus ID, Qty ID Ø Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 19

SINDY: A distributed discovery algorithm Distributed inclusion dependencies Addr Ø ID Ø Qty ID Cus ID Ø Quantity ID Customer ID Name Ø 20

SINDY Variants Inclusion dependencies on combinations of columns (aka n-ary INDs) Adaption: Create cells for combinations of values Powerful in combination with apriori-like proceeding Partial inclusion dependencies Adaption: aggregate IND candidates with multiset union instead of intersection Compare with number of distinct values of dependent column 21

Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

Evaluation Experimental setup Cluster Setup 1 master node (Intel Xeon @ 2x2.67 GHz, 8 GiB RAM) 10 worker nodes (Intel Core 2 Duo @ 2x2.6 GHz, 8 GiB RAM) Apache HDFS 2.2, Apache Flink 0.6.2 Single node (for SPIDER) Intel Xeon @ 8x2GHz, 128 GiB RAM, RAID-1 Datasets Relational datasets from different domains 16 KB to 44.9 GB 23

runtime [s] speed up Evaluation Performance comparison with SPIDER 10000 1000 SINDY SPIDER Speed Up 10 9 8 7 100 6 10 5 4 3 1 0.1 2 1 0 24

runtime [s] Evaluation Scale-Out Behavior 2048 1024 512 256 128 64 32 16 8 1/1 2/2 3/3 4/4 5/5 6/6 7/7 8/8 9/9 10/10 20/10 logicalworkers/physical workers MB-core (5.8 GB) TPC-H (1.4 GB) CATH (907 MB) LOD+ (825 MB) BIOSQLSP (567 MB) WIKIPEDIA (539 MB) CATH (907 MB) CENSUS (111 MB) SCOP (15 MB) COMA (16 KB) 25

Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

Scaling Out the Discovery of Inclusion Dependencies Conclusions Presented new distributed IND discovery algorithm Applicable for unary, n-ary, and partial inclusion dependencies Consists of full outer join calculation and extraction phase Scales well on large datasets Open questions How can one continuously maintain inclusion dependencies? How can the algorithm be applied to similar problems, e.g., RDF data? 27

Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany

Backup Apache Hadoop Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 29

Backup Apache Flink/Spark Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 30

runtime [s] runtime [s] Backup Column and row scaling behavior 1200 1200 1000 MB PDB 1000 MB PDB 800 800 600 600 400 400 200 0 share of rows 200 0 0 1000 2000 # columns 31

#INDs runtime [ms] Backup Evaluation: Wikitables 7,000,000,000 6,000,000,000 5,000,000,000 4,000,000,000 3,000,000,000 2,000,000,000 1,000,000,000 0 #INDs runtime [ms] 2,000,000 1,800,000 1,600,000 1,400,000 1,200,000 1,000,000 800,000 600,000 400,000 200,000 0 0 500,000 1,000,000 1,500,000 #columns 32

Runtime [s] Backup Evaluation: n-ary INDs 90 80 70 60 50 n=6 n=5 n=4 40 n=3 n=2 30 n=1 20 10 0 SCOP COMA CENSUS CATH BIOSQLSP TPC-H remainder 33