Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany
Inclusion Dependencies Examples Customers ID Name Address 1 Tanja Jager Marseiller Str. 12 2 Sandra Möller Flughafenstr. 63 3 Dennis Eberhart Sonnenallee 19 4 Barbara Pabst Ziegelstr. 76 5 Thorsten Mauer Güntzelstr. 90 Customer Quantity ID Orders Customer Quantity 3 CK-242-1 1 3 DF-098-7 1 3 KE-883-6 1 1 LM-437-2 2 5 PE-383-5 1 2
Inclusion Dependencies Examples http://geneontology.org/sites/default/files/public/diag-godb-er.jpg 3
Inclusion Dependencies Example http://www.ibm.com/developerworks/data/library/techarticle/dm-1109proteindatadb2purexml/pdb_scheme_large.jpg 4
Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
Related Work MIND Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53 73, 2009. ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr. 63 3 CK-242-1 1 3 Dennis Eberhart Sonnenallee 19 3 DF-098-7 1 4 Barbara Pabst Ziegelstr. 76 3 KE-883-6 1 5 Thorsten Mauer Güntzelstr. 90 1 LM-437-2 2 5 PE-383-5 1 7
Related Work MIND Value Attributes 1 ID, Customer, Quantity Tanja Jager Name Quantity ID Quantity? Quantity Quantity Marseiller Str. 12 Address 2 ID, Quantity Sandra Möller Flughafenstr. 63 Intersection Name Address ID, Quantity 8
Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006. ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr. 63 3 CK-242-1 1 3 Dennis Eberhart Sonnenallee 19 3 DF-098-7 1 4 Barbara Pabst Ziegelstr. 76 3 KE-883-6 1 5 Thorsten Mauer Güntzelstr. 90 1 LM-437-2 2 5 PE-383-5 1 9
Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006. ID Name Customer Quantity 1 Barbara Pabst 1 CK-242-1 1 2 Dennis Eberhart 3 DF-098-7 2 3 Sandra Möller 5 KE-883-6 4 5 Tanja Jager Thorsten Mauer LM-437-2 PE-383-5 10
Related Work Common proceeding Input Data Full Outer Join Inclusion Dependencies ID Name Addr ID Name Addr Cus Qty 1 T.J. M.12 2 S.M. F.63 Cus Qty 3 D.E. S.19 3 CK 1 4 B.P. Z.76 3 DF 1 5 T.M. G.90 3 KE 1 1 LM 2 5 PE 1 Step 1: 1 1 1 2 2 3 3 4 5 5 T.J. S.M. Calculate full outer join of all attributes Step 2: Quantity ID Customer ID Extract inclusion dependencies 11
Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
SINDY: A distributed discovery algorithm Distributed setting ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus Qty 1 LM 2 5 PE 1 Cus Qty 3 CK 1 3 DF 1 3 KE 1 13
SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 2 ID T.J. Name S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 3 Cus 3 Cus CK DF KE 1 Qty 1 Qty 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 5 ID B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 1 Cus 5 Cus LM PE 2 Qty 1 Qty 5 Cus, ID 14
SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 1 ID, Cus, T.J. Qty Name 2 ID S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 2 ID, Qty CK DF KE 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 3 ID, Cus B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 LM PE 2 Qty 5 Cus, ID 15
SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID, Cus, Qty ID Name Name Addr Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 ID, Qty Cus, ID Name ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus, ID Name Addr Addr Addr Cus Qty 1 LM 2 5 PE 1 Name 16
SINDY: A distributed discovery algorithm Distributed join product ID, Cus, Qty Addr Name ID ID, Cus Name Addr ID, Cus ID, Qty Name 17
SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Addr Name ID ID Cus, Qty Cus ID, Qty Qty ID, Cus Addr Ø Name Ø ID Ø Ø ID, Cus ID, Qty ID Qty ID Ø ID Cus Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 18
SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Cus ID, Qty Qty ID, Cus Addr Addr Ø Name Name Ø ID ID Ø Ø ID, Cus ID, Qty ID Ø Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 19
SINDY: A distributed discovery algorithm Distributed inclusion dependencies Addr Ø ID Ø Qty ID Cus ID Ø Quantity ID Customer ID Name Ø 20
SINDY Variants Inclusion dependencies on combinations of columns (aka n-ary INDs) Adaption: Create cells for combinations of values Powerful in combination with apriori-like proceeding Partial inclusion dependencies Adaption: aggregate IND candidates with multiset union instead of intersection Compare with number of distinct values of dependent column 21
Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
Evaluation Experimental setup Cluster Setup 1 master node (Intel Xeon @ 2x2.67 GHz, 8 GiB RAM) 10 worker nodes (Intel Core 2 Duo @ 2x2.6 GHz, 8 GiB RAM) Apache HDFS 2.2, Apache Flink 0.6.2 Single node (for SPIDER) Intel Xeon @ 8x2GHz, 128 GiB RAM, RAID-1 Datasets Relational datasets from different domains 16 KB to 44.9 GB 23
runtime [s] speed up Evaluation Performance comparison with SPIDER 10000 1000 SINDY SPIDER Speed Up 10 9 8 7 100 6 10 5 4 3 1 0.1 2 1 0 24
runtime [s] Evaluation Scale-Out Behavior 2048 1024 512 256 128 64 32 16 8 1/1 2/2 3/3 4/4 5/5 6/6 7/7 8/8 9/9 10/10 20/10 logicalworkers/physical workers MB-core (5.8 GB) TPC-H (1.4 GB) CATH (907 MB) LOD+ (825 MB) BIOSQLSP (567 MB) WIKIPEDIA (539 MB) CATH (907 MB) CENSUS (111 MB) SCOP (15 MB) COMA (16 KB) 25
Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
Scaling Out the Discovery of Inclusion Dependencies Conclusions Presented new distributed IND discovery algorithm Applicable for unary, n-ary, and partial inclusion dependencies Consists of full outer join calculation and extraction phase Scales well on large datasets Open questions How can one continuously maintain inclusion dependencies? How can the algorithm be applied to similar problems, e.g., RDF data? 27
Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany
Backup Apache Hadoop Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 29
Backup Apache Flink/Spark Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 30
runtime [s] runtime [s] Backup Column and row scaling behavior 1200 1200 1000 MB PDB 1000 MB PDB 800 800 600 600 400 400 200 0 share of rows 200 0 0 1000 2000 # columns 31
#INDs runtime [ms] Backup Evaluation: Wikitables 7,000,000,000 6,000,000,000 5,000,000,000 4,000,000,000 3,000,000,000 2,000,000,000 1,000,000,000 0 #INDs runtime [ms] 2,000,000 1,800,000 1,600,000 1,400,000 1,200,000 1,000,000 800,000 600,000 400,000 200,000 0 0 500,000 1,000,000 1,500,000 #columns 32
Runtime [s] Backup Evaluation: n-ary INDs 90 80 70 60 50 n=6 n=5 n=4 40 n=3 n=2 30 n=1 20 10 0 SCOP COMA CENSUS CATH BIOSQLSP TPC-H remainder 33