Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany

Size: px

Start display at page:

Download "Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany"

Isabel Ray
8 years ago
Views:

1 Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany

2 Inclusion Dependencies Examples Customers ID Name Address 1 Tanja Jager Marseiller Str Sandra Möller Flughafenstr Dennis Eberhart Sonnenallee 19 4 Barbara Pabst Ziegelstr Thorsten Mauer Güntzelstr. 90 Customer Quantity ID Orders Customer Quantity 3 CK DF KE LM PE

63 3 Dennis Eberhart Sonnenallee 19 4 Barbara Pabst Ziegelstr.

3 Inclusion Dependencies Examples 3

4 Inclusion Dependencies Example 4

5 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

6 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

7 Related Work MIND Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53 73, ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr CK Dennis Eberhart Sonnenallee 19 3 DF Barbara Pabst Ziegelstr KE Thorsten Mauer Güntzelstr LM PE

Journal of Intelligent Information Systems, 32:53 73, 2009. ID Name Address 1 Tanja Jager Marseiller Str.

8 Related Work MIND Value Attributes 1 ID, Customer, Quantity Tanja Jager Name Quantity ID Quantity? Quantity Quantity Marseiller Str. 12 Address 2 ID, Quantity Sandra Möller Flughafenstr. 63 Intersection Name Address ID, Quantity 8

9 Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr CK Dennis Eberhart Sonnenallee 19 3 DF Barbara Pabst Ziegelstr KE Thorsten Mauer Güntzelstr LM PE

ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr.

10 Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, ID Name Customer Quantity 1 Barbara Pabst 1 CK Dennis Eberhart 3 DF Sandra Möller 5 KE Tanja Jager Thorsten Mauer LM PE

Related Work Common proceeding Input Data Full Outer Join Inclusion Dependencies ID Name Addr ID Name Addr Cus Qty 1 T.J. M.12 2 S.M. F.63 Cus Qty 3 D.E. S.19 3 CK 1 4 B.P. Z.

11 Related Work Common proceeding Input Data Full Outer Join Inclusion Dependencies ID Name Addr ID Name Addr Cus Qty 1 T.J. M.12 2 S.M. F.63 Cus Qty 3 D.E. S.19 3 CK 1 4 B.P. Z.76 3 DF 1 5 T.M. G.90 3 KE 1 1 LM 2 5 PE 1 Step 1: T.J. S.M. Calculate full outer join of all attributes Step 2: Quantity ID Customer ID Extract inclusion dependencies 11

12 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

13 SINDY: A distributed discovery algorithm Distributed setting ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus Qty 1 LM 2 5 PE 1 Cus Qty 3 CK 1 3 DF 1 3 KE 1 13

14 SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 2 ID T.J. Name S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 3 Cus 3 Cus CK DF KE 1 Qty 1 Qty 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 5 ID B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 1 Cus 5 Cus LM PE 2 Qty 1 Qty 5 Cus, ID 14

M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.

15 SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 1 ID, Cus, T.J. Qty Name 2 ID S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 2 ID, Qty CK DF KE 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 3 ID, Cus B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 LM PE 2 Qty 5 Cus, ID 15

19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 2 ID, Qty CK DF KE 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.

16 SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID, Cus, Qty ID Name Name Addr Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 ID, Qty Cus, ID Name ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus, ID Name Addr Addr Addr Cus Qty 1 LM 2 5 PE 1 Name 16

17 SINDY: A distributed discovery algorithm Distributed join product ID, Cus, Qty Addr Name ID ID, Cus Name Addr ID, Cus ID, Qty Name 17

18 SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Addr Name ID ID Cus, Qty Cus ID, Qty Qty ID, Cus Addr Ø Name Ø ID Ø Ø ID, Cus ID, Qty ID Qty ID Ø ID Cus Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 18

Addr Ø Name Ø ID Ø Ø ID, Cus ID, Qty ID Qty ID Ø ID Cus Qty ID

19 SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Cus ID, Qty Qty ID, Cus Addr Addr Ø Name Name Ø ID ID Ø Ø ID, Cus ID, Qty ID Ø Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 19

Ø Name Name Ø ID ID Ø Ø ID, Cus ID, Qty ID Ø Qty ID Cus ID

20 SINDY: A distributed discovery algorithm Distributed inclusion dependencies Addr Ø ID Ø Qty ID Cus ID Ø Quantity ID Customer ID Name Ø 20

21 SINDY Variants Inclusion dependencies on combinations of columns (aka n-ary INDs) Adaption: Create cells for combinations of values Powerful in combination with apriori-like proceeding Partial inclusion dependencies Adaption: aggregate IND candidates with multiset union instead of intersection Compare with number of distinct values of dependent column 21

22 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

23 Evaluation Experimental setup Cluster Setup 1 master node (Intel 2x2.67 GHz, 8 GiB RAM) 10 worker nodes (Intel Core 2 2x2.6 GHz, 8 GiB RAM) Apache HDFS 2.2, Apache Flink Single node (for SPIDER) Intel 8x2GHz, 128 GiB RAM, RAID-1 Datasets Relational datasets from different domains 16 KB to 44.9 GB 23

24 runtime [s] speed up Evaluation Performance comparison with SPIDER SINDY SPIDER Speed Up

25 runtime [s] Evaluation Scale-Out Behavior /1 2/2 3/3 4/4 5/5 6/6 7/7 8/8 9/9 10/10 20/10 logicalworkers/physical workers MB-core (5.8 GB) TPC-H (1.4 GB) CATH (907 MB) LOD+ (825 MB) BIOSQLSP (567 MB) WIKIPEDIA (539 MB) CATH (907 MB) CENSUS (111 MB) SCOP (15 MB) COMA (16 KB) 25

26 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions

27 Scaling Out the Discovery of Inclusion Dependencies Conclusions Presented new distributed IND discovery algorithm Applicable for unary, n-ary, and partial inclusion dependencies Consists of full outer join calculation and extraction phase Scales well on large datasets Open questions How can one continuously maintain inclusion dependencies? How can the algorithm be applied to similar problems, e.g., RDF data? 27

28 Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany

29 Backup Apache Hadoop Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 29

30 Backup Apache Flink/Spark Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 30

31 runtime [s] runtime [s] Backup Column and row scaling behavior MB PDB 1000 MB PDB share of rows # columns 31

32 #INDs runtime [ms] Backup Evaluation: Wikitables 7,000,000,000 6,000,000,000 5,000,000,000 4,000,000,000 3,000,000,000 2,000,000,000 1,000,000,000 0 #INDs runtime [ms] 2,000,000 1,800,000 1,600,000 1,400,000 1,200,000 1,000, , , , , ,000 1,000,000 1,500,000 #columns 32

33 Runtime [s] Backup Evaluation: n-ary INDs n=6 n=5 n=4 40 n=3 n=2 30 n= SCOP COMA CENSUS CATH BIOSQLSP TPC-H remainder 33

Efficiently Identifying Inclusion Dependencies in RDBMS

Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany bauckmann@informatik.hu-berlin.de