Nearest Neighbor Join with Groups and Predicates


1 Nearest Neighbor Join with Groups and Predicates
F. Cafagna, M. Boehlen, A. Bracher
Department of Computer Science, University of Zürich; Agroscope, Switzerland
DOLAP, Melbourne

2 Table of Contents
Francesco Cafagna - University of Zurich - DOLAP
- The Problem
- Related Work: Indexing, Sorting, Segmenting
- The Robust NNJ
- Analytical Evaluation
- Experimental Evaluation

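3 The Problem
Input:
- Outer relation R with attributes G (group, i.e., the feed, such as Pea) and T (time).
- Fact table S with attributes G, T, and P (the nutrient, such as CP), e.g., a tuple s with G = Soy and P = CP.
Query: For computing the Gross Energy, the CP value is needed. What is the Crude Protein (CP) value for each tuple in R?
Output: each tuple of R paired with the matching CP measurement(s) in S. In the example, Pea tuples are paired with Pea CP measurements and Soy tuples with Soy CP measurements.
[Example tables for R, S, and the output: cell values not recoverable from the extracted slide.]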


4 NNJ with Groups and Predicates
Query: What is the Crude Protein value for each row in R?
R ⋈_T [G, P = 'CP'] S
Result: For each r in R, find the tuple(s) in S that:
1. minimize the distance to r.T (closest in time)
2. have the same group as r.G (same feed)
3. satisfy a predicate θ, e.g., P = 'CP' (nutrient is Crude Protein)
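The three conditions above can be read as a direct, quadratic-time definition. The following is a minimal sketch of that definition only, not the evaluation strategy proposed in the slides; the tuple layouts, the `value` attribute, the function name `nnj_naive`, and all sample data are hypothetical.

```python
from collections import namedtuple

# Hypothetical toy schemas: G = group (feed), T = time, P = nutrient.
R = namedtuple("R", ["G", "T"])
S = namedtuple("S", ["G", "T", "P", "value"])

def nnj_naive(outer, fact, theta):
    """For each r, pair it with the s tuples that share r.G, satisfy theta,
    and have the minimal time distance |r.T - s.T| (ties are all kept)."""
    result = []
    for r in outer:
        candidates = [s for s in fact if s.G == r.G and theta(s)]
        if not candidates:
            continue  # no qualifying neighbour for this r
        best = min(abs(r.T - s.T) for s in candidates)
        result.extend((r, s) for s in candidates if abs(r.T - s.T) == best)
    return result

# Made-up sample data.
outer = [R("Pea", 10), R("Soy", 4)]
fact = [
    S("Pea", 9, "CP", 22.0),
    S("Pea", 11, "CP", 23.0),
    S("Pea", 10, "dom", 80.0),  # closest in time, but fails the predicate
    S("Soy", 1, "CP", 40.0),
    S("Soy", 4, "k", 1.9),      # same time, but fails the predicate
    S("Hay", 4, "CP", 9.0),     # wrong group
]
pairs = nnj_naive(outer, fact, lambda s: s.P == "CP")
```

In this toy input, the Pea tuple has two equally close CP measurements, so both are returned.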


5 Related Work (1): Indexing
Yao, B., Li, F., Kumar, P.: K nearest neighbor queries and kNN-joins in large relational databases (almost) for free. ICDE 2010
- Implements an NNJ using an index on the similarity attribute T.
- Helps to efficiently find the nearest measurement in time.
- With a predicate (e.g., P = 'dom') we get many false hits.
Example:
Query: Nearest measurement for (Soy, ) where P = 'dom'?
Problem: The nearest measurement to (Soy, ) may not be a dom value.
How bad? Only 90 among M measurements store a dom value! 000 useless measurements are fetched before finding a dom value.
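The false-hit problem can be illustrated with a small simulation. This is a sketch under assumptions: a list sorted on T stands in for the index, and the function name `nearest_with_false_hits` and the sample rows are hypothetical. It is not the algorithm of the cited paper.

```python
import bisect

def nearest_with_false_hits(fact_sorted, t, g, p):
    """fact_sorted: list of (T, G, P) tuples sorted by T (stand-in for an index on T).
    Walks outward from t in order of time distance and returns
    (first tuple matching group g and nutrient p, or None; number of false hits)."""
    times = [row[0] for row in fact_sorted]
    n = len(fact_sorted)
    right = bisect.bisect_left(times, t)
    left = right - 1
    false_hits = 0
    while left >= 0 or right < n:
        # Pick the closer of the two frontier tuples (left wins ties).
        take_left = right >= n or (
            left >= 0 and t - times[left] <= times[right] - t
        )
        if take_left:
            row = fact_sorted[left]
            left -= 1
        else:
            row = fact_sorted[right]
            right += 1
        if row[1] == g and row[2] == p:
            return row, false_hits
        false_hits += 1  # fetched through the index but useless for this query
    return None, false_hits

# Made-up data: only one 'dom' measurement for Soy, far away in time.
fact_sorted = [
    (1, "Soy", "dom"),
    (3, "Soy", "CP"),
    (4, "Pea", "CP"),
    (5, "Soy", "k"),
    (6, "Soy", "na"),
]
match, false_hits = nearest_with_false_hits(fact_sorted, 5, "Soy", "dom")
```

Here four tuples are inspected and discarded before the only dom measurement is reached, which mirrors the slide's point that a rare predicate value makes an index on T alone expensive.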


6 Related Work (2): Segmenting & Sorting
Galindo-Legaria, Joshi: Orthogonal optimization of subqueries and aggregation. SIGMOD 2001
Silva, Aref, Ali: The similarity join database operator. ICDE 2010
SegmentApply generalizes an operator w/o groups. For each group in R:
1. fetch from R and S the tuples of that group
2. apply a SortMerge NNJ to those tuples
Multiple accesses to the fact table cause redundant fetches.
Example:
How: compute an NNJ for Pea, then an NNJ for Soy.
Problem: Blocks 1, 2, 3, 6, and 8 are fetched twice!
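Why per-group processing refetches blocks can be sketched by counting block accesses. The block layout below is invented for illustration and does not reproduce the slide's figure; `segment_apply_block_fetches` is a hypothetical helper, not part of SegmentApply itself.

```python
def segment_apply_block_fetches(blocks, groups):
    """blocks: dict block_id -> list of group labels stored in that block.
    Returns dict block_id -> number of fetches when every group in `groups`
    is processed separately and reads each block that holds one of its tuples."""
    fetches = {}
    for g in groups:
        for block_id, labels in blocks.items():
            if g in labels:
                fetches[block_id] = fetches.get(block_id, 0) + 1
    return fetches

# Made-up layout: blocks 1 and 2 mix Pea and Soy tuples.
blocks = {
    1: ["Pea", "Soy"],
    2: ["Soy", "Pea"],
    3: ["Hay", "Pea"],
    4: ["Whey", "Hay"],
}
fetches = segment_apply_block_fetches(blocks, ["Pea", "Soy"])
redundant = sorted(b for b, n in fetches.items() if n > 1)
```

Any block that stores tuples of more than one relevant group is read once per group, so the number of redundant fetches depends on how the fact table happens to be laid out.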


7 Robust NNJ Query Tree (1)
Problem: Searching the entire fact table in the Swiss Feed Data Warehouse takes min (Agroscope has lots of data :-)
Goal: Minimize the number of accesses to the fact table.
Idea: Use the groups of the query to limit the fact table to its relevant portions.
How: Make at most one access to the fact table and eliminate tuples with unneeded groups.
Example: It is useless to access tuples whose group is not in {Pea, Soy}.


8 Robust NNJ Query Tree (2)
Problem: Processing groups separately causes redundant fetches.
Idea: Sorting one big set of tuples w/o redundancies is more efficient than sorting many small sets with redundancies.
How: After the relevant portions are fetched, our roNNJ:
1. Sorts the tuples by G and T, so that tuples are clustered
2. Computes the join with one scan of the two inputs
Example: Each block is accessed only once.
[Figure: the input blocks of S and the merge over σ_{G ∈ π_G(R)}(S).]


9 Robust NNJ Query Tree (3)
Problem: Useless tuples are fetched multiple times.
Goal: Avoid useless backtracking, i.e., tuples that are not nearest neighbours should never be refetched.
How: Always eliminate tuples not satisfying predicate θ
- after the selection on the groups, so that the query tree can use an index on G
- before the SortMerge, so that the nearest neighbours of r_i are all adjacent in S
Example: Eliminate the tuples that fail the predicate on P before the SortMerge; for (Pea, ) only the nearest neighbours are refetched in S.


10 Robust NNJ Query Tree (4)
This query tree uses the left subtree (relation R) to limit the right subtree (the fact table S) to its relevant portions.
Query tree: Merge[G]( sort_{G,T}(R), sort_{G,T}( σ_θ( σ_{G ∈ π_G(R)}(S) ) ) )
1. The left branch fetches R and sorts it by G, T.
2. In the right branch, the fact table S is accessed once: make one scan of the fact table and keep only its relevant tuples, i.e., the tuples with the groups of R.
3. Evaluate θ after the selection on the group (otherwise an index on G is not useful) but before the join (so that no useless backtracking is computed in the join).
4. Sort the relevant portion by G, T and compute the NNJ at the top.
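The four steps can be sketched in memory as follows. This is an illustrative reading of the query tree, not the authors' implementation: tuples are plain Python tuples, `ronnj_sketch` is a hypothetical name, the group filter is a set lookup rather than an index on G, and the sample data is made up.

```python
def ronnj_sketch(outer, fact, theta):
    """outer: list of (G, T); fact: list of tuples starting with (G, T, ...).
    Returns (r, s) pairs where s is a nearest neighbour of r in time
    within the same group and satisfies theta."""
    groups = {g for g, _ in outer}                      # pi_G(R)
    # One pass over the fact table: group filter first, then the predicate.
    relevant = [s for s in fact if s[0] in groups and theta(s)]
    outer_sorted = sorted(outer)                        # sort R by G, T
    relevant.sort(key=lambda s: (s[0], s[1]))           # sort relevant portion by G, T
    result, j, n = [], 0, len(relevant)
    for g, t in outer_sorted:
        while j < n and relevant[j][0] < g:             # skip earlier groups
            j += 1
        if j == n or relevant[j][0] != g:
            continue                                    # no qualifying tuple for this group
        # Advance while the next tuple of the group is at least as close.
        while (j + 1 < n and relevant[j + 1][0] == g
               and abs(relevant[j + 1][1] - t) <= abs(relevant[j][1] - t)):
            j += 1
        best = abs(relevant[j][1] - t)
        # Backtrack only over the adjacent tuples that are also nearest neighbours.
        k = j
        while k >= 0 and relevant[k][0] == g and abs(relevant[k][1] - t) == best:
            result.append(((g, t), relevant[k]))
            k -= 1
    return result

outer = [("Soy", 4), ("Pea", 10)]
fact = [
    ("Pea", 9, "CP"), ("Pea", 11, "CP"), ("Pea", 10, "dom"),
    ("Soy", 1, "CP"), ("Soy", 4, "k"), ("Hay", 4, "CP"),
]
pairs = ronnj_sketch(outer, fact, lambda s: s[2] == "CP")
```

Because the predicate is applied before sorting, the tuples with equal minimal distance are contiguous in the sorted input, so the backward loop touches nearest neighbours only.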


11 Analysis
The critical parameters for an NNJ query tree are:
1. the predicate selectivity
2. the group selectivity
Why?
- A low predicate selectivity implies (many) false hits. Example (a predicate on Nutrient): 20 false hits per tuple.
- A high group selectivity implies (many) redundant fetches. Example: π_G(R) = {Soy, Pea, Beans, ...} (i.e., all Legume seeds): each block is redundantly fetched up to 20 times.


12 Predicate Selectivity
Let Θ be the predicate selectivity, i.e., the fraction of tuples satisfying predicate θ.
Example: The tuples satisfying θ are {(Pea, ), (Pea, 3), (Pea, 7), (Soy, 3), (Soy, )}, so the predicate selectivity is Θ = 5/10 = 0.5.
Swiss Feed DW (predicate on P): Θ = 0.004


13 Group Selectivity
Let $\Psi = \frac{|\pi_G(R) \cap \pi_G(S)|}{|\pi_G(S)|}$ be the group selectivity, i.e., the fraction of groups that are relevant.
Example: Grouping attribute: G, with groups Hay, Pea, Soy, and Whey in S. Relevant groups = {Pea, Soy}. Group selectivity Ψ = 2/4 = 0.5.
Swiss Feed DW: π_G(R) = {Soy, Pea, Beans, ...}, Ψ = 3%
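Both selectivities are simple ratios, and a small sketch makes the definitions concrete. The function names and the sample data are hypothetical; the group example reuses the four groups Hay, Pea, Soy, and Whey from the slide.

```python
def predicate_selectivity(fact, theta):
    """Fraction of fact tuples that satisfy the predicate theta."""
    return sum(1 for s in fact if theta(s)) / len(fact)

def group_selectivity(outer_groups, fact_groups):
    """Fraction of the fact table's distinct groups that also occur in the outer relation."""
    fact_set = set(fact_groups)
    return len(set(outer_groups) & fact_set) / len(fact_set)

# Groups as on the slide: S holds Hay, Pea, Soy, Whey; the query needs Pea and Soy.
psi = group_selectivity(["Pea", "Soy", "Pea"], ["Hay", "Pea", "Soy", "Whey", "Pea"])

# Made-up nutrient column: 2 of 8 measurements are CP.
nutrients = ["CP", "k", "na", "ca", "CP", "om", "vit.c", "k"]
theta_sel = predicate_selectivity(nutrients, lambda p: p == "CP")
```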


14 Evaluation & Experiments (1)
Query tree: Merge[G]( sort_{G,T}(R), sort_{G,T}( σ_θ( σ_{G ∈ π_G(R)}(S) ) ) )
Runtime table (columns |S|, Θ, Ψ, Time; surviving cell values only): M, 26 sec; 0M, min; 90M, h; 90M, sec
Predicate selectivity Θ and group selectivity Ψ:
Costs on fact table S = Fetch(S) + Sort(relevant portion)
$Cost_S = |S| + (\Theta\Psi|S|)\log(\Theta\Psi|S|)$
Θ and Ψ effectively reduce the amount of data to sort.
Example:
- Nutrient = CP: Θ = 0.4%
- π_G(R) = {Soy, Pea, Beans, ...}: Ψ = 3%
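The cost formula can be evaluated directly to see how much the two selectivities shrink the sort term. This is a sketch under assumptions: the slide does not state the base of the logarithm, so base 2 is assumed here, and the function name and input sizes are hypothetical round numbers rather than the experimental settings.

```python
import math

def fact_table_cost(size_s, theta_sel, psi_sel):
    """|S| + (Theta * Psi * |S|) * log2(Theta * Psi * |S|).
    The log base is an assumption; the slide only writes 'log'."""
    relevant = theta_sel * psi_sel * size_s
    sort_cost = relevant * math.log2(relevant) if relevant > 1 else 0.0
    return size_s + sort_cost

full_sort = fact_table_cost(1024, 1.0, 1.0)   # every tuple is relevant
filtered = fact_table_cost(1024, 0.5, 0.5)    # a quarter of the tuples remain
```

With these inputs the fetch term stays at 1024 in both cases, while the sort term drops from 1024 · 10 to 256 · 8.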


15 Evaluation & Experiments (2)
Setting: |S| = 90M; predicate selectivity = 0.4%; group selectivity = 3%
[Plot: runtime [sec] vs. # relevant tuples [k] for SegApply, roNNJ, and B-Tree]
When the # of relevant tuples among the 90M of the fact table grows:
- The # of index false hits of the B-Tree stays the same.
- The # of blocks that SegApply fetches redundantly grows (> 900k tuples: sorting on disk).
- roNNJ fetches and sorts S once (always sorting on disk but no redundant fetches).


16 Evaluation & Experiments (3)
Setting: group selectivity = 3%
[Plot: runtime [sec] vs. pred. selectivity [%] for AntiJ, B-Tree, SegApply, and roNNJ]
When the predicate gets more selective:
- The # of index false hits of the B-Tree grows; for a predicate that is (almost) never true, checking all possible combinations (i.e., Antijoin) is cheaper than computing false hits.
- SegApply and roNNJ sort fewer tuples and improve.


17 Evaluation & Experiments (4)
Setting: predicate selectivity = 0.4%
[Plot: runtime [sec] vs. group selectivity [%] for B-Tree, SegApply, and roNNJ]
When the # of involved groups grows:
- The # of index false hits of the B-Tree stays the same, but caching is less effective for the false hits.
- SegApply computes more NNJs, so the probability of redundant fetches is higher.
- The # of tuples for roNNJ to sort increases, but there are no redundant fetches or false hits.


18 Conclusions & Future Work
We have proposed the Nearest Neighbour Join with Groups and Predicates.
- Its query tree uses the groups of the left subtree to limit the portion of the right subtree (i.e., the fact table) to fetch.
- Its evaluation algorithm, roNNJ, is independent of the physical organization of the fact table.
- It does not suffer from index false hits or redundant fetches.
Future Work: NNJ with similarity on timestamps with different granularities (e.g., instead of a complete date, only the month or season may be available).


