Exploiting A Support-based Upper Bound of Pearson's Correlation Coefficient for Efficiently Identifying Strongly Correlated Pairs

Hui Xiong, Computer Science, University of Minnesota, huix@cs.umn.edu
Shashi Shekhar, Computer Science, University of Minnesota, shekhar@cs.umn.edu
Pang-Ning Tan, Computer Science, Michigan State University, ptan@cse.msu.edu
Vipin Kumar, Computer Science, University of Minnesota, kumar@cs.umn.edu

ABSTRACT
Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs correlation query (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, the number of items in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining

Keywords
Pearson's Correlation Coefficient, Statistical Computing.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'04, August 22-25, 2004, Seattle, Washington, USA. Copyright 2004 ACM 1-58113-888-1/04/0008...$5.00.

1. INTRODUCTION
With the widespread use of statistical techniques for data analysis, it is expected that many such techniques will be made available in a database environment where users can apply the techniques more flexibly, efficiently, easily, and with minimal mathematical assumptions. Our research is directed towards developing such techniques. More specifically, this paper examines the problem of computing correlations efficiently from large databases.

Correlation analysis plays an important role in many application domains such as market-basket analysis, climate studies, and public health. Our focus, however, is on computing an all-strong-pairs correlation query that returns pairs of highly positively correlated items (or binary attributes). This problem can be formalized as follows: Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the minimum correlation threshold, θ.

However, as the number of items and transactions increases, the computation cost for an all-strong-pairs correlation query becomes prohibitively expensive. For example, consider a database of $10^6$ items, which may represent the collection of books available at an e-commerce Web site. Answering the all-strong-pairs correlation query from such a massive database requires computing the correlations of $\binom{10^6}{2} \approx 0.5 \times 10^{12}$ possible item pairs.
Thus, it may not be computationally feasible to apply a brute-force approach to compute correlations for all half trillion item pairs, particularly when the number of transactions is also large.

Note that the all-strong-pairs correlation query problem is different from the standard association-rule mining problem [, 3, 5, 9, 4]. Given a set of transactions, the objective of association rule mining is to extract all subsets of items that satisfy a minimum support threshold. Support measures the fraction of transactions that contain a particular subset of items. The notions of support and correlation may not necessarily agree with each other. This is because item pairs with high support may be poorly correlated while those that are highly correlated may have very low support. For instance, suppose we have an item pair {A, B}, where supp(A) = supp(B) = 0.8 and supp(A, B) = 0.64. Both items are uncorrelated because supp(A, B) = supp(A)supp(B). In contrast, an item pair {A, B} for which supp(A) = supp(B) = supp(A, B) is a very small value is perfectly correlated despite its low support. Patterns with low support but high correlation are useful for capturing interesting associations among rare anomalous events or rare but expensive items such as gold necklaces and earrings.

In this paper, we focus on the efficient computation of statistical correlation for all pairs of items with high positive correlation. More specifically, we provide an upper bound of Pearson's correlation coefficient for binary variables. The computation of this upper bound is much cheaper than the computation of the exact correlation, since this upper bound can be computed as a function of the support of individual items. Furthermore, we show that this upper bound has a special monotone property which allows elimination of many item pairs even without computing their upper bounds, as shown in Figure 1. The x-axis in the figure represents the set of items having a lower level of support than the support for item $x_i$. These items are sorted from left to right in decreasing order of their individual support values. The y-axis indicates the correlation between each item x and item $x_i$. Upperbound($x_i$, x) represents the upper bound of correlation($x_i$, x) and has a monotone decreasing behavior. This behavior guarantees that an item pair ($x_i$, $x_k$) can be pruned if there exists an item $x_j$ such that upperbound($x_i$, $x_j$) < θ and supp($x_k$) < supp($x_j$).

[Figure 1: Illustration of the Filtering Techniques. The curves upperbound($x_i$, x) and correlation($x_i$, x) are plotted over the items $x_{i+1}, \ldots, x_j, \ldots, x_k, \ldots, x_n$, sorted in descending order by supp(item). (The curves are only used for illustration purposes.)]

A Two-step All-strong-Pairs correlation query (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner which consists of two steps: filtering and refinement. In the filtering step, many item pairs are filtered out using the easy-to-compute upperbound($x_i$, x) and its monotone property. In the refinement step, the exact correlation is computed for the remaining pairs to determine the final query result. In addition, we have proved the completeness and correctness of the TAPER algorithm and provided an algebraic cost model to quantify its computational savings.
As demonstrated by our experiments on both real and synthetic data sets, TAPER can be an order of magnitude faster than brute-force alternatives, and the computational savings of TAPER are independent of, or improve with, the number of items in data sets with common Zipf [8] or linear rank-support distributions.

1.1 Related Work
Related literature can be grouped into two categories. One category has focused on statistical correlation measures. Jermaine [] investigated the implication of incorporating chi-square ($\chi^2$) [5] based queries to data cube computations. He showed that finding the subcubes that satisfy statistical tests such as $\chi^2$ is inherently NP-hard, but can be made more tractable using approximation schemes. Also, Jermaine presented an iterative procedure for high-dimensional correlation analysis by shaving off part of the database via feedback from human experts []. Finally, Brin [3] proposed a $\chi^2$-based correlation rule mining strategy. However, $\chi^2$ does not possess a desired upward closure property for exploiting efficient computation [7].

In this paper, we focus on the efficient computation of statistical correlation for all pairs of items with high positive correlation. Given n items, a traditional brute force approach computes Pearson's correlation coefficient for all $\binom{n}{2} = \frac{n(n-1)}{2}$ item pairs. This approach is often implemented using matrix algebra in statistical software packages as the correlation matrix [] function, which computes Pearson's correlation coefficient for all pairs of columns. This approach is applicable to, but not efficient for, the case of Boolean matrices, which can model market-basket-type data sets. The approach proposed in this paper does not need to compute all $\binom{n}{2}$ pairs. In particular, for market-basket-type data sets with a Zipf-like rank-support distribution, we show that only a small portion of the item pairs needs to be examined. In the real world, Zipf-like distributions have been observed in a variety of application domains, such as retail data and Web click-streams.
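For reference, the brute-force baseline described above can be sketched in a few lines. This is only an illustration under stated assumptions: the database is held as one 0/1 column per item, every item has support strictly between 0 and 1, and the names `pearson_binary` and `brute_force_pairs` as well as the toy data are hypothetical rather than taken from the paper or from any statistical package.

```python
from itertools import combinations
from math import sqrt

def pearson_binary(col_x, col_y):
    """Pearson's correlation of two equal-length 0/1 columns."""
    n = len(col_x)
    sx, sy = sum(col_x), sum(col_y)
    sxy = sum(x * y for x, y in zip(col_x, col_y))
    # For 0/1 data the sum of squares equals the plain sum.
    denom = sqrt((n * sx - sx * sx) * (n * sy - sy * sy))
    return (n * sxy - sx * sy) / denom

def brute_force_pairs(columns, theta):
    """Compute the correlation of every item pair and keep those >= theta."""
    result = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson_binary(columns[a], columns[b])
        if r >= theta:
            result.append((a, b, r))
    return result

# Hypothetical toy data: one 0/1 column per item, one entry per transaction.
columns = {
    "A": [1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 0, 0],
    "C": [0, 0, 1, 1, 1, 1],
}
strong = brute_force_pairs(columns, theta=0.5)
```

Every one of the n(n-1)/2 pairs is touched here regardless of the threshold, which is the cost the filter-and-refine strategy aims to avoid.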
Another category of related work is from the association-rule mining framework [], namely constraint-based association pattern mining [, 4, 6, 8, 3]. Instead of using statistical correlation measures as the constraints, these approaches use some other measures (constraints), such as support, lift, and the Jaccard measure, for efficiently pruning the pattern search space and identifying interesting patterns.

1.2 Overview and Scope
The remainder of this paper is organized as follows. Section 2 presents basic concepts. In Section 3, we introduce the upper bound of Pearson's correlation coefficient for binary variables. Section 4 proposes the TAPER algorithm. In Section 5, we analyze the TAPER algorithm in the areas of completeness, correctness, and computation gain. Section 6 presents the experimental results. Finally, in Section 7, we draw conclusions and suggest future work.

The scope of the all-strong-pairs correlation query problem proposed in this paper is restricted to market basket databases with binary variables, and the correlation computational form is Pearson's correlation coefficient for binary variables, which is also called the φ correlation coefficient. Furthermore, we assume that the support of items is between 0 and 1 but not equal to either 0 or 1. These boundary cases can be handled separately.

2. PEARSON'S CORRELATION
In statistics, a measure of association is a numerical index which describes the strength or magnitude of a relationship among variables. Although literally dozens of measures exist, they can be categorized into two broad groups: ordinal and nominal. Relationships among ordinal variables can be analyzed with ordinal measures of association such as Kendall's Tau and Spearman's Rank Correlation Coefficient. In contrast, relationships among nominal variables can be analyzed with nominal measures of association such as Pearson's Correlation Coefficient, the Odds Ratio, and measures based on Chi Square [5]. The φ correlation coefficient [5] is the computation form of Pearson's Correlation Coefficient for binary variables.
In this section, we describe the φ correlation coefficient and show how it can be computed using the support measure of association-rule mining [].

In a two-way table shown in Figure 2, the calculation of the φ correlation coefficient reduces to

$$\phi = \frac{P_{(00)}P_{(11)} - P_{(01)}P_{(10)}}{\sqrt{P_{(0+)}P_{(1+)}P_{(+0)}P_{(+1)}}}, \tag{1}$$

where $P_{(ij)}$, for i = 0, 1 and j = 0, 1, denotes the number of samples which are classified in the ith row and jth column of the table. Furthermore, we let $P_{(i+)}$ denote the total number of samples classified in the ith row, and we let $P_{(+j)}$ denote the total number of samples classified in the jth column. Thus, $P_{(i+)} = \sum_{j=0}^{1} P_{(ij)}$ and $P_{(+j)} = \sum_{i=0}^{1} P_{(ij)}$.

                 B = 0      B = 1      Row Total
A = 0            P(00)      P(01)      P(0+)
A = 1            P(10)      P(11)      P(1+)
Column Total     P(+0)      P(+1)      N

Figure 2: A two-way table of item A and item B.

In the two-way table, N is the total number of samples. Furthermore, we can transform Equation 1 as follows.

$$\phi = \frac{(N - P_{(01)} - P_{(10)} - P_{(11)})P_{(11)} - P_{(01)}P_{(10)}}{\sqrt{P_{(0+)}P_{(1+)}P_{(+0)}P_{(+1)}}}$$

$$\phi = \frac{N P_{(11)} - (P_{(11)} + P_{(10)})(P_{(01)} + P_{(11)})}{\sqrt{P_{(0+)}P_{(1+)}P_{(+0)}P_{(+1)}}}$$

$$\phi = \frac{\frac{P_{(11)}}{N} - \frac{P_{(1+)}}{N}\frac{P_{(+1)}}{N}}{\sqrt{\frac{P_{(0+)}}{N}\frac{P_{(1+)}}{N}\frac{P_{(+0)}}{N}\frac{P_{(+1)}}{N}}}$$

Hence, when adopting the support measure of association rule mining [], for two items A and B in a market basket database, we have supp(A) = $P_{(1+)}/N$, supp(B) = $P_{(+1)}/N$, and supp(A, B) = $P_{(11)}/N$. With support notations and the above new derivations of Equation 1, we can derive the support form of the φ correlation coefficient as shown below in Equation 2.

$$\phi = \frac{supp(A, B) - supp(A)\,supp(B)}{\sqrt{supp(A)\,supp(B)\,(1 - supp(A))\,(1 - supp(B))}} \tag{2}$$

3. PROPERTIES OF φ CORRELATION
In this section, we present some properties of the φ correlation coefficient. These properties are useful for the efficient computation of all-strong-pairs correlation queries.

3.1 An Upper Bound
In this subsection, we reveal that the support measure is closely related to the φ correlation coefficient. Specifically, we prove that an upper bound of the φ correlation coefficient for a given pair {A, B} exists and is determined only by the support value of item A and the support value of item B, as shown below in Lemma 1.

Lemma 1. Given an item pair {A, B}, the support value supp(A) for item A, and the support value supp(B) for item B, without loss of generality, let supp(A) ≥ supp(B).
The upper bound $upper(\phi_{\{A,B\}})$ of the φ correlation coefficient for an item pair {A, B} can be obtained when supp(A, B) = supp(B) and

$$upper(\phi_{\{A,B\}}) = \sqrt{\frac{supp(B)}{supp(A)}}\sqrt{\frac{1 - supp(A)}{1 - supp(B)}} \tag{3}$$

Proof: According to Equation 2, for an item pair {A, B}:

$$\phi_{\{A,B\}} = \frac{supp(A, B) - supp(A)\,supp(B)}{\sqrt{supp(A)\,supp(B)\,(1 - supp(A))\,(1 - supp(B))}}$$

When the support values supp(A) and supp(B) are fixed, $\phi_{\{A,B\}}$ is monotone increasing with the increase of the support value supp(A, B). By the given condition supp(A) ≥ supp(B) and the anti-monotone property of the support measure, we get that the maximum possible value of supp(A, B) is supp(B). As a result, the upper bound $upper(\phi_{\{A,B\}})$ of the φ correlation coefficient for an item pair {A, B} can be obtained when supp(A, B) = supp(B). Hence,

$$upper(\phi_{\{A,B\}}) = \frac{supp(B) - supp(A)\,supp(B)}{\sqrt{supp(A)\,supp(B)\,(1 - supp(A))\,(1 - supp(B))}} = \sqrt{\frac{supp(B)}{supp(A)}}\sqrt{\frac{1 - supp(A)}{1 - supp(B)}}.$$

As can be seen in Equation 3, the upper bound of the φ correlation coefficient for an item pair {A, B} relies only on the support value of item A and the support value of item B. In other words, there is no requirement to get the support value supp(A, B) of an item pair {A, B} for the calculation of this upper bound. As already noted, when the number of items N becomes very large, it is difficult to store the support of every item pair in memory, since N(N-1)/2 is a huge number. However, it is possible to store the support of individual items in main memory. As a result, this upper bound can serve as a coarse filter to filter out item pairs which are of no interest, thus saving I/O cost by reducing the computation of the support values of those pruned pairs.

3.2 Conditional Monotone Property
In this subsection, we present a conditional monotone property of the upper bound of the φ correlation coefficient as shown below in Lemma 2.

Lemma 2. For a pair of items {A, B}, if we let supp(A) > supp(B) and fix the item A, the upper bound $upper(\phi_{\{A,B\}})$ of pair {A, B} is monotone decreasing with the decrease of the support value of item B.
Proof: By Lemma 1, we get:

$$upper(\phi_{\{A,B\}}) = \sqrt{\frac{supp(B)}{supp(A)}}\sqrt{\frac{1 - supp(A)}{1 - supp(B)}}$$

For any given two items $B_1$ and $B_2$ with supp(A) > supp($B_1$) > supp($B_2$), we need to prove $upper(\phi_{\{A,B_1\}}) > upper(\phi_{\{A,B_2\}})$. This claim can be proved as follows:

$$\frac{upper(\phi_{\{A,B_1\}})}{upper(\phi_{\{A,B_2\}})} = \sqrt{\frac{supp(B_1)}{supp(B_2)}}\sqrt{\frac{1 - supp(B_2)}{1 - supp(B_1)}} > 1$$

The above follows from the given condition that supp($B_1$) > supp($B_2$) and (1 - supp($B_1$)) < (1 - supp($B_2$)).

Lemma 2 allows us to push the upper bound of the φ correlation coefficient into the search algorithm, thus efficiently pruning the search space.

Corollary 1. When searching for all pairs of items with correlations above a user-specified threshold θ, if an item list $\{i_1, i_2, \ldots, i_m\}$ is sorted by item supports in non-increasing order, an item pair $\{i_a, i_c\}$ with supp($i_a$) > supp($i_c$) can be pruned if $upper(\phi_{\{i_a, i_b\}}) < \theta$ and supp($i_c$) ≤ supp($i_b$).

Proof: First, when supp($i_c$) = supp($i_b$), we get $upper(\phi_{\{i_a, i_c\}}) = upper(\phi_{\{i_a, i_b\}}) < \theta$ according to Equation 3 and the given condition $upper(\phi_{\{i_a, i_b\}}) < \theta$, so we can prune the item pair $\{i_a, i_c\}$. Next, we consider supp($i_c$) < supp($i_b$). Since supp($i_a$) > supp($i_b$) > supp($i_c$), by Lemma 2, we get $upper(\phi_{\{i_a, i_c\}}) < upper(\phi_{\{i_a, i_b\}}) < \theta$. Hence, the pair $\{i_a, i_c\}$ is pruned.

4. THE TAPER ALGORITHM
In this section, we present the Two-step All-strong-Pairs correlation query (TAPER) algorithm. The TAPER algorithm is a filter-and-refine query processing strategy which consists of two steps: filtering and refinement.

The Filtering Step: In this step, the TAPER algorithm applies two pruning techniques. The first technique uses the upper bound of the φ correlation coefficient as a coarse filter. In other words, if the upper bound of the φ correlation coefficient for an item pair is less than the user-specified correlation threshold, we can prune this item pair right away. The second pruning technique prunes item pairs based on the conditional monotone property of the upper bound of the φ correlation coefficient. The correctness of this pruning is guaranteed by Corollary 1, and the process of this pruning is illustrated in Figure 1, as previously noted in the introduction. In summary, the purpose of the filtering step is to reduce false positive item pairs and further processing cost.

The Refinement Step: In the refinement step, the TAPER algorithm computes the exact correlation for each surviving pair from the filtering step and retrieves the pairs with correlations above the user-specified minimum correlation threshold as the query results.

Figure 3 shows the pseudocode of the TAPER algorithm, including the CoarseFilter and Refine procedures.

Procedure CoarseFilter works as follows. Line 1 initializes the variables and creates an empty query result set P.
Lines 2-10 use Rymon's generic set-enumeration tree search framework [6] to enumerate candidate pairs and filter out item pairs whose correlations are obviously less than the user-specified correlation threshold θ. Line 2 starts an outer loop. Each outer loop corresponds to a search tree branch. Line 3 specifies the reference item A, and line 4 starts a search within each branch. Line 5 specifies the target item B, and line 6 computes the upper bound of the φ correlation coefficient for item pair {A, B}. In line 7, if this upper bound is less than the user-specified correlation threshold θ, the search within this branch can stop by exiting from the inner loop, as shown in line 8. The reason is as follows. First, the reference item A is fixed in each branch and it has the maximum support value due to the way we construct the branch. Also, items within each branch are sorted based on their support in non-increasing order. Then, by Lemma 2, the upper bound of the φ correlation coefficient for the item pair {A, B} is monotone decreasing with the decrease of the support of item B. Hence, if we find the first target item B which results in an upper bound $upper(\phi_{\{A,B\}})$ that is less than the user-specified correlation threshold θ, we can stop the search in this branch. Line 10 calls the procedure Refine to compute the exact correlation for each surviving candidate pair and continues to check the next target item until no target item is left in the current search branch.

Procedure Refine works as follows. Line 1 gets the support for the item pair {A, B}. Note that the I/O cost can be very expensive for line 1 when the number of items is large, since we cannot store the support of all item pairs in memory. Line 2 calculates the exact correlation coefficient of this item pair. If the correlation is greater than the user-specified minimum correlation threshold, this item pair is returned as a query result in line 6. Otherwise, the procedure returns NULL in line 4.

TAPER ALGORITHM
Input:     S': an item list sorted by item supports in non-increasing order.
           θ: a user-specified minimum correlation threshold.
Output:    P: the result of the all-strong-pairs correlation query.
Variables: L: the size of item set S'.
           A: the item with larger support.
           B: the item with smaller support.

CoarseFilter(S', θ)    //The Filtering Step
1.  L = size(S'), P = ∅
2.  for i from 1 to L-1
3.      A = S'[i]
4.      for j from i+1 to L
5.          B = S'[j]
6.          upper(φ) = sqrt(supp(B)/supp(A)) * sqrt((1-supp(A))/(1-supp(B)))
7.          if (upper(φ) < θ) then    //Pruning by the monotone property
8.              break from inner loop
9.          else
10.             P = P ∪ Refine(A, B, θ)

Refine(A, B, θ)    //The Refinement Step
1.  Get the support supp(A, B) of item set {A, B}
2.  φ = (supp(A,B) - supp(A)supp(B)) / sqrt(supp(A)supp(B)(1-supp(A))(1-supp(B)))
3.  if φ < θ then
4.      return ∅    //return NULL
5.  else
6.      return {{A, B}, φ}

Figure 3: The TAPER Algorithm

Example 1. To illustrate the TAPER algorithm, consider a database shown in Figure 4. To simplify the discussion, we use an item list {1, 2, 3, 4, 5, 6} which is sorted by item support in non-increasing order. For a given correlation threshold 0.36, we can use Rymon's generic set-enumeration tree search framework [6] to demonstrate how two-step filter-and-refine query processing works. For instance, for the branch starting from item 1, we identify that the upper bound of the φ correlation coefficient for the item pair {1, 3} is 0.333, which is less than the given correlation threshold 0.36. Hence, we can prune this item pair immediately. Also, since the item list {1, 2, 3, 4, 5, 6} is sorted by item supports in non-increasing order, we can prune pairs {1, 4}, {1, 5}, and {1, 6} by Lemma 2 without any further computation cost. In contrast, for the traditional filter-and-refine paradigm, the coarse filter can only prune the item pair {1, 3}. There is no technique to prune item pairs {1, 4}, {1, 5}, and {1, 6}. Finally, in the refinement step, only seven item pairs are required to compute the exact correlation coefficients, as shown in Figure 4 (c). More than half of the item pairs are pruned in the filter step even though the correlation threshold is as low as 0.36.
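The two procedures and Example 1 can be tried out with a short sketch. It is a minimal in-memory illustration, not the authors' C++ implementation: the function names and the way pair supports are counted are assumptions, and the ten transactions are my reading of Figure 4 (a), so they should be treated as assumed data that happens to reproduce the item supports 0.9, 0.8, 0.5, 0.3, 0.2, and 0.1.

```python
from math import sqrt

def upper_bound(supp_a, supp_b):
    """Equation 3; requires 0 < supp_b <= supp_a < 1."""
    return sqrt(supp_b / supp_a) * sqrt((1 - supp_a) / (1 - supp_b))

def phi(supp_ab, supp_a, supp_b):
    """Equation 2: the phi correlation coefficient in support form."""
    return (supp_ab - supp_a * supp_b) / sqrt(
        supp_a * supp_b * (1 - supp_a) * (1 - supp_b))

def taper(transactions, theta):
    """Return (candidates, strong_pairs) for a list of item sets."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    supp = {item: c / n for item, c in counts.items()}
    # Item list sorted by support in non-increasing order.
    items = sorted(supp, key=lambda item: (-supp[item], item))

    candidates, strong = [], []
    for i, a in enumerate(items):           # one branch per reference item
        for b in items[i + 1:]:
            if upper_bound(supp[a], supp[b]) < theta:
                break                       # Lemma 2: the rest of the branch is pruned
            candidates.append((a, b))
            # Refinement: the only place where a pair support is needed.
            supp_ab = sum(1 for t in transactions if a in t and b in t) / n
            r = phi(supp_ab, supp[a], supp[b])
            if r >= theta:
                strong.append((a, b, round(r, 3)))
    return candidates, strong

# Assumed transactions (reconstructed from Figure 4 (a)).
transactions = [
    {1, 2, 3}, {1, 2, 3}, {1, 3}, {1, 2}, {1, 2},
    {1, 2}, {1, 2, 3, 4, 5, 6}, {1, 2, 4, 5}, {1, 2, 4}, {3},
]
candidates, strong = taper(transactions, theta=0.36)
```

With these inputs the filtering step leaves seven candidate pairs, in line with Example 1, and four of them pass the refinement step at θ = 0.36. Counting pair supports by rescanning the transactions keeps the sketch short; a real implementation would obtain them from the database in a more I/O-conscious way.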

(a) Transactions
TID   Items
1     1, 2, 3
2     1, 2, 3
3     1, 3
4     1, 2
5     1, 2
6     1, 2
7     1, 2, 3, 4, 5, 6
8     1, 2, 4, 5
9     1, 2, 4
10    3

(b) Item supports
Item      1     2     3     4     5     6
Support   0.9   0.8   0.5   0.3   0.2   0.1

(c) Upper bounds and correlations
Pair     UPPER(φ)   Correlation
{1,2}    0.667      0.667
{1,3}    0.333      -
{1,4}    -          -
{1,5}    -          -
{1,6}    -          -
{2,3}    0.5        -0.5
{2,4}    0.327      -
{2,5}    -          -
{2,6}    -          -
{3,4}    0.655      -0.218
{3,5}    0.5        0
{3,6}    0.333      -
{4,5}    0.764      0.764
{4,6}    0.509      0.509
{5,6}    0.667      0.667

Figure 4: Illustration of the filter-and-refine strategy. A dash means there is no computation required.

5. ANALYSIS OF THE TAPER ALGORITHM
In this section, we analyze TAPER in the areas of completeness, correctness, and computation savings.

5.1 Completeness and Correctness
Lemma 3. The TAPER algorithm is complete. In other words, this algorithm finds all pairs which have correlations above a user-specified minimum correlation threshold.

Proof: The proof of this lemma, as well as the proofs of some of the following lemmas, is presented in our Technical Report [7].

Lemma 4. The TAPER algorithm is correct. In other words, every pair this algorithm finds has a correlation above a user-specified minimum correlation threshold.

5.2 Quantifying the Computation Savings
This section presents analytical results for the amount of computational savings obtained by TAPER. First, we illustrate the relationship between the choices of the minimum correlation threshold and the size of the reduced search space (after performing the filtering step). Knowing the relationship gives us an idea of the amount of pruning achieved using the upper-bound function of correlation.

Figure 5 illustrates a 2-dimensional plot for every possible combination of support pairs, supp(x) and supp(y). If we impose the constraint that supp(x) ≤ supp(y), then all item pairs must be projected to the upper left triangle since the diagonal line represents the condition supp(x) = supp(y). To determine the size of the reduced search space, let us start from the upper bound function of correlation.
$$upper(\phi_{\{x,y\}}) = \sqrt{\frac{supp(x)}{supp(y)}}\sqrt{\frac{1 - supp(y)}{1 - supp(x)}} < \theta$$

$$\Longrightarrow\; supp(x)\,(1 - supp(y)) < \theta^2\, supp(y)\,(1 - supp(x))$$

$$\Longrightarrow\; supp(y) > \frac{supp(x)}{\theta^2 + (1 - \theta^2)\,supp(x)} \tag{4}$$

The above inequality provides a lower bound on supp(y) such that any item pair involving x and y can be pruned using the conditional monotone property of the upper bound function. In other words, any surviving item pair that undergoes the refinement step must violate the condition given in Equation 4. These item pairs are indicated by the shaded region shown in Figure 5. During the refinement step, TAPER has to compute the exact correlation for all item pairs that fall in the shaded region between the diagonal and the polyline drawn by Equation 5.

$$supp(y) = \frac{supp(x)}{\theta^2 + (1 - \theta^2)\,supp(x)} \tag{5}$$

As can be seen from Figure 5, the size of the reduced search space depends on the choice of minimum correlation threshold. If we increase the threshold from 0.5 to 0.8, the search space for the refinement step is reduced substantially. When the correlation threshold is 1, the polyline from Equation 5 overlaps with the diagonal line. In this limit, the search space for the refinement step becomes zero.

[Figure 5: An illustration of the reduced search space for the refinement step of the TAPER algorithm. Only item pairs within the shaded region must be computed for their correlation. Axes: supp(x) (horizontal) and supp(y) (vertical); polylines are drawn for θ = 0.5, θ = 0.8, and θ = 1.]

The above analysis shows only the size of the reduced search space that must be explored during the refinement step of the TAPER algorithm. The actual amount of pruning achieved by TAPER depends on the support distribution of items in the database. To facilitate our discussion, we first introduce the definitions of several concepts used in the remainder of this section.

Definition 1. The pruning ratio of the TAPER algorithm is defined by the following equation.

$$\gamma(\theta) = \frac{S(\theta)}{T}, \tag{6}$$

where θ is the minimum correlation threshold, S(θ) is the number of item pairs which are pruned before computing their exact correlations at the correlation threshold θ, and T is the total number of item pairs in the database.
For a given database, T is a fixed number and is equal to $\binom{n}{2} = \frac{n(n-1)}{2}$, where n is the number of items.

Definition 2. For a sorted item list, the rank-support function f(k) is a discrete function which presents the support in terms of the rank k.
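Definitions 1 and 2 can be made concrete with a small numerical sketch. It measures the pruning ratio empirically by running only the filtering step over a rank-support function; the function names, the generalized Zipf parameters c = 0.8 and p = 1, and the choice of 200 items are hypothetical, and the six supports are those I read from Figure 4 (b).

```python
from math import sqrt

def upper_bound(supp_a, supp_b):
    """Support-based upper bound of phi (Equation 3), for supp_a >= supp_b."""
    return sqrt(supp_b / supp_a) * sqrt((1 - supp_a) / (1 - supp_b))

def pruning_ratio(f, n, theta):
    """Empirical gamma(theta) = S(theta) / T over ranks 1..n.

    f is the rank-support function: f(k) is the support of the k-th item.
    It must be non-increasing in k, with values strictly between 0 and 1.
    """
    total_pairs = n * (n - 1) // 2
    refined = 0                      # pairs that reach the refinement step
    for j in range(1, n):
        for l in range(j + 1, n + 1):
            if upper_bound(f(j), f(l)) < theta:
                break                # ranks l..n are all pruned for this j
            refined += 1
    return (total_pairs - refined) / total_pairs

# Rank-support function of the six-item example (assumed supports).
example = [0.9, 0.8, 0.5, 0.3, 0.2, 0.1]
gamma_example = pruning_ratio(lambda k: example[k - 1], len(example), 0.36)

# Hypothetical generalized Zipf rank-support function f(k) = c / k**p.
def zipf(k, c=0.8, p=1.0):
    return c / k ** p

gamma_low = pruning_ratio(zipf, 200, 0.5)
gamma_high = pruning_ratio(zipf, 200, 0.8)
```

For the six-item example, 8 of the 15 pairs are pruned, so `gamma_example` is 8/15. Because this measures the pruning directly rather than through a closed-form estimate, it can serve as a sanity check on an analytical cost model.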

For a given database, let $I = \{A_1, A_2, \ldots, A_n\}$ be an item list sorted by item supports in non-increasing order. Then item $A_1$ has the maximum support and the rank-support function is $f(k) = supp(A_k)$, $1 \le k \le n$, which is monotone decreasing with the increase of the rank k. To quantify the computation savings for a given item $A_j$ ($1 \le j < n$) at the threshold θ, we need to find only the first item $A_l$ ($j < l \le n$) such that $upper(\phi_{\{A_j, A_l\}}) < \theta$. By Lemma 2, if $upper(\phi_{\{A_j, A_l\}}) < \theta$, we can guarantee that $upper(\phi_{\{A_j, A_i\}})$, where $l \le i \le n$, is less than the correlation threshold θ. In other words, all these n - l + 1 pairs can be pruned without a further computation requirement. According to Lemma 1, we get

$$upper(\phi_{\{A_j, A_l\}}) = \sqrt{\frac{supp(A_l)}{supp(A_j)}}\sqrt{\frac{1 - supp(A_j)}{1 - supp(A_l)}} < \sqrt{\frac{supp(A_l)}{supp(A_j)}} = \sqrt{\frac{f(l)}{f(j)}} < \theta$$

Since the rank-support function f(k) is monotone decreasing with the increase of the rank k, we get

$$l > f^{-1}(\theta^2 f(j))$$

To make the computation simple, we let $l = f^{-1}(\theta^2 f(j)) + 1$. Therefore, for a given item $A_j$ ($1 < j \le n$), the computation cost for $(n - f^{-1}(\theta^2 f(j)))$ item pairs can be saved. As a result, the total computation savings of the TAPER algorithm is shown below in Equation 7. Note that the computation savings shown in Equation 7 is an underestimate of the real computation savings which can be achieved by the TAPER algorithm.

$$S(\theta) = \sum_{j=2}^{n} \{n - f^{-1}(\theta^2 f(j))\} \tag{7}$$

Finally, we conduct computation savings analysis on data sets with some special rank-support distributions. Specifically, we consider three special rank-support distributions: a uniform distribution, a linear distribution, and a generalized Zipf distribution [8], as shown in the following three cases.

CASE I: A Uniform Distribution. In this case, the rank-support function is f(k) = C, where C is a constant. According to Equation 3, the upper bound of the φ correlation coefficient for any item pair is 1, which is the maximum possible value for the correlation. Hence, for any given item $A_j$, we cannot find an item $A_l$ ($j < l \le n$) such that $upper(\phi_{\{A_j, A_l\}}) < \theta$, where θ ≤ 1. As a result, the total computation savings S(θ) is zero.
CASE II: A Linear Distribution. In this case, the rank-support function has a linear distribution and f(k) = a - mk, where m is the absolute value of the slope and a is the intercept.

Lemma 5. When a database has a linear rank-support distribution f(k) = a - mk (a > 0, m > 0), for a user-specified minimum correlation threshold θ, the pruning ratio of the TAPER algorithm increases with the decrease of the ratio a/m, the increase of the correlation threshold θ, and the increase of the number of items, where 0 < θ ≤ 1.

CASE III: A Generalized Zipf Distribution. In this case, the rank-support function has a generalized Zipf distribution and $f(k) = \frac{c}{k^p}$, where c and p are constants and p ≥ 1. When p is equal to 1, the rank-support function has a Zipf distribution.

Lemma 6. When a database has a generalized Zipf rank-support distribution $f(k) = \frac{c}{k^p}$, for a user-specified minimum correlation threshold θ, the pruning ratio of the TAPER algorithm increases with the increase of p and the correlation threshold θ, where 0 < θ ≤ 1. Furthermore, the pruning ratio is independent of the number of items as that number increases.

Proof: Since the rank-support function is $f(k) = \frac{c}{k^p}$, the inverse function is $f^{-1}(y) = \left(\frac{c}{y}\right)^{1/p}$. Accordingly,

$$f^{-1}(\theta^2 f(j)) = \left(\frac{c}{\theta^2 \frac{c}{j^p}}\right)^{1/p} = \frac{j}{(\theta^2)^{1/p}}$$

Applying Equation 7, we get:

$$S(\theta) = \sum_{j=2}^{n} \{n - f^{-1}(\theta^2 f(j))\} = n(n-1) - \sum_{j=2}^{n} \frac{j}{(\theta^2)^{1/p}} = n(n-1) - \frac{(n-1)(n+2)}{2}\cdot\frac{1}{(\theta^2)^{1/p}}$$

Since the pruning ratio is $\gamma(\theta) = \frac{S(\theta)}{T}$ and $T = \frac{n(n-1)}{2}$,

$$\gamma(\theta) = 2 - \frac{n+2}{n}\cdot\frac{1}{(\theta^2)^{1/p}}$$

Thus, we can derive three rules as follows:

rule 1: $\theta \nearrow \;\Longrightarrow\; \frac{n+2}{n}\cdot\frac{1}{(\theta^2)^{1/p}} \searrow \;\Longrightarrow\; \gamma(\theta) \nearrow$
rule 2: $p \nearrow \;\Longrightarrow\; \frac{n+2}{n}\cdot\frac{1}{(\theta^2)^{1/p}} \searrow \;\Longrightarrow\; \gamma(\theta) \nearrow$
rule 3: $n \to \infty \;\Longrightarrow\; \frac{n+2}{n} \to 1$, so $\lim_{n \to \infty} \gamma(\theta) = 2 - \frac{1}{(\theta^2)^{1/p}}$

Therefore, the claim that the pruning ratio of the TAPER algorithm increases with the increase of p and the correlation threshold θ holds. Also, rule 3 indicates that the pruning ratio is independent of the number of items as that number increases in data sets with Zipf distributions.

6. EXPERIMENTAL RESULTS
In this section, we present the results of extensive experiments to evaluate the performance of the TAPER algorithm.
Specifically, we demonstrate: (1) a performance comparison between the TAPER algorithm and a brute-force approach, (2) the effectiveness of the proposed algebraic cost model, and (3) the scalability of the TAPER algorithm.

Experimental Data Sets: Our experiments were performed on both real and synthetic data sets. Synthetic data sets were generated such that the rank-support distributions follow Zipf's law, as shown in Figure 6. Note that, in log-log scale, the rank-support plot of a Zipf distribution will be a straight line with a slope whose magnitude is equal to the exponent P in the Zipf distribution. A summary of the parameter settings used to generate the synthetic data sets is presented in Table 1, where T is the number of transactions, N is the number of items, C is the constant of a generalized Zipf distribution, and P is the exponent of a generalized Zipf distribution.

Table 1: Parameters of Synthetic Data Sets.
Data set name   T   N   C     P
P1.tab                  0.8   1
P2.tab                  0.8   1.25
P3.tab                  0.8   1.5
P4.tab                  0.8   1.75
P5.tab                  0.8   2

[Figure 6: The plot of the Zipf rank-support distributions of synthetic data sets in log-log scale. Axes: rank of items (log scale) and support of items (log scale); one line each for P1.tab, P2.tab, P3.tab, P4.tab, and P5.tab.]

Table 2: Real Data Set Characteristics.
Data set    #Items   #Records   Source
Pumsb       2113     49046      IBM Almaden
Pumsb*      2089     49046      IBM Almaden
Chess       75       3196       UCI Repository
Mushroom    119      8124       UCI Repository
Connect     127      67557      UCI Repository
LA1         29704    3204       TREC-5
Retail      14462    57671      Retail Store

[Figure 7: TAPER vs. a brute-force approach on the Pumsb, Pumsb*, and retail data sets. Panels: (a) Pumsb, (b) Pumsb*, (c) Retail. Axes: Minimum Correlation Thresholds and Execution Time (sec); series: Brute-force and TAPER.]

[Figure 8: The pruning effect of TAPER on Pumsb, Pumsb*, and retail data sets. Panels: (a) Pumsb, (b) Pumsb*, (c) Retail. Axes: Minimum Correlation Thresholds and Pruning Ratio.]

The real data sets were obtained from several different application domains. Table 2 shows some characteristics of these data sets. The first five data sets in the table, i.e., pumsb, pumsb*, chess, mushroom, and connect, are often used as benchmarks for evaluating the performance of association rule algorithms on dense data sets. The pumsb and pumsb* data sets correspond to binarized versions of a census data set from IBM. The difference between them is that pumsb* does not contain items with support greater than 80%.
The chess, mushroom, and connect data sets are benchmark data sets from the UCI machine learning repository. The LA1 data set is part of the TREC-5 collection (http://trec.nist.gov) and contains news articles from the Los Angeles Times. Finally, retail is a masked data set obtained from a large mail-order company.

Experimental Platform: We implemented TAPER using C++ and all experiments were performed on a Sun Ultra 10 workstation with a 440 MHz CPU and 128 Mbytes of memory running the SunOS 5.7 operating system.

6.1 TAPER vs. the Brute-force Approach
In this subsection, we present a performance comparison between the TAPER algorithm and a brute-force approach using several benchmark data sets from IBM (footnote 1), the UCI machine learning repository (footnote 2), and some other sources, such as retail stores. The implementation of the brute-force approach is similar to that of the TAPER algorithm except that the filtering mechanism implemented in the TAPER algorithm is not included in the brute-force approach.

Footnote 1: These data sets are obtained from IBM Almaden at http://www.almaden.ibm.com/cs/quest/demos.html.
Footnote 2: These data sets and data content descriptions are available at http://www.ics.uci.edu/~mlearn/MLRepository.html.

[Figure 9: The pruning effect of TAPER on UCI Connect, Mushroom, Chess data sets. Panels: (a) Connect, (b) Mushroom, (c) Chess. Axes: Minimum Correlation Thresholds and Pruning Ratio.]

Figure 7 shows the relative computation performance of the TAPER algorithm and the brute-force approach on the pumsb, pumsb*, and retail data sets. As can be seen, the performance of the brute-force approach does not change much for any of the three data sets. However, the TAPER algorithm can be an order of magnitude faster than the brute-force approach even if the minimum correlation threshold is low. For instance, as shown in Figure 7 (a), the execution time of TAPER on the pumsb data set is one order of magnitude less than that of the brute-force approach at the correlation threshold 0.4. Also, when the minimum correlation threshold increases, the execution time of TAPER dramatically decreases on the pumsb data set. Similar computation effects can also be observed on the pumsb* and retail data sets, although the computation savings on the retail data set is not as significant as it is on the other two data sets.

To better understand the above computation effects, we also present the pruning ratio of the TAPER algorithm on these data sets in Figure 8. As can be seen, the pruning ratio of TAPER on the retail data set is much smaller than that on the pumsb and pumsb* data sets. This smaller pruning ratio explains why the computation savings on retail is less than that on the other two data sets. Also, Figure 9 shows the pruning ratio of TAPER on the UCI connect, mushroom, and chess data sets. The pruning ratios achieved on these data sets are comparable with the pruning ratio we obtained on the pumsb data set.
This indicates that TAPER also achieves much better computation performance than the brute-force approach on the UCI benchmark data sets.

6.2 The Effect of Correlation Threshold

In this subsection, we present the effect of the correlation threshold on the computation savings of the TAPER algorithm. Recall that our algebraic cost model shows that the pruning ratio of the TAPER algorithm increases as the correlation threshold increases for data sets with linear and Zipf-like distributions. Figure 8 shows such an increasing trend of the pruning ratio on the pumsb, pumsb*, and retail data sets as the correlation threshold increases. Also, Figure 9 shows a similar increasing trend of the pruning ratio on the UCI benchmark data sets, including mushroom, chess, and connect. One common feature of all the above data sets is the skewed nature of their rank-support distributions. As a result, these experimental results still exhibit a trend similar to that of the proposed algebraic cost model, although the rank-support distributions of these data sets do not follow Zipf's law exactly.

Table 3: Groups of items for the Retail data set
Group             I       II      III
# Items           47      47      47
# Transactions    5767    5767    5767
a/m               38      849     4778

[Figure 10: The plot of the rank-support distributions of the retail data set and its three item groups with a linear regression fitting line (trendline). Panels: (a) Retail data set, (b) Group I, trendline y = .439 - .36x, (c) Group II, trendline y = .3488 - .38x, (d) Group III, trendline y = .6997 - .3369x. Axes: items sorted by support vs. support.]
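The threshold effect discussed in Section 6.2 can be illustrated with a minimal Python sketch. It assumes the support-based upper bound of φ used by TAPER, upper(φ) = sqrt(supp(B)/supp(A)) · sqrt((1 − supp(A))/(1 − supp(B))) for supp(A) ≥ supp(B), and it takes the pruning ratio to be the fraction of item pairs whose upper bound falls below θ. The function names and the Zipf-like supports are hypothetical, chosen only for illustration; they are not the data sets used in this section.

```python
from itertools import combinations
from math import sqrt


def phi_upper_bound(supp_a, supp_b):
    """Support-based upper bound of phi; supports must lie strictly in (0, 1)."""
    hi, lo = max(supp_a, supp_b), min(supp_a, supp_b)
    return sqrt(lo / hi) * sqrt((1.0 - hi) / (1.0 - lo))


def pruning_ratio(supports, theta):
    """Fraction of item pairs whose upper bound is below theta (filter step only)."""
    pairs = list(combinations(supports, 2))
    pruned = sum(1 for a, b in pairs if phi_upper_bound(a, b) < theta)
    return pruned / len(pairs)


# Hypothetical Zipf-like supports: support(rank) = 0.5 / rank (illustrative only).
supports = [0.5 / rank for rank in range(1, 201)]

# The pruned fraction can only grow as the threshold is raised.
ratios = [pruning_ratio(supports, theta) for theta in (0.3, 0.6, 0.9)]
for theta, ratio in zip((0.3, 0.6, 0.9), ratios):
    print(f"theta={theta}: pruning ratio={ratio:.3f}")
```

Because the bound depends only on single-item supports, a sketch like this needs no pair-level counting; only the pairs that survive the filter would go on to an exact correlation computation in the refinement step.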
6.3 The Effect of the Slope m

Recall that the algebraic cost model for data sets with a linear rank-support distribution provides rules which indicate that the pruning ratio of the TAPER algorithm increases as the ratio a/m decreases and as the correlation threshold increases. In this subsection, we empirically evaluate the effect of the ratio a/m on the performance of the TAPER algorithm for data sets with a linear rank-support distribution.

[Figure 11: Pruning ratios with the decrease of a/m for data sets with linear rank-support distribution. Curves: correlation threshold = 0.9, 0.8, 0.3. Axes: a/m ratio vs. pruning ratio.]

[Figure 13: The plot of the rank-support distribution of the LA data set in log-log scale. Axes: rank of items (log scale) vs. support of items (log scale).]

First, we generated three groups of data from the retail data set by sorting all the items in the data set in non-decreasing order of support and then partitioning them into four groups. Each of the first three groups contains 47 items and the last group contains 36 items. The first three groups are the group data sets shown in Table 3. Figure 10(a) shows the plot of the rank-support distribution of the retail data set, and Figures 10(b), (c), and (d) show the plots of the rank-support distributions of the three groups of data generated from the retail data set. As can be seen, the rank-support distributions of the three groups approximately follow a linear distribution. Table 3 lists some of the characteristics of these data set groups. Each group has the same number of items and transactions but a different a/m ratio. Group I has the highest a/m ratio and Group III has the lowest a/m ratio. Since the major difference among these three data set groups is the ratio a/m, we can use them to show the impact of a/m on the performance of the TAPER algorithm.

Figure 11 shows the pruning ratio of the TAPER algorithm on the data sets with linear rank-support distributions. As can be seen, the pruning ratio increases as the a/m ratio decreases at different correlation thresholds. The pruning ratio also increases as the correlation threshold is increased. These experimental results confirm the trends exhibited by the cost model.
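As a rough illustration of how an a/m ratio like those in Table 3 could be obtained, the sketch below fits a least-squares trendline to a rank-support distribution. It assumes the trendline has the form y = a − m·x, with items ranked in decreasing order of support so that m is positive. This reading of a and m, the function name, and the sample supports are assumptions made for illustration, not the authors' code or data.

```python
def fit_rank_support_line(supports):
    """Least-squares fit of support = a - m * rank; returns (a, m, a / m).

    Assumes at least two items and a non-flat distribution (m != 0).
    """
    ys = sorted(supports, reverse=True)  # rank 1 = highest support (assumed)
    n = len(ys)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # negative for a decreasing line
    a = mean_y - slope * mean_x          # intercept of the trendline
    m = -slope                           # slope magnitude
    return a, m, a / m


# Hypothetical, exactly linear supports: support(rank) = 0.05 - 0.0001 * rank.
supports = [0.05 - 0.0001 * rank for rank in range(1, 401)]
a, m, ratio = fit_rank_support_line(supports)
print(f"a={a:.4f}, m={m:.6f}, a/m={ratio:.1f}")
```

Under this reading, a/m is the rank at which the fitted line reaches zero support, so a smaller a/m corresponds to a steeper, more skewed distribution relative to its intercept.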
[Figure 12: The increase of pruning ratio with the increase of p for data sets with Zipf-like distribution. Curves: correlation threshold = 0.9, 0.6, 0.3. Axes: the exponent p vs. pruning ratio.]

6.4 The Effect of the Exponent p

In this subsection, we examine the effect of the exponent P on the performance of the TAPER algorithm for data sets with a generalized Zipf rank-support distribution. We used the synthetic data sets presented in Table 1 for this experiment. All the synthetic data sets in the table have the same number of transactions and items. The rank-support distributions of these data sets follow Zipf's law but with different exponents P. Figure 12 displays the pruning ratio of the TAPER algorithm on data sets with different exponents P. Again, the pruning ratio of the TAPER algorithm increases as the exponent P increases at different correlation thresholds. Also, we can observe that the pruning ratio of the TAPER algorithm increases as the correlation threshold increases. Recall that the proposed algebraic cost model for data sets with a generalized Zipf distribution provides two rules which confirm the above two observations.

[Figure 14: The effect of database dimensions on the pruning ratio for data sets with Zipf-like rank-support distributions. Curves: correlation threshold = 0.8, 0.6, 0.3. Axes: number of items vs. pruning ratio.]

6.5 The Scalability of TAPER

In this subsection, we show the scalability of the TAPER algorithm with respect to database dimensions. Figure 13 shows the plot of the rank-support distribution of the LA data set in log-log scale. Although this plot does not follow Zipf's law exactly, it does show Zipf-like behavior. In other words, the LA data set has an approximate Zipf-like distribution with the exponent P =.46. In this experiment, we generated three data sets, with, 8, and 4 items respectively, from the LA data set by random sampling on the item set. Due to the random sampling, the three data sets can have almost the same rank-support distributions as the LA data set. As a result, we used these three generated data sets and the LA data set for our scale-up experiments.

[Figure 15: The effect of database dimensions on the execution time for data sets with Zipf-like rank-support distributions. Curves: correlation threshold = 0.3, 0.6, 0.8. Axes: number of items vs. execution time (sec).]

For data sets with Zipf-like rank-support distributions, Figure 14 shows the effect of database dimensions on the performance of the TAPER algorithm. As can be seen, the pruning ratio of the TAPER algorithm shows almost no change or increases slightly at different correlation thresholds. This indicates that the pruning ratio of the TAPER algorithm can be maintained when the number of items is increased. Recall that the proposed algebraic cost model for data sets with a generalized Zipf distribution exhibits a trend similar to the result of this experiment.

Finally, in Figure 15, we show that the execution time for our scale-up experiments increases linearly with the number of items at several different minimum correlation thresholds.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed using an upper bound of the φ correlation coefficient, which shows a conditional monotonic property. Based on this upper bound, we designed an efficient two-step filter-and-refine algorithm, called TAPER, to search for all the item pairs with correlations above a user-specified minimum correlation threshold. In addition, we provided an algebraic cost model to quantify the computation savings of TAPER. As demonstrated by our experimental results on both real and synthetic data sets, the pruning ratio of TAPER can be maintained or even increases with the increase of database dimensions, and the performance of TAPER confirms the proposed algebraic cost model.

There are several potential directions for future research. First, we plan to generalize the TAPER algorithm as a standard algorithm for efficient computation of other measures of association. In particular, we will examine the potential upper bound functions of other measures for their monotone property.
Second, we propose to extend our methodology to answer correlation-like queries beyond pairs of items. Finally, we will extend the TAPER algorithm to find all pairs of high negatively correlated items.

8. ACKNOWLEDGMENTS

This work was partially supported by NASA grant # C 3, DOE/LLNL W-745-ENG-48, and by Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD9--- 4. The content of this work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.