Efficient Approximate Similarity Search Using Random Projection Learning


Peisen Yuan 1, Chaofeng Sha 1, Xiaoling Wang 2, Bin Yang 1, and Aoying Zhou 2

1 School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, P.R. China
2 Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, Shanghai 200062, P.R. China

Abstract. Efficient similarity search on high-dimensional data is an important research topic in the database and information retrieval fields. In this paper, we propose a random projection learning approach for solving the approximate similarity search problem. First, the random projection technique of locality sensitive hashing is applied to generate high-quality binary codes. Then the binary codes are treated as labels, and a group of SVM classifiers is trained on the labeled data for predicting the binary codes of similarity queries. Experiments on real datasets demonstrate that our method substantially outperforms the existing work in terms of preprocessing time and query processing time.

H. Wang et al. (Eds.): WAIM 2011, LNCS 6897, pp. 517-529, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Similarity search, also known as the k-nearest neighbor query, is a classical problem and a core operation in the database and information retrieval fields. It has been extensively studied and applied in many areas, such as content-based multimedia retrieval, time series and scientific databases, and text documents. A common characteristic of these kinds of data is high dimensionality. Similarity search on high-dimensional data is a big challenge due to its time and space demands. However, in many real applications, approximate results obtained under tighter time and space constraints can also satisfy the users' requirements. For example, in content-based image retrieval, a similar image can be returned as the result. Recently, researchers have proposed approximate methods for the similarity query, which provide satisfying results with a considerable efficiency improvement [1-5].

Locality sensitive hashing (LSH for short) [1] is an efficient way of processing similarity search approximately. The principle of LSH is that the more similar two objects are, the higher the probability that they fall into the same hash bucket. The random projection technique of LSH is designed to approximately evaluate the cosine similarity between vectors; it transforms high-dimensional data into much lower-dimensional, compact bit vectors. However, LSH is a data-independent approach. Recently, learning-based data-aware methods, such as semantic hashing [6], have been proposed, which improve the search

efficiency with much shorter binary codes. The key to the learning-based approaches is designing a way to obtain the binary codes for the data and the query. For measuring the quality of the binary codes, the entropy maximizing criterion has been proposed [6]. The state of the art among learning-based techniques is self-taught hashing (STH for short) [7], which converts similarity search into a two-stage learning problem. The first stage is unsupervised learning of the binary codes, and the second is supervised learning with binary classifiers. In order to obtain binary codes satisfying the entropy maximizing criterion, the adjacency matrix of the k-NN graph of the dataset is constructed first. After solving this matrix by the binarised Laplacian Eigenmap, the median of the eigenvector entries is set as the threshold for obtaining the bit labels: if an entry is larger than the threshold, the corresponding label is 1, otherwise it is 0. In the second stage, the binary codes of the objects are taken as class labels, and classifiers are trained to predict the binary code of the query. However, the time and space complexity of this preprocessing stage is considerably high.

In this paper, based on the filter-and-refine framework, we propose a random projection learning (RPL for short) approach for the approximate similarity search problem, which requires much less time and space for acquiring the binary codes in the preprocessing step. First, the random projection technique is used to obtain the binary codes of the data objects. The binary codes are used as the labels of the data objects, and then l SVM classifiers are trained, which are used to predict the binary labels of the queries. We prove that the binary code produced by random projection satisfies the entropy maximizing criterion required by semantic hashing. Theoretical analysis and an empirical study on real datasets are conducted, which demonstrate that our method attains comparable effectiveness at much lower time and space cost.

To summarize, the main contributions of this paper are briefly outlined as follows:
- A random projection learning method for similarity search is proposed, and the approximate k-nearest neighbor query is studied.
- The random projection of LSH is proved to satisfy the entropy maximizing criterion required by semantic hashing.
- Extensive experiments are conducted to demonstrate the effectiveness and efficiency of our method.

The rest of the paper is organized as follows. The random projection learning method for approximate similarity search is presented in Section 2. An extensive experimental evaluation is presented in Section 3. In Section 4, the related work is briefly reviewed. In Section 5, the conclusion is drawn and future work is summarized.

2 Random Projection Learning

2.1 The Framework

The processing framework of RPL is described in Figure 1(a). Given a data set S, the random projection technique of LSH is first used to obtain the binary codes. After that, the binary codes are treated as the labels of the data objects. Then l SVM classifiers

are trained, which are used to predict the binary labels of queries.

Fig. 1. The processing framework and the training with signatures: (a) the processing framework of RPL, (b) SVM training on the LSH signature vectors.

The binary code produced by random projection is proved to satisfy the entropy maximizing criterion (see the Appendix). To answer a query, the binary labels of the query are first learned with these l classifiers. Then the similarities are evaluated in the Hamming space, and the objects whose Hamming distance to the query is less than a threshold are treated as candidates. The distances or similarities are then evaluated on the candidate set, and the results are re-ranked and returned. In this paper, we consider the approximate k-NN query. Since there is no need to compute the k-NN graph, the Laplacian Eigenmap, or the median, our framework can answer queries with much less preprocessing time and space than STH [7]. Throughout this paper, cosine similarity and Euclidean distance are used as the similarity and distance metrics unless otherwise specified.

2.2 Random Projection

M. Charikar [2] proposes the random projection technique using random hyperplanes, which preserves the cosine similarity between vectors in the lower-dimensional space. Random projection is a locality sensitive hashing method for dimensionality reduction, designed for approximating the cosine similarity. Let u and v be two vectors in R^d and θ(u, v) be the angle between them. The cosine similarity between u and v is defined as Eq. 1:

cosine(θ(u, v)) = (u · v) / (||u|| ||v||)    (1)

Given a vector u ∈ R^d, a random vector r is generated with each component chosen randomly from the standard normal distribution N(0, 1). Each hash function h_r of the random projection LSH family H is defined as Eq. 2:

h_r(u) = 1 if r · u ≥ 0; 0 otherwise.    (2)

Given the hash family H and vectors u and v, Eq. 3 can be obtained [2]:

Pr[h_r(u) = h_r(v)] = 1 - θ(u, v)/π    (3)
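As a concrete illustration of Eq. 2 and Eq. 3 (anticipating the inversion used in Eq. 4 below), here is a minimal NumPy sketch; it is ours, not the paper's code, and the function names are hypothetical:

import numpy as np

def h_r(r, u):
    # Eq. 2: one random-hyperplane hash bit, 1 iff r . u >= 0.
    return 1 if np.dot(r, u) >= 0 else 0

def estimated_cosine(sig_u, sig_v):
    # The fraction of agreeing bits estimates Pr[h_r(u) = h_r(v)] (Eq. 3);
    # cos((1 - Pr) * pi) then approximates cosine(theta(u, v)) (Eq. 4).
    p = np.mean(np.asarray(sig_u) == np.asarray(sig_v))
    return np.cos((1.0 - p) * np.pi)

The longer the signatures, the closer the empirical collision rate is to the probability in Eq. 3, and hence the better the cosine estimate.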

From Eq. 3, the cosine similarity between u and v can be approximately evaluated with Eq. 4 (the similarity value of Eq. 4 can be normalized between 0 and 1):

cosine(θ(u, v)) = cosine((1 - Pr[h_r(u) = h_r(v)]) π)    (4)

For the random projection of LSH, l hash functions h_r1, ..., h_rl are chosen from H. After hashing, the vector u can be represented by the signature s as Eq. 5:

s = {h_r1(u), ..., h_rl(u)}    (5)

The more similar the data vectors are, the higher the probability that they are projected to the same labels.

2.3 The Algorithm

The primary ideas of RPL are: (1) similar data vectors have similar binary codes after random projection, so the disparity between the binary code predicted for the query and the codes of the data vectors should be small; (2) the smaller the distance between two vectors, the higher the chance that they belong to the same class. Before introducing the query processing, the definitions used in the following sections are introduced first.

Definition 1. Hamming Distance. Given two binary vectors v1 and v2 with equal length L, their Hamming distance is defined as H_dist(v1, v2) = Σ_{i=1}^{L} v1[i] ⊕ v2[i].

Definition 2. Hamming Ball Coverset. Given an integer R, a binary vector v and a vector set V_b, the Hamming Ball Coverset of v relative to R is denoted as BC_R(v) = {v_i | v_i ∈ V_b and H_dist(v, v_i) ≤ R}, where R is the radius.

The intuitive meaning of the Hamming Ball Coverset is the set of all binary vectors in V_b whose Hamming distance to v is less than or equal to R.

Obtaining the Binary Vector. The procedure using random projection to generate the binary codes is illustrated in Algorithm 1. First a random matrix R^{d×l} is generated; its entries are chosen from N(0, 1) and each column is normalized, i.e., Σ_{i=1}^{d} r_ij^2 = 1 for j = 1, ..., l. For each vector v of the set V, the inner products with each column vector of the matrix R^{d×l} are evaluated, so the signature of each v ∈ V is obtained as in Eq. 5. After projecting all the objects, a signature matrix S ∈ {0,1}^{n×l} is obtained, in which each row represents an object vector and each column an LSH signature bit. In order to access the signatures easily, an inverted list of the binary vectors (ILBV) is built on the LSH signatures; each signature vector v_b of the corresponding object is a binary vector in {0,1}^l. Finally, the inverted list of binary vectors is returned.
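The signature generation just described and Definitions 1-2 translate directly into code. Below is a small NumPy sketch (ours; the function names are hypothetical, and the binary vectors are assumed to be 0/1 arrays):

import numpy as np

def build_signatures(X, l, seed=0):
    # Vectorized signature generation: X is the (n, d) matrix of object
    # vectors; returns the signature matrix S in {0,1}^{n x l}.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, l))       # entries drawn from N(0, 1)
    R /= np.linalg.norm(R, axis=0)        # normalize columns: sum_i r_ij^2 = 1
    return (X @ R >= 0).astype(np.uint8)  # bit j of row v is sgn(r_j . v)

def hamming_dist(v1, v2):
    # Definition 1: the number of bit positions where v1 and v2 differ.
    return int(np.sum(v1 != v2))

def hamming_ball_coverset(v, V_b, R):
    # Definition 2: indices of all vectors in V_b within Hamming radius R of v.
    return [i for i, vb in enumerate(V_b) if hamming_dist(v, vb) <= R]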

Algorithm 1. Random Projection of LSH
Input: object vector set V, V ⊂ R^d.
Output: inverted list of binary vectors (ILBV).
1: ILBV = ∅;
2: generate a normalized random matrix R^{d×l};
3: foreach v ∈ V do
4:   v_b = ∅;
5:   foreach column vector r of R^{d×l} do
6:     h = r · v;
7:     val = sgn(h);
8:     v_b.append(val);
9:   ILBV.add(v_b);
10: return ILBV;

Training Classifiers. The support vector machine (SVM) is a classic classification technique in the data mining field. In this paper, SVM classifiers are trained with the labeled data vectors. Given the labeled training data (x_i, y_i), i = 1, ..., n, where x_i ∈ R^d and y_i ∈ {0, 1}^l, the solution of the SVM is formalized as the following optimization problem:

min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^{n} ξ_i    (6)
subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, ..., n.

The learning procedure after obtaining the binary labels is presented in Figure 1(b); the x-axis represents the binary vectors of the LSH signatures, and the y-axis represents the data object set (the label {0,1} can be transformed to {-1,1} plainly). As illustrated in Figure 1(b), a column of the binary vectors is used as the class labels of the objects, and the first SVM classifier is trained on the first column. Since the length of the signature vector is l, l SVM classifiers can be trained in the same manner. The classifier training procedure used in RPL is described in Algorithm 2. For each column of the signature matrix, an SVM classifier is trained (lines 1-4); in this manner, l classifiers are trained. Finally, these l classifiers, denoted (w_j, b_j) for j = 1, ..., l, are returned (line 5).

Processing Query. The procedure for processing the approximate k-NN query is summarized in Algorithm 3. The algorithm consists of two stages: filter and refine. In the filter stage, the binary vector of the query is obtained with the l SVM classifiers (lines 4-6). After that, the Hamming distances between the query and each binary vector in the ILBV are evaluated (lines 7-9) and the distances are sorted (line 10). The objects whose Hamming distance is larger than R are filtered out in this stage. In the refine stage, the Euclidean distances are evaluated on the candidate set that falls in the Hamming Ball Coverset with radius R (lines 11-14). The top-k results are returned after sorting (lines 15-16).
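The paper trains its classifiers with an SVM Java package and a linear kernel; the following is a minimal scikit-learn sketch of the per-bit training and prediction (ours; LinearSVC is our substitution for the paper's linear-kernel SVM, and the function names are hypothetical):

import numpy as np
from sklearn.svm import LinearSVC

def train_bit_classifiers(X, S, C=1.0):
    # One linear SVM per signature bit: column j of the signature matrix S
    # supplies the {0,1} labels (assumes each column contains both classes).
    classifiers = []
    for j in range(S.shape[1]):
        clf = LinearSVC(C=C)
        clf.fit(X, S[:, j])
        classifiers.append(clf)
    return classifiers

def predict_query_bits(classifiers, q):
    # Predict the l-bit binary code of a query vector q.
    q_row = np.asarray(q).reshape(1, -1)
    return np.array([clf.predict(q_row)[0] for clf in classifiers],
                    dtype=np.uint8)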

Algorithm 2. SVM Training of RPL
Input: object vectors v_i ∈ V, i = 1, ..., n, V ⊂ R^d; ILBV.
Output: l SVM classifiers.
1: for (j = 1; j ≤ l; j++) do
2:   foreach v_i ∈ V do
3:     v_b[i] = get(ILBV, v_i);
4:   SVMTrain(v_i, v_b[i][j]);
5: return (w_j, b_j), j = 1, ..., l;

Algorithm 3. Approximate k-NN Query Algorithm
Input: query q; l SVM classifiers; ILBV; Hamming ball radius R; integer k.
Output: top-k result list.
1: HammingResult = ∅;
2: Result = ∅;
3: Vector q_v = new Vector(); // the query binary vector
4: for (i = 1; i ≤ l; i++) do
5:   b = SVMPredict(q, svm_i);
6:   q_v = q_v.append(b);
7: foreach v_b ∈ ILBV do
8:   H_dist = HammingBallDist(v_b, q_v);
9:   HammingResult = HammingResult ∪ H_dist;
10: sort(HammingResult);
11: select the vectors Ṽ = BC_R(q_v);
12: foreach v ∈ Ṽ do
13:   distance = dist(v, q);
14:   Result = Result ∪ distance;
15: sort(Result);
16: return top-k result list;

2.4 Complexity Analysis

Suppose the object vectors are in R^d and the length of the binary vector after random projection is l. Generating the matrix R^{d×l} takes O(dl) time and O(dl) space. Let z be the number of non-zero values per data object. Training the l SVM classifiers takes O(lzn) time or even less [8]. For query processing, Algorithm 3 predicts the l bits with the l SVM classifiers in O(lzn log n) time. Let the size of the Hamming Ball Coverset be |BC_R(q)| = C; evaluating the candidates against the query takes O(Cl). The sorting step takes O(n log n). Therefore, the query complexity is O(lzn log n + Cl + n log n). The time and space complexity of the preprocessing step of STH is O(n^2 + n^2 k + lnkt + ln) and O(n^2) respectively [7]. Thus the time complexity of our preprocessing, O(dl), is much lower than O(n^2 + n^2 k + lnkt + ln), and its space complexity O(dl) is likewise much lower than O(n^2), since l is much smaller than n.
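To make the filter-and-refine flow of Algorithm 3 concrete, here is a hedged end-to-end sketch (ours) that reuses predict_query_bits and hamming_dist from the sketches above; it filters by brute force over the stored binary vectors rather than through the paper's inverted list:

import numpy as np

def approx_knn_query(q, classifiers, binary_vectors, data, R, k):
    # Filter stage: predict the query's l bits and keep only the objects
    # whose signatures lie within Hamming radius R (the ball coverset).
    q_bits = predict_query_bits(classifiers, q)
    candidates = [i for i, vb in enumerate(binary_vectors)
                  if hamming_dist(q_bits, vb) <= R]
    # Refine stage: exact Euclidean distance on the surviving candidates only.
    scored = sorted((float(np.linalg.norm(data[i] - q)), i) for i in candidates)
    return scored[:k]   # top-k (distance, object index) pairs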

3 Experimental Study

3.1 Experimental Setup

All the algorithms are implemented in Java SDK 1.6 and run on an Ubuntu 10.04 PC with an Intel Core 2 Duo CPU and 4 GB of main memory. The SVM Java package [9] is used, with the kernel function configured as the linear kernel and the other configurations left at their defaults. The following two datasets are used.

WebKB [10] contains web pages collected from the computer science departments of various universities by the World Wide Knowledge Base project and is widely used for text mining. It includes 283 pages for training, mainly classified into 5 classes: faculty, staff, department, course and project.

Reuters-21578 [11] is a collection of news documents used for text mining, which includes 21578 documents and 7 topics. Due to the space limit when solving the k-NN matrix problem, we randomly select 283 documents covering 8 classes: acq, crude, earn, grain, interest, money-fx, ship and trade.

The vectors are generated with the TF-IDF vector model [12] after removing stopwords and stemming. In total, 5 queries are issued on each dataset and the average results are reported. The statistics of the datasets are summarized in Table 1.

Table 1. Statistics of the datasets

Dataset         Document No.   Vector Length   Class No.
WebKB           283            -               5
Reuters-21578   283            -               8

For evaluating effectiveness, the k-precision metric is used, defined as k-precision = |ann_k ∩ enn_k| / k, where ann_k is the result list of the approximate k-NN query and enn_k is the k-NN result list obtained by linear scanning.

3.2 Effectiveness

To evaluate the effectiveness of the approximate k-NN query algorithm, we test the k-precision while varying the bit length L and the radius R. In this experiment, the parameter k of the k-precision is set to 4. The binary length L varies from 4 to 128, and the radius R is selected between 1 and 12. The results on the two datasets are presented in Figure 2. From Figure 2, we can observe that: (1) for a fixed radius R, the precision drops quickly as L increases; for example, when R = 8, the k-precision drops from 1.0 to 0.2 as the length L varies up to 128 on the two datasets; (2) for a given length L, increasing the radius improves the precision, rising from about 0.1 to 1.0 as the radius R varies from 1 to 12. The reason is that the bigger the radius R, the more candidates survive the filtering, and thus the higher the precision.
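The k-precision metric above is straightforward to compute; a one-function sketch (ours), assuming the two result lists hold object identifiers:

def k_precision(ann_k, enn_k):
    # |ann_k intersect enn_k| / k: overlap between the approximate
    # and the exact (linear-scan) k-NN result lists.
    k = len(enn_k)
    return len(set(ann_k) & set(enn_k)) / k

# For example, k_precision([3, 7, 9, 2], [3, 2, 8, 5]) returns 0.5.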

Fig. 2. k-precision on the two datasets, varying bit length L, with curves for R = 1, 2, 4, 8, 12: (a) WebKB, (b) Reuters.

Fig. 3. Query efficiency (time in ms) on the two datasets, varying bit length L, with curves for R = 1, 2, 4, 8, 12: (a) WebKB, (b) Reuters.

3.3 Efficiency

To evaluate the efficiency of RPL, the approximate k-NN query processing time is measured while varying the radius R and the bit length L. The results are shown in Figure 3. In this experiment, the radius R varies from 1 to 12, and the bit length is selected between 4 and 30. From the performance evaluation, we can conclude that: (1) as the bit length L increases, the query processing time drops sharply, because there are fewer candidates; (2) the bigger the radius, the higher the query processing time, since there are more candidates. In all cases it is much smaller than a linear scan, which takes about 737 and 264 ms on the WebKB and Reuters datasets respectively.

3.4 Comparison

Comparisons with STH [7] are made in terms of both effectiveness and efficiency. STH takes about 9645 and 76 seconds on the WebKB and Reuters datasets respectively for preprocessing, i.e., for evaluating the k-NN graph and the eigenvectors of the matrix. The preprocessing of our method is much cheaper, taking only several milliseconds, since only a random matrix needs to be generated. In this comparison, the parameter k of the k-NN graph used in STH is set to 25. The effectiveness comparison with STH is shown in Figure 4. The radius R is set to 3, and the parameter k to 3 and 10. The k-precision is reported for varying bit length L on the two datasets. The experimental results show that the precision of RPL is a bit lower

than that of STH, because STH uses the k-NN graph of the dataset, which takes the data relationships into account.

Fig. 4. Precision comparison of RPL and STH on WebKB and Reuters: (a) 3-precision, (b) 10-precision.

The efficiency comparison with STH is conducted and the results are shown in Figure 5. In this experiment, the bit length varies from 4 to 16, and the results show the query processing time for varying radius R and bit length L. The results demonstrate that the performance of our method is better than that of STH, which indicates a better filtering capability.

Fig. 5. Performance comparison with different R: query time (ms) of RPL and STH on WebKB and Reuters.

4 Related Work

The similarity search problem is well studied in low-dimensional spaces with space-partitioning and data-partitioning index techniques [13-15]. However, the efficiency

degrades once the dimensionality grows beyond about 10 [16]. Therefore, researchers have proposed techniques that process the problem approximately. Approximate search methods improve efficiency by relaxing the precision requirement. MEDRANK [17] solves nearest neighbor search approximately using sorted lists and the TA algorithm [18]. Yao et al. [19] study the approximate k-NN query in relational databases. However, both MEDRANK and [19] focus on much lower-dimensional data.

Locality sensitive hashing is a well-known and effective technique for approximate nearest neighbor search [1, 3], which ensures that the more similar two data objects are, the higher the probability that they are hashed into the same bucket. A near neighbor of the query can then be found in the candidate bucket with high probability.

Space filling techniques convert high-dimensional data into one dimension while preserving similarity. One kind of space filling curve is the Z-order curve, which is built by connecting the z-values of the points [20]; the k-NN search for a query can thus be translated into a range query on the z-values. Another space filling technique is the Hilbert curve [21]. Based on the Z-order curve and LSH, Tao et al. [4] propose the LSB-tree for fast approximate nearest neighbor search in high-dimensional space.

The random projection method of LSH is designed for approximately evaluating the similarity between vectors. Recently, CompactProjection [5] employed it for content-based image similarity search.

Recently, data-aware hashing techniques have been proposed which demand many fewer bits by using machine learning [6, 7, 22]. Semantic hashing [6] employs the Hamming distance of compact binary codes for semantic similarity search. Self-taught hashing [7] proposes a two-stage learning method for similarity search with much shorter bit length. Nevertheless, its preprocessing stage takes too much time and space for evaluating and storing the k-NN graph and solving the Laplacian Eigenmap problem; it is not suitable for evolving data, and it does not scale well to large data volumes. Motivated by the above research, the random projection learning method is proposed, which is proved to satisfy the entropy maximizing criterion nicely.

5 Conclusions

In this paper, a learning-based framework for similarity search is studied. In this framework, the data vectors are first randomly projected into binary codes, and the binary codes are then employed as labels. SVM classifiers are trained to predict the binary code of the query. We prove that the binary code after random projection satisfies the entropy maximizing criterion. The approximate k-NN query is effectively evaluated within this framework. Experimental results show that our method achieves better performance compared with the existing technique. Though the query effectiveness is a bit lower than that of STH, RPL has much lower time and space complexity in the preprocessing step and better query performance, which is very important in data-intensive environments.

Acknowledgments. This work is supported by NSFC grants (No. and No. 6934), the 973 program (No. 2CB3286), the Shanghai International Cooperation

Fund Project (Project No. ), the Program for New Century Excellent Talents in University (No. NCET--388) and the Shanghai Leading Academic Discipline Project (No. B42).

References

1. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp. 604-613 (1998)
2. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380-388 (2002)
3. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: FOCS, pp. 459-468 (2006)
4. Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Quality and efficiency in high dimensional nearest neighbor search. In: SIGMOD, pp. 563-576 (2009)
5. Min, K., Yang, L., Wright, J., Wu, L., Hua, X.S., Ma, Y.: Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements. In: CVPR (2010)
6. Salakhutdinov, R., Hinton, G.: Semantic Hashing. International Journal of Approximate Reasoning 50(7), 969-978 (2009)
7. Zhang, D., Wang, J., Cai, D., Lu, J.: Self-taught hashing for fast similarity search. In: SIGIR, pp. 18-25 (2010)
8. Joachims, T.: Training linear SVMs in linear time. In: SIGKDD, pp. 217-226 (2006)
9. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
10. World Wide Knowledge Base project, http://www.cs.cmu.edu/~webkb/
11. Reuters-21578 (1999), http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
12. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
13. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509-517 (1975)
14. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD, pp. 47-57 (1984)
15. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-tree: an efficient and robust access method for points and rectangles. In: SIGMOD, pp. 322-331 (1990)
16. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, pp. 194-205 (1998)
17. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. In: SIGMOD, pp. 301-312 (2003)
18. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66(4), 614-656 (2003)
19. Yao, B., Li, F., Kumar, P.: k-nearest neighbor queries and kNN-joins in large relational databases (almost) for free. In: ICDE, pp. 4-15 (2010)
20. Ramsak, F., Markl, V., Fenk, R., Zirkel, M., Elhardt, K., Bayer, R.: Integrating the UB-tree into a database system kernel. In: VLDB, pp. 263-272 (2000)
21. Liao, S., Lopez, M., Leutenegger, S.: High dimensional similarity search with space filling curves. In: ICDE, pp. 615-622 (2001)
22. Baluja, S., Covell, M.: Learning to hash: forgiving hash functions and applications. Data Mining and Knowledge Discovery 17(3), 402-430 (2008)
23. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS 21, pp. 1753-1760 (2009)

Appendix: Theoretical Proof

The entropy H of a discrete random variable X with possible values x_1, ..., x_n is defined as H(X) = Σ_{i=1}^{n} p(x_i) I(x_i) = -Σ_{i=1}^{n} p(x_i) log p(x_i), where I(x_i) is the self-information of x_i. For a binary random variable X, assume that P(X = 1) = p and P(X = 0) = 1 - p; the entropy of X can then be represented as Eq. 7:

H(X) = -p log p - (1 - p) log(1 - p)    (7)

When p = 1/2, Eq. 7 attains its maximum value.

Semantic hashing [6] is an effective learning-based solution for similarity search. To ensure search efficiency, an elegant semantic hashing should satisfy the entropy maximizing criterion [6, 7]. The intuitive meaning of the entropy maximizing criterion is that the dataset is represented uniformly by each bit, thus maximizing the information carried by each bit, i.e., the bits are uncorrelated and each bit is expected to fire 50% of the time [23]. Thus, we propose the following property for semantic hashing.

Property 1. Semantic hashing should satisfy the entropy maximizing criterion to ensure efficiency, i.e., each bit occurs with 50% probability and the bits are uncorrelated.

In the following, we prove that the binary code after the random projection of LSH naturally satisfies the entropy maximizing criterion. This is the key difference between our method and STH [7] in the binary code generation step; the latter requires considerable space and time.

Let R^d be a d-dimensional real data space and let u ∈ R^d be normalized to a unit vector, i.e., Σ_{i=1}^{d} u_i^2 = 1. Suppose r ∈ R^d is a random vector whose i-th component r_i is chosen randomly from the standard normal distribution N(0, 1), i = 1, ..., d. Let v = u · r; v is a random variable. Then the following lemma holds.

Lemma 1. Let the random variable v = u · r. Then v ~ N(0, 1).

Proof. Since v = u · r, we have v = Σ_{i=1}^{d} u_i r_i. Since each r_i is drawn randomly and independently from the standard normal distribution N(0, 1), we have u_i r_i ~ N(0, u_i^2). Thus, by the properties of the standard normal distribution, the random variable v ~ N(0, Σ_{i=1}^{d} u_i^2) = N(0, 1). Hence the lemma is proved.

From Lemma 1, the following corollary can be derived, which means that the bits of the binary code after random projection are uncorrelated.
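Lemma 1 (and Lemma 2 below) can also be checked empirically; here is a quick Monte Carlo sanity check in NumPy (ours, not part of the paper):

import numpy as np

rng = np.random.default_rng(0)
d, trials = 64, 100_000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                 # unit vector, so sum_i u_i^2 = 1
r = rng.standard_normal((trials, d))   # each row is a random vector r ~ N(0, I)
v = r @ u                              # one sample of v = u . r per row
print(v.mean(), v.var())               # approximately 0 and 1 (Lemma 1)
print((v >= 0).mean())                 # approximately 0.5 (Lemma 2)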

Corollary 1. The bits of the binary code after random projection are independent.

Lemma 2. Let f(x) = sgn(x) be a function defined on the real set R, and let v = u · r be a random variable. Set the random variable v' = f(v); then P(v' = 1) = P(v' = 0) = 1/2.

The proof of Lemma 2 is omitted here due to the space limit. Lemma 2 means that 1 and 0 occur with the same probability in the signature vector. Hence, on the basis of Corollary 1 and Lemma 2, we have the following corollary.

Corollary 2. The binary code after random projection satisfies the entropy maximizing criterion.


More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

More information

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015 Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

Enhancing Quality of Data using Data Mining Method

Enhancing Quality of Data using Data Mining Method JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2, ISSN 25-967 WWW.JOURNALOFCOMPUTING.ORG 9 Enhancing Quality of Data using Data Mining Method Fatemeh Ghorbanpour A., Mir M. Pedram, Kambiz Badie, Mohammad

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm

Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm Prepared by: Yacine ghanjaoui Supervised by: Dr. Hachim Haddouti March 24, 2003 Abstract The indexing techniques in multidimensional

More information

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

More information

Large Scale Learning to Rank

Large Scale Learning to Rank Large Scale Learning to Rank D. Sculley Google, Inc. dsculley@google.com Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing

More information