Efficient and Scalable Method for Processing Top-k Spatial Boolean Queries
|
|
|
- Virgil Reynolds
- 10 years ago
- Views:
Transcription
1 Efficient and Scalable Method for Processing Top-k Spatial Boolean Queries Ariel Cary 1, Ouri Wolfson 2, and Naphtali Rishe 1 1 School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA {acary001,rishen}@cis.fiu.edu 2 Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA [email protected] Abstract. In this paper, we present a novel method to efficiently process top-k spatial queries with conjunctive Boolean constraints on textual content. Our method combines an R-tree with an inverted index by the inclusion of spatial references in posting lists. The result is a diskresident, dual-index data structure that is used to proactively prune the search space. R-tree nodes are visited in best-first order. A node entry is placed in the priority queue if there exists at least one object that satisfies the Boolean condition in the subtree pointed by the entry; otherwise, the subtree is not further explored. We show via extensive experimentation with real spatial databases that our method has increased performance over alternate techniques while scaling to large number of objects. Keywords: Geographic databases, spatial keyword query, scalability. 1 Introduction Today s Internet applications typically offer users the ability to associate geographical information to Web content, a process known as geotagging. For example, Wikipedia has standardized geotagging of their encyclopedia articles and images via templates [6]. Furthermore, technological advances in digital cameras and mobile phones allow users to acquire and associate geospatial coordinates, via built-in GPS devices or Wi-Fi triangulation, to media resources. Additionally, Web content can be automatically paired with geographical coordinates, for instance, exploiting content features, such as place names or street addresses, in combination with gazetteers. Thus, the powerful combination of Internet applications, GPS-enabled devices, and automatic geotagging can potentially generate large amounts of georeferenced content. On the structured end, spatial databases usually contain rich textual descriptions, stored in non-spatial attributes. For example, a database of property parcels may store property s owner name, description, and street address in addition to its coordinates. Research was partially supported by NSF DGE , IIS , and IIS M. Gertz and B. Ludäscher (Eds.): SSDBM 2010, LNCS 6187, pp , c Springer-Verlag Berlin Heidelberg 2010
2 88 A. Cary, O. Wolfson, and N. Rishe A key problem recently tackled by the academia and industry is spatial searches with text constraints in geographical collections [3] [7] [11] [9] [10]. For example, in a database of parcels, we may be interested in finding nearby houses to Miami Beach (spatial constraint) that have backyard and are located on Collins Avenue (text constraint). Typically, query keywords are assumed to be conjunctively connected. That is, records containing all query keywords are retrieved. In the general case, text constraints may involve complex combinations of keywords with logical connectives beyond the conjunctive semantics. For instance, in the database of parcels, fire fighters traveling in a truck may want to quickly determine the nearest parcels that have swimming pool and are not located in buildings for water replenishment in an emergency. As geospatial collections increase in size, the demand of efficient processing of spatial queries with text constraints becomes more prevalent. In this paper, we propose a method for efficiently processing top-k nearest neighbor queries with text constraints where keywords are combined with the three basic Boolean operators AND, OR, andnot. Our method uses an R-tree to guide the spatial search and an inverted file for text content retrieval, which are combined in a novel hybrid spatial keyword index. The specific contributions of this paper are: 1. We define a top-k spatial Boolean (k-sb) query that finds nearest neighbor objects satisfying Boolean constraints on keywords combined with conjunctive ( ), disjunctive ( ), and complement ( ) logical operators. 2. We propose a novel hybrid Spatial-Keyword Index (SKI ) to efficiently process k-sb queries. A salient feature of SKI is that it only searches subspaces that do contain objects satisfying the query Boolean predicate. 3. We execute extensive experimentation on an implementation of our method over large spatial databases. Experimental results show that the proposed method has excellent performance and scalability. Section 2 discusses related work to our research. Section 3 formally defines the problem. Section 4 presents the proposed hybrid indexing approach and query processing algorithms. Experimental study on an implementation of our hybrid index is conducted in Section 5. Section 6 presents our concluding remarks. 2 Related Work The R-tree traversal method in our work is inspired in Hjaltason and Samet s [5] incremental top-k nearest neighbor algorithm using R-trees [1]. Performance improvements on the original R-tree work have been proposed, e.g. R*-tree [13], R+-tree [14], and Hilbert R-tree [15]. Any of these variants can replace the R- tree index used in the proposed hybrid spatial keyword index without modifying our search algorithms. In information retrieval, inverted files are arguably the most efficient index structure for free-text search [2] [12]. The problem of retrieving spatial objects satisfying non-spatial constraints has been studied in the recent past. Park and Kim [10] proposed RS-trees, a combination of R-trees and signature trees for attributes with controlled cardinality;
3 Top-k Spatial Boolean Queries 89 signature chopping is suggested to mitigate combinatorial errors [8] (database overrepresentation) of superimposed signatures. Harinharan et al. [9] proposed to include a list of terms in every node of an R-tree. De Felipe et al. [11] augmented signature files in R-tree nodes with similar constraints as [10]. Recently, Cong et al. [3] augmented an inverted file in every node of an R-tree, and used a ranking function that combines spatial proximity and text relevancy. Our work differs in that we assume distance as ranking score, and we focus on efficiently processing Boolean constraints on textual data. Further, none of the previous works offer efficient processing of the complement logical operator, which limits their applicability to the k-sb queries we considered in this work. Likewise, modern Web search engines, like Google and Yahoo!, offer Local Search services. Advanced querying options are provided to include and exclude certain terms from the search result. These are similar to the k-sb queries we consider. However, specific search algorithms are kept confidential by their owning companies. 3 Problem Definition A spatial database D = {o 1,o 2,..., o N } is a set of objects such that every o D has a pair of attributes <p,t >,where:p E is a point in a metric space E with distance dist(p 1,p 2 ), and T = {t 1,t 2,...} isadocumentasasetofterms. A top-k spatial Boolean (k-sb) queryq is a triple <l,k,b >,where:l E is the query location (spatial constraint), k is the desired output size, and B is the conjunctive Boolean predicate (text constraint). B is a set of keywords prefixed with Boolean operators {,, }, conjunctively connected as follows: [ B = (A = {a 1,a 2,...}) (C = {c 1,c 2,...}) ] (G = {g 1,g 2,...}) (1) A (AND-semantics), C (OR-semantics), G (NOT-semantics) are subsets of terms prefixed with,, and, respectively. An object o D satisfies B if: [( a A : o.t a ) ( c C : o.t c ) ( g G : o.t g = )] (2) The result of the k-sb query Q is the list: L = {o i D, i =1,..., n L o i satisfies B n L k}, such that: o (D \ L) :[dist(o.p, l) arg max r L dist(r.p, l) (o satisfiesb)] (3) Objects in L are sorted by distance to l in non-decreasing order. In other words, a k-sb query Q returns the k nearest neighbor objects to the query location l that satisfy the conjunctive Boolean predicate B. In this work, we assume E is the Euclidean space. The problem is how to efficiently compute L. Example: In database D 1 of Table 1, the query Find top-10 houses nearby Miami that have masterbed with bathtub, have a pool or backyard, and are not located in a building translates to the following k-sb query: Q 1 ={Miami,10, [ (masterbed, bathtub) (pool, backyard) (building)]} and retrieves L 1 = {o 3,o 8 }.
4 90 A. Cary, O. Wolfson, and N. Rishe Table 1. Property parcel database D 1 = {o 1,o 2,..., o 12}. For every textual term (t), the list of objects containing t is shown. Term Object List Term Object List backyard (t 1) {o 2,o 3,o 6,o 8} collins (t 4) {o 2,o 6,o 10} bathtub (t 2) {o 3,o 5,o 8,o 9} masterbed (t 5) {o 3,o 8,o 11} building (t 3) {o 1,o 5,o 7,o 12} miami (t 6) {o 1,o 3,o 4,o 10} 4 Hybrid Spatial-Keyword Indexing In designing the hybrid index, we pursue the following objectives. First, we want to attain fast retrieval even when matching objects are located far away from one another. Second, we want to efficiently filter objects not satisfying the query Boolean constraints on keywords. A key challenge is to perform small number of computations to eliminate as many non-candidate objects as possible. In particular, NOT-semantics constraints may substantially shrink the output size and lead to unnecessary scans. Third, we want to maintain low storage requirements while keeping high query performance. With these objectives in mind, our indexing approach leverages the strengths of R-trees [1] in spatial search, and modifies an inverted file [2] for efficient processing of Boolean constraints. The combination of indexing techniques yields the hybrid data structure: Spatial-Keyword Index (SKI). We next introduce two important definitions in SKI. Definition 1. Given an R-tree R with fanout m, a super node s is the list of m leaf (level 1) nodes that have the same parent node. The universe of super nodes of R is S(R) =[s 1,s 2,...], wheres 1 references the left-most leaf nodes of R. Definition 2. The term bitmap of term t at super node s is a fixed length bit sequence I(t, s) of size m 2, where the i th bit is computed as follows: { 1 if s[i] points to object o : t o.t I(t, s)[i] = (4) 0 otherwise For an R-tree with L levels, a super node s contains O(m) leaf nodes, or equivalently O(m 2 ) object pointers, and S(R) = O(m (L 2) )forl>1. A single-level R-tree has no super nodes. Figure 1 shows super node s 1 of an R-tree built on D 1, and term bitmap for miami keyword at s Spatial Keyword Index The hybrid spatial keyword index (SKI) is composed of two building blocks: a) R-tree Index (R): A modified R-tree built with spatial attributes of D. Entries in R s inner nodes are augmented with index ranges [a, b], where s a and s b are the left-most and right-most, respectively, super nodes contained in the subtree rooted at node entry. Ranges in leaf-node entries contain a single value, the index of the super node containing the leaf node.
5 Top-k Spatial Boolean Queries 91 Level 2 N6... Super Node s 1 Level 1 (leaves) I("miami", s ) = 1 N1 N2 O O O O O O null Term Bitmap Fig. 1. Super node s 1 composed of leaf nodes [N 1,N 2], and term bitmap for miami b) Spatial Inverted File (SIF): A modified inverted file constructed on the vocabulary V = { o D o.t }. The Lexicon contains terms in V and their document frequencies (df ). Posting lists are modified to include spatial information from R. Specifically, the posting list of term t contains all its term bitmaps (rather than documents) sorted by super node index as follows: Posting(t) =[I(t, s 1 ),I(t, s 2 ),...] where s i S(R) (5) Efficiency Considerations. We organize posting elements in a B+tree to allow fast random and range retrieval. Keys are <t,i>pairs while values are bitmaps I(t, s i ). In order to reduce storage requirements, we compress I(t, s i )usingthe Word-Aligned Hybrid bitmap compression method (WAH )[4].WAH method allows fast bitwise computations with logical operators AN D, OR, and N OT on uncompressed bitmaps, which is capitalized during query processing. Figure 2 shows R and SIF structures for database D 1 in Table 1. Level 3 Level 2 Super node ranges N6 [1,2] R: R-tree N8 [1,1] [2,2] SIF: Spatial Inverted File... t N7 [ I(t,s ), I(t,s ) ] references Level 1 (leaves) N1 N2 null N3 N4 N5 s 1 s 2 Fig. 2. Hybrid Spatial-Keyword Index for database D 1 in Table 1
6 92 A. Cary, O. Wolfson, and N. Rishe 4.2 Processing k-sb Queries In order to process query Q =< l,k,b >,R-treeR is traversed from the root node following the best-first traversal algorithm proposed in [5]. That is, node entries are visited in order of proximity of their minimum bounding rectangles (MBR)tolocationl. A node entry e is placed in a priority queue, with priority equal to dist(mbr,l), if at least one object within e s subtree satisfies Boolean predicate B. The SIF structure is used to qualify e. A powerful feature of the previous filter is that unnecessary subtree traversals are eliminated altogether. Algorithm 1 shows the steps involved in processing k-sb queries using SKI. The algorithm starts by resorting query keywords by documents frequency (line 1) in such a way that as many object candidates as possible are eliminated with few posting list merges. For instance, infrequent terms have large number of 0s in their term bitmaps, and possibly short Posting() lists, which are adequate to be processed first for AND-semantics terms. In line 2, the priority queue, result list, and a globally accessible hash map M are initialized. M caches merged term bitmaps of previously evaluated super nodes during query execution. Next, R is traversed in best-first order starting from its root in lines 4 9. A node entry e is evaluated w.r.t. B by the function issubtreecandidate (line 8). Only when e s subtree has at least one object that satisfies B is it pushed into the queue. Algorithm 1. Process k-sb Query Input. k-sb query Q =<l,k,b > Output. A list of objects satisfying Q (see Equation 3) begin 1 Sort term subsets in B by document frequency (df ) as follows: A (AND) in ascending order, and C, G (OR, NOT ) in descending order 2 Initialize: priority Queue R.root; list L ; global hash map M 3 r 0 /* number of B-satisfying objects retrieved so far */ 4 while (Queue and r<k) do 5 Node n Queue.pop() 6 if (n is obj. pointer) then r r +1 L.add(getObject(D, n)) /* retrieve o D pointed by n */ 7 else for (every entry e in node n) do 8 if (issubtreecandidate(b,n, [e s position in n])) then 9 Queue.push(e.ptr) with priority dist(e.mbr, l) 10 return L issubtreecandidate function, described in Algorithm 2, evaluates B predicate by merging query term bitmaps on a range of super nodes, one super node at a time, until one candidate is found (lines 4 9). This processing style is similar to Document-At-A-Time processing in inverted files [2], except that postings are not exhausted. Logical bitwise operations are performed on term bitmaps
7 Top-k Spatial Boolean Queries 93 Algorithm 2. issubtreecandidate Input. B: query predicate; n: node; i: positional index Output. True if o that satisfies B within subtree at n[i], false otherwise begin 1 if (n is leaf node) then 2 if (The i-th bit in M(n[i].a) is set) then return true 3 else return false 4 else for (j n[i].a to n[i].b) do 5 pe t B.A I(t, j) /* execute bitwise operations */ 6 pe pe [ t B.C I(t, j)] /* on term bitmaps over */ 7 pe pe [ t B.G flip(i(t, j))] /* super node range in n[i] */ 8 if (cardinality(pe) > 0) then M.add(key = j, value = pe) 9 return true 10 return false (lines 5 7) according to term semantics. Complement operator requires term bitmaps to be flipped (converting 1s into 0s and vice versa), which is accomplished by the flip function (line 7). Next, if the merged bitmap has at least one bit set (line 8), meaning there is a candidate, then it is cached in M (line 8), and the function returns true. Otherwise,B is evaluated at the next super node in the range until a candidate is found, or the range is exhausted. In the latter worst case, the subtree is discarded in its entirety. Since a super node references O(m 2 ) objects, a range [a, b] can potentially filter out O(m 2 a b ) objects. The I/O cost is remarkably only O( B log( V ) a b ), where B is the number of query terms, V the vocabulary size, and log( V ) thecostofterm bitmap retrievals from a B+tree (see Section 4.1). 5 Experiments We conducted a series of querying experiments with three real spatial datasets explained in Table 2. Records contain geographical coordinates, and between 30 and 80 text attributes (concatenated in a term set). SKI was implemented in Java, and experiments ran on an Intel Xeon E GHz machine with 8GB of RAM. We measured average number of random I/Os and response times in processing k-sb queries and compared performance with two baselines: Baseline 1 (IFC). An inverted file containing object coordinates in addition to object pointers. Queries are processed in two phases. First, term postings are merged according to B semantics. Second, satisfying objects are sorted by proximity to query location. The top-k objects in the sorted list are returned. Baseline 2 (RIF). An R-tree with every node augmented with an inverted file on keywords within its subtree. This baseline is inspired by arts [3] [9]. At query
8 94 A. Cary, O. Wolfson, and N. Rishe time, R-tree nodes are visited in best-first order w.r.t. spatial attributes. B is evaluated with the inverted file at every node, except for NOT-semantics terms. Workload. Every vocabulary was sorted by document frequency (df ) and divided in three quantiles: S: Termswithdf < 1 quantile (infrequent terms), M: Terms with df < 2 quantile, L: Termswithdf < 3 quantile (entire vocabulary). In each quantile, k-sb queries were composed by randomly picking between 3 and 8termsandprefixingthemwith{,, } operators to form B. k was fixed to SKI RIF IFC 10 1 SKI RIF IFC YP(L) RD(L) YP(L) RD(L) FL(L) YP(M) RD(M) FL(M) YP(S) RD(S) FL(S) a) Average random I/O requests FL(L) YP(M) RD(M) FL(M) YP(S) RD(S) FL(S) b) Average elapsed time (seconds) Fig. 3. Performance metrics on 50 k-sb query runs. Y-axis is in logarithmic scale. Figure 3 shows the average number of I/Os andelapsedtimeover50k-sb queries of each workload type {S, M, L} for every dataset. IFC shows performance advantage when query terms are relatively infrequent (S). Short posting lists can be quickly evaluated to compute query result, whereas SKI and RIF spend additional R-tree traversals. When query terms become more frequent (M and L), IFC incurrs in expensive long posting list merges, which is observed in peaks of Figure 3.a for L queries. RIF performs acceptably for S queries but degrades for M and L queries. This may be due to filtering limitations in R- tree upper levels. Eventually, subtrees known (via inverted file) to contain query terms are traversed. However, terms may belong to different objects, i.e. no single object satisfies B predicate. In the same vein, RIF must wait until objects are retrieved to apply NOT-semantics filters, which can also degrade its performance. In summary, we observed consistent enhanced retrieval performance using the proposed hybrid indexing and query processing methods. Table 2. Experimental spatial datasets. Dataset and vocabulary sizes are in millions. D D V Subject FL Property parcels in the Florida state. YP Yellow pages of businesses in the United States. RD Road segments in the United States.
9 Top-k Spatial Boolean Queries 95 6 Conclusions In this paper, we proposed a disk-resident hybrid index for efficiently answering k-nn queries with Boolean constraints on textual content. We combined modified versions of R-trees and inverted files to achieve effective pruning of the search space. Our experimental study showed increased performance and scalability on large, 10M and 20M sized, spatial datasets over alternate methods. Acknowledgment. This research was supported in part by NSF grants IIS , CNS , HRD , IIP , IIP , DGE , IIS , and IIS References 1. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD, pp ACM, New York (1984) 2. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38(2) (2006) 3. Cong, G., Jensen, C.S., Wu, D.: Efficient retrieval of the top-k most relevant spatial web objects. Proc. VLDB Endow. 2(1), (2009) 4. Wu, K., Otoo, E.J., Shoshani, A.: Optimizing bitmap indices with efficient compression. ACM Trans. Database Syst. 31(1), 1 38 (2006) 5. Hjaltason, G., Samet, H.: Distance browsing in spatial databases. ACM Trans. Database Syst. 24(2), (1999) 6. WikiProject on geographical coordinates, Wikipedia:WikiProject_Geographical_coordinates 7. Xin, D., Han, J.: P-Cube: Answering preference queries in multi-dimensional space. In: ICDE, pp IEEE Computer Society, Los Alamitos (2008) 8. Chang, W.W., Schek, H.J.: A signature access method for the Starburst database system. In: VLDB, pp (1989) 9. Hariharan, R., Hore, B., Li, C., Mehrotra, S.: Processing spatial-keyword (SK) queries in geographic information retrieval (GIR) systems. In: SSDBM, p. 16 (2007) 10. Park, D.J., Kim, H.J.: An enhanced technique for k-nearest neighbor queries with non-spatial selection predicates. Multimedia Tools and Apps., (2004) 11. De Felipe, I., Hristidis, V., Rishe, N.: Keyword search on spatial databases. In: ICDE, pp IEEE Computer Society, Los Alamitos (2008) 12. Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23(4), (1998) 13. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-tree: An efficient and robust access method for points and rectangles. In: SIGMOD, pp (1990) 14. Sellis, T.K., Roussopoulos, N., Faloutsos, C.: The R+-tree: A dynamic index for multi-dimensional objects. In: VLDB, pp (1987) 15. Kamel, I., Faloutsos, C.: Hilbert R-tree: An improved R-tree using fractals. In: VLDB, pp (1994)
Survey On: Nearest Neighbour Search With Keywords In Spatial Databases
Survey On: Nearest Neighbour Search With Keywords In Spatial Databases SayaliBorse 1, Prof. P. M. Chawan 2, Prof. VishwanathChikaraddi 3, Prof. Manish Jansari 4 P.G. Student, Dept. of Computer Engineering&
International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 11, November 2015 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants
R-Trees: A Dynamic Index Structure For Spatial Searching A. Guttman R-trees Generalization of B+-trees to higher dimensions Disk-based index structure Occupancy guarantee Multiple search paths Insertions
Processing Spatial-Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems
Processing Spatial-Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems Ramaswamy Hariharan, Bijit Hore, Chen Li, Sharad Mehrotra Donald Bren School of Information and Computer Sciences
QuickDB Yet YetAnother Database Management System?
QuickDB Yet YetAnother Database Management System? Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Department of Computer Science, FEECS,
Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments
Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments Yaokai Feng a, Akifumi Makinouchi b a Faculty of Information Science and Electrical Engineering, Kyushu University,
Data Warehousing und Data Mining
Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data
Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases
Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases Alexander Grebhahn [email protected] Reimar Schröter [email protected] David Broneske [email protected]
Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices
Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil
Efficient Parallel Block-Max WAND Algorithm
Efficient Parallel Block-Max WAND Algorithm Oscar Rojas, Veronica Gil-Costa 2,3, and Mauricio Marin,2 DIINF, University of Santiago, Chile 2 Yahoo! Labs Santiago, Chile 3 CONICET,University of San Luis,
KEYWORD SEARCH IN RELATIONAL DATABASES
KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Caching Dynamic Skyline Queries
Caching Dynamic Skyline Queries D. Sacharidis 1, P. Bouros 1, T. Sellis 1,2 1 National Technical University of Athens 2 Institute for Management of Information Systems R.C. Athena Outline Introduction
Big Data and Scripting. Part 4: Memory Hierarchies
1, Big Data and Scripting Part 4: Memory Hierarchies 2, Model and Definitions memory size: M machine words total storage (on disk) of N elements (N is very large) disk size unlimited (for our considerations)
KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
Predictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 [email protected] John Langford Yahoo! Research New York, NY 10018 [email protected] Alex Strehl Yahoo! Research New York,
Storage Management for Files of Dynamic Records
Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. [email protected] Alistair Moffat Department of Computer Science
Chapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse
Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Zainab Qays Abdulhadi* * Ministry of Higher Education & Scientific Research Baghdad, Iraq Zhang Zuping Hamed Ibrahim Housien**
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
Inverted Indexes: Trading Precision for Efficiency
Inverted Indexes: Trading Precision for Efficiency Yufei Tao KAIST April 1, 2013 After compression, an inverted index is often small enough to fit in memory. This benefits query processing because it avoids
Index Terms Domain name, Firewall, Packet, Phishing, URL.
BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet
Binary Coded Web Access Pattern Tree in Education Domain
Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: [email protected] M. Moorthi
Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables
Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables 1 M.Naveena, 2 S.Sangeetha 1 M.E-CSE, 2 AP-CSE V.S.B. Engineering College, Karur, Tamilnadu, India. 1 [email protected],
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 [email protected] 2 [email protected] Abstract A vast amount of assorted
Experimental Comparison of Set Intersection Algorithms for Inverted Indexing
ITAT 213 Proceedings, CEUR Workshop Proceedings Vol. 13, pp. 58 64 http://ceur-ws.org/vol-13, Series ISSN 1613-73, c 213 V. Boža Experimental Comparison of Set Intersection Algorithms for Inverted Indexing
Speed Performance Improvement of Vehicle Blob Tracking System
Speed Performance Improvement of Vehicle Blob Tracking System Sung Chun Lee and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA [email protected], [email protected] Abstract. A speed
CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions
CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly
Indexing the Trajectories of Moving Objects in Networks
Indexing the Trajectories of Moving Objects in Networks Victor Teixeira de Almeida Ralf Hartmut Güting Praktische Informatik IV Fernuniversität Hagen, D-5884 Hagen, Germany {victor.almeida, rhg}@fernuni-hagen.de
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE SK MD OBAIDULLAH Department of Computer Science & Engineering, Aliah University, Saltlake, Sector-V, Kol-900091, West Bengal, India [email protected]
Bitlist: New Full-text Index for Low Space Cost and Efficient Keyword Search
Bitlist: New Full-text Index for Low Space Cost and Efficient Keyword Search Weixiong Rao [1,4] Lei Chen [2] Pan Hui [2,3] Sasu Tarkoma [4] [1] School of Software Engineering [2] Department of Comp. Sci.
Cluster Analysis for Optimal Indexing
Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Cluster Analysis for Optimal Indexing Tim Wylie, Michael A. Schuh, John Sheppard, and Rafal A.
Binary Heap Algorithms
CS Data Structures and Algorithms Lecture Slides Wednesday, April 5, 2009 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks [email protected] 2005 2009 Glenn G. Chappell
IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS
Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE
XML Data Integration Based on Content and Structure Similarity Using Keys
XML Data Integration Based on Content and Structure Similarity Using Keys Waraporn Viyanon 1, Sanjay K. Madria 1, and Sourav S. Bhowmick 2 1 Department of Computer Science, Missouri University of Science
Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles
B-Trees Algorithms and data structures for external memory as opposed to the main memory B-Trees Previous Lectures Height balanced binary search trees: AVL trees, red-black trees. Multiway search trees:
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
PartJoin: An Efficient Storage and Query Execution for Data Warehouses
PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE [email protected] 2
Caching XML Data on Mobile Web Clients
Caching XML Data on Mobile Web Clients Stefan Böttcher, Adelhard Türling University of Paderborn, Faculty 5 (Computer Science, Electrical Engineering & Mathematics) Fürstenallee 11, D-33102 Paderborn,
Oracle8i Spatial: Experiences with Extensible Databases
Oracle8i Spatial: Experiences with Extensible Databases Siva Ravada and Jayant Sharma Spatial Products Division Oracle Corporation One Oracle Drive Nashua NH-03062 {sravada,jsharma}@us.oracle.com 1 Introduction
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
Benchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
MUTI-KEYWORD SEARCH WITH PRESERVING PRIVACY OVER ENCRYPTED DATA IN THE CLOUD
MUTI-KEYWORD SEARCH WITH PRESERVING PRIVACY OVER ENCRYPTED DATA IN THE CLOUD A.Shanthi 1, M. Purushotham Reddy 2, G.Rama Subba Reddy 3 1 M.tech Scholar (CSE), 2 Asst.professor, Dept. of CSE, Vignana Bharathi
Colored Range Searching on Internal Memory
Colored Range Searching on Internal Memory Haritha Bellam, Saladi Rahul, and Krishnan Rajan Lab for Spatial Informatics, IIIT-Hyderabad, Hyderabad, India Univerity of Minnesota, Minneapolis, MN, USA Abstract.
Mining Mobile Group Patterns: A Trajectory-Based Approach
Mining Mobile Group Patterns: A Trajectory-Based Approach San-Yih Hwang, Ying-Han Liu, Jeng-Kuen Chiu, and Ee-Peng Lim Department of Information Management National Sun Yat-Sen University, Kaohsiung, Taiwan
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
Databases and Information Systems 1 Part 3: Storage Structures and Indices
bases and Information Systems 1 Part 3: Storage Structures and Indices Prof. Dr. Stefan Böttcher Fakultät EIM, Institut für Informatik Universität Paderborn WS 2009 / 2010 Contents: - database buffer -
An Efficient Multi-Keyword Ranked Secure Search On Crypto Drive With Privacy Retaining
An Efficient Multi-Keyword Ranked Secure Search On Crypto Drive With Privacy Retaining 1 B.Sahaya Emelda and 2 Mrs. P. Maria Jesi M.E.,Ph.D., 1 PG Student and 2 Associate Professor, Department of Computer
D B M G Data Base and Data Mining Group of Politecnico di Torino
Database Management Data Base and Data Mining Group of [email protected] A.A. 2014-2015 Optimizer objective A SQL statement can be executed in many different ways The query optimizer determines
Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis
Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of
Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries
Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries Kale Sarika Prakash 1, P. M. Joe Prathap 2 1 Research Scholar, Department of Computer Science and Engineering, St. Peters
Load Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
Fuzzy Duplicate Detection on XML Data
Fuzzy Duplicate Detection on XML Data Melanie Weis Humboldt-Universität zu Berlin Unter den Linden 6, Berlin, Germany [email protected] Abstract XML is popular for data exchange and data publishing
Sorting Hierarchical Data in External Memory for Archiving
Sorting Hierarchical Data in External Memory for Archiving Ioannis Koltsidas School of Informatics University of Edinburgh [email protected] Heiko Müller School of Informatics University of Edinburgh
External Memory Geometric Data Structures
External Memory Geometric Data Structures Lars Arge Department of Computer Science University of Aarhus and Duke University Augues 24, 2005 1 Introduction Many modern applications store and process datasets
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
DATABASE DESIGN - 1DL400
DATABASE DESIGN - 1DL400 Spring 2015 A course on modern database systems!! http://www.it.uu.se/research/group/udbl/kurser/dbii_vt15/ Kjell Orsborn! Uppsala Database Laboratory! Department of Information
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Efficient Processing of Joins on Set-valued Attributes
Efficient Processing of Joins on Set-valued Attributes Nikos Mamoulis Department of Computer Science and Information Systems University of Hong Kong Pokfulam Road Hong Kong [email protected] Abstract Object-oriented
Efficiently Identifying Inclusion Dependencies in RDBMS
Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany [email protected]
Chapter 2 The Information Retrieval Process
Chapter 2 The Information Retrieval Process Abstract What does an information retrieval system look like from a bird s eye perspective? How can a set of documents be processed by a system to make sense
Data Mining and Database Systems: Where is the Intersection?
Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: [email protected] 1 Introduction The promise of decision support systems is to exploit enterprise
External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13
External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing
Merkle Hash Trees for Distributed Audit Logs
Merkle Hash Trees for Distributed Audit Logs Subject proposed by Karthikeyan Bhargavan [email protected] April 7, 2015 Modern distributed systems spread their databases across a large number
Performance Evaluation of Main-Memory R-tree Variants
Performance Evaluation of Main-Memory R-tree Variants Sangyong Hwang 1, Keunjoo Kwon 1, Sang K. Cha 1, Byung S. Lee 2 1. Seoul National University {syhwang, icdi, chask}@kdb.snu.ac.kr 2. University of
Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets
Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets Pramod S. Reader, Information Technology, M.P.Christian College of Engineering, Bhilai,C.G. INDIA.
Secure and Faster NN Queries on Outsourced Metric Data Assets Renuka Bandi #1, Madhu Babu Ch *2
Secure and Faster NN Queries on Outsourced Metric Data Assets Renuka Bandi #1, Madhu Babu Ch *2 #1 M.Tech, CSE, BVRIT, Hyderabad, Andhra Pradesh, India # Professor, Department of CSE, BVRIT, Hyderabad,
