Data Extraction from Structured Databases using Keyword-based Queries

Mariana Soller Ramada, João Carlos da Silva, Plínio de Sá Leitão-Júnior

Instituto de Informática, Universidade Federal de Goiás (UFG)
Caixa Postal Goiânia GO Brazil
{mariana,jcs,plinio}@inf.ufg.br

Abstract. Relational databases are used to store a large quantity of data scattered around the world. However, users face difficulties in accessing such data for lack of a more natural way of specifying queries. Techniques that use natural language words to search different information sources on the Web are now very common, but they cannot be applied directly to relational databases. This work proposes a method that allows users to submit keyword-based queries, which are then semantically analysed and enriched before being mapped to the database query language. When analysing keyword-based queries, the method considers different factors, e.g. the proximity between keywords, query segmentation, and the use of aggregate functions.

1. Introduction

Using keywords to retrieve information is a simple search technique. The last decade has witnessed the growing use of this technique, which has in fact become a standard for user interaction with the World Wide Web (WWW). However, it cannot be applied to every kind of data store; relational databases, for instance, hold a vast amount of valuable information but do not support it. Querying a relational database requires prior knowledge of its storage structures and of the syntax of a structured language such as SQL. The majority of users do not have such knowledge, which limits access to the stored data.

In the last few years, great efforts have been made in research and development to extend keyword-based search to data sources that follow the relational paradigm. Nevertheless, the existing techniques reveal some drawbacks.
The first drawback is that, according to such approaches, every keyword plays a given role in the database and each keyword is mapped to a corresponding database structure. Consider the schema of the relation Employee (Id, Name, Address, Salary, Super_id, Department_id). The query "higher salary" expects a statistic, the higher salary, as a result, not a set of interconnected tuples containing the keywords "higher" and "salary".

The second drawback is that the query is segmented in such a way that each keyword represents a single role in the database. Referring once again to the relation Employee, a possible interpretation of the query "employee Houston Tx" is the employee who lives at the address Houston Tx. Therefore, the keywords "Houston" and "Tx" are expected to be mapped together to the attribute Address of the table Employee, instead of each being mapped separately to a database structure.

The third drawback is that several studies fail to consider the interdependence between keywords. Even though a query is made up of a simple list of keywords, the meaning of each keyword is not independent from the meaning of the others; together, they all represent the concepts intended by the user when creating a query.

This paper focuses on the semantic approach to keyword queries. The drawbacks listed above are considered, and solutions are provided and implemented. A new keyword-based search method for relational databases was defined. Given a keyword query, the method proposed converts it into corresponding SQL queries, all of which are submitted to the underlying database. An SQL query is a structured query in which tables, attributes, and their conditions are accurately specified, whereas a keyword query comprises imprecise terms that express the user's need for information. Therefore, this study introduces semantics into this conversion process, to provide a clearer idea of the meaning intended by the query and to construct SQL expressions that represent the user's real intent, returning results in order of relevance. The Keymantic approach [Bergamaschi et al. 2010] was chosen as a starting point for meeting the aims of this study, because it deals partially with the third drawback mentioned above.

The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 shows the architecture of the keyword query method proposed, as well as the way it operates. Section 4 explains how the Keymantic system works, and Sections 5 and 6 present the modifications proposed and their impacts. Section 7 provides some conclusions.

2. Related Works

The literature reveals two main approaches to keyword queries in relational databases: one based on Candidate Networks and another based on Steiner Trees. Both conceive a database as a network of interconnected tuples and focus on detecting tuples that contain the keywords of a given query. Query processing returns connected components based on the way these tuples are associated. DBXplorer [Agrawal et al.
2002] and DISCOVER [Hristidis and Papakonstantinou 2002] implement the Candidate Networks approach, whereas BANKS [Aditya et al. 2002] applies the Steiner Tree approach. All of these systems exhibit the three drawbacks mentioned in Section 1.

To address the second drawback, the FRISK system [Pu and Yu 2009] uses a dynamic programming algorithm to compute the query's best segmentations and then presents them to the user, who in turn chooses the one that best suits his/her intent. Keymantic [Bergamaschi et al. 2010] and Keyword++ [Ganti et al. 2010] address the third drawback by taking into account the query's ambiguity and seeking completeness and accuracy of results. Completeness consists of returning all relevant results, whereas accuracy refers to returning the most relevant results, those which match user intent.

A query may return more than one result, in which case it becomes necessary to order them according to their relevance. Ranking functions assign a score to each result and then order the results according to these scores. Some researchers have employed Information Retrieval metrics for calculating ranking functions, such as Luo et al. [Luo et al. 2008] and Hristidis et al. [Hristidis et al. 2003]. In DISCOVER [Hristidis and Papakonstantinou 2002] and DBXplorer [Agrawal et al. 2002], results are ranked by simple methods, e.g. based on the number of joins. In Labrador [Mesquita et al. 2007], a ranking is computed using a Bayesian network model. As for BANKS [Aditya et al. 2002], calculating result scores takes into account the edge weights of the data graph.

3. Architecture

The method proposed converts a keyword query into corresponding SQL queries, which are submitted to the underlying database. In response to a keyword query, the method returns the results obtained by running the SQL queries, listed in order of relevance. Figure 1 shows an overview of the method's architecture.

Figure 1. Architecture of the method proposed

3.1. Preprocessing

This stage is responsible for identifying keywords in the query that do not have a direct meaning in the database structure but carry other forms of semantics, e.g. the use of aggregate functions and data sorting. In addition, this stage determines the query's best segmentation with regard to values comprising more than one keyword.

Verification of Aggregate Function/Sorting

This stage identifies keywords which express the intention of using aggregate functions and sorting. Such keywords are identified through a list of reserved words, which are predefined words based on the way users produce their queries. Seven groups were created and a set of keywords was defined for each of them. Each set consists of synonyms obtained by integrating a thesaurus as a semantic resource. Listed below are the groups and their respective sets:

maximum = synonym(higher)
minimum = synonym(lower)
mean = synonym(average)
sum = synonym(total)
count = synonym(quantity)
grouping = synonym(for each)
order = synonym(sorted)
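
The reserved-word lookup over these groups can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the synonym sets are hand-picked stand-ins for the thesaurus-derived sets, and the names RESERVED_WORDS, WORD_TO_OP, and find_operations are hypothetical. The sketch also applies the method's rule that an operation binds to the keyword that immediately follows the reserved word.

```python
# Hypothetical reserved-word groups. The paper derives each set from a
# thesaurus; these few synonyms are hand-picked for illustration only.
RESERVED_WORDS = {
    "MAX": {"higher", "highest", "maximum"},
    "MIN": {"lower", "lowest", "minimum"},
    "AVG": {"average", "mean"},
    "SUM": {"total", "sum"},
    "COUNT": {"quantity", "count"},
    "GROUP BY": {"for each"},
    "ORDER BY": {"sorted", "ordered"},
}

# Reverse index: reserved word or phrase -> operation name.
WORD_TO_OP = {word: op for op, words in RESERVED_WORDS.items() for word in words}


def find_operations(keywords):
    """Split a keyword list into (operations, remaining keywords).

    Each operation is an (operation, target) pair, where the target is the
    keyword immediately after the reserved word (None if nothing follows).
    The target stays in the remaining keywords so it can still be mapped.
    """
    operations, remaining = [], []
    i = 0
    while i < len(keywords):
        op = None
        # Try a two-word reserved phrase first (e.g. "for each"), then one word.
        for width in (2, 1):
            window = keywords[i:i + width]
            if len(window) == width and " ".join(window).lower() in WORD_TO_OP:
                op = WORD_TO_OP[" ".join(window).lower()]
                i += width
                break
        if op is None:
            remaining.append(keywords[i])
            i += 1
        else:
            operations.append((op, keywords[i] if i < len(keywords) else None))
    return operations, remaining


ops, rest = find_operations(["higher", "salary"])
print(ops, rest)
```

In this sketch the reserved word is removed from the list of keywords passed on to mapping, so that it is not forced onto a database term, while its target keyword is kept for the mapping stage.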

After identifying which keyword suggests the use of an aggregate function or sorting, it is necessary to establish which attribute the function will be applied to. Users tend to create queries in which related words are close to one another. Therefore, once a keyword that suggests the use of an aggregate function or sorting is identified, the function it suggests is applied to the term represented by the keyword that immediately follows it.

Query Segmentation

In general, search engines support the use of delimiters to group multiple words into a single concept, since many concepts are represented by a phrase rather than by a single word. Google and Yahoo! are examples of search engines that offer a syntax for phrase searching. In the method proposed, keywords that must be taken together to form a single attribute value are identified by enclosing them in single inverted commas. For example, the query employee 'Houston Tx' treats Houston Tx as one value.

Query Processing

Once a keyword query has been submitted, it is necessary to create several SQL queries that will be run against the underlying database. This is done by mapping the keywords to database terms. Once the SQL queries have been run, results are returned to the user in order of relevance.

Mapping

Answering a keyword query over a relational database requires understanding the meaning of each keyword and constructing an SQL query that provides a coherent interpretation of the original query. It is necessary to map the keywords to database structures, e.g. relations, attributes, and attribute values. Several techniques and tools have been proposed to solve the problem of keyword queries in relational databases, as was pointed out in Section 2. However, most of these proposals fail to consider the many interpretations which a keyword query may have.
The mapping process proposed in this paper is based on the semantic analysis implemented by Keymantic, which explores the relative positions (order) of keywords within the query, as well as the database schema and other auxiliary external sources. The Keymantic approach, on which the present work is grounded, is described in Section 4.

Execution and Ranking of Results

Each SQL query generated in the previous stage is now run against the corresponding database, and the results of these queries are the results of the keyword query. The results produced are not equally significant, as some of them represent the semantics intended by the query more effectively. It is therefore useful to rank the results. Since the Keymantic system computes a score for each generated SQL query, the results returned to the user may be ordered first by the score of their corresponding SQL query and second by the size of the join path, i.e. the number of joins required to create the SQL query.
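
The two-level ordering just described can be sketched as below. The Interpretation class, the rank function, and the sample SQL strings and scores are hypothetical. The sketch also assumes that a higher score ranks first and that, among equal scores, fewer joins rank first; the text states the two criteria but not the direction of the second one.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    sql: str      # generated SQL query
    score: float  # score of the configuration the query came from
    joins: int    # number of joins in the query's join path


def rank(interpretations):
    """Order by score (descending), then by join-path size (ascending, assumed)."""
    return sorted(interpretations, key=lambda q: (-q.score, q.joins))


# Hypothetical interpretations of one keyword query.
candidates = [
    Interpretation("SELECT ... FROM Employee e JOIN Department d ON ...", 150.0, 1),
    Interpretation("SELECT ... FROM Employee", 150.0, 0),
    Interpretation("SELECT ... FROM Project p JOIN Works_on w ON ... JOIN Employee e ON ...", 180.0, 2),
]
for q in rank(candidates):
    print(q.score, q.joins, q.sql)
```

Because Python's sorted is stable, interpretations that tie on both criteria keep the order in which they were generated.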

4. Semantic Query Analysis

Keymantic explores the relative positions of query keywords, together with external sources, to produce a more accurate assumption about the semantics represented by the query. This is grounded on the fact that the meaning of each keyword is not independent from the meaning of the others; all of them collectively represent the concepts the user had in mind when creating the query. Moreover, not all keywords represent instance values. Many are used as metadata of adjacent keywords.

The mapping process performed by Keymantic comprises five stages. A special data structure, known as the weight matrix, is used during this process. The value of a cell represents the weight of the mapping between a keyword and a database term. Two sub-matrices may be distinguished in the weight matrix. The first, called SW, corresponds to the database terms related to schema elements, i.e. relations and attributes. The second, called VW, corresponds to attribute values, i.e. elements which belong to attribute domains.

The first step is Intrinsic Weight Computation. The relevance between each query keyword and each database term is calculated by exploring and combining a number of similarity techniques. In the second step, Selection of Best Mappings to Schema Terms, a series of mappings is generated based on the intrinsic weights of sub-matrix SW. Each mapping associates a certain number of keywords with database schema terms. Keywords that remain unmapped are considered at a later stage, during value term mapping. Only mappings that reach the highest score are selected. In the third step, Contextualization of VW and Selection of Best Mappings to Value Terms, the unmapped keywords are mapped based on each partial mapping generated in the previous stage. In the fourth step, Generation of the Configurations, a total mapping of keywords to database terms forms a configuration.
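
To make the weight matrix and the notion of a configuration concrete, the sketch below scores every one-to-one assignment of keywords to database terms by summing the selected cells, which is how Keymantic scores a configuration, and keeps the best ones. It is a brute-force stand-in and not Keymantic's staged search: it handles schema and value terms in a single pass and omits contextualization. The function name best_configurations, the dom(...) labels for value terms, and all weights are hypothetical.

```python
from itertools import permutations


def best_configurations(keywords, terms, weights):
    """Return (best_score, configurations) over all one-to-one mappings.

    weights maps (keyword, term) pairs to weight-matrix cells; a missing
    cell counts as 0. A configuration maps every keyword to a distinct term.
    """
    best_score, best = float("-inf"), []
    # permutations() never repeats a term, so no two keywords share a term.
    for chosen in permutations(terms, len(keywords)):
        score = sum(weights.get((k, t), 0) for k, t in zip(keywords, chosen))
        if score > best_score:
            best_score, best = score, [dict(zip(keywords, chosen))]
        elif score == best_score:
            best.append(dict(zip(keywords, chosen)))
    return best_score, best


# Hypothetical weights for the query "employee houston" over relation Employee.
terms = ["Employee", "Employee.Name", "dom(Employee.Name)", "dom(Employee.Address)"]
weights = {
    ("employee", "Employee"): 100,
    ("employee", "Employee.Name"): 20,
    ("houston", "dom(Employee.Name)"): 40,
    ("houston", "dom(Employee.Address)"): 60,
}
score, configs = best_configurations(["employee", "houston"], terms, weights)
print(score, configs)
```

Enumerating all assignments grows factorially with the number of terms, so this form is only suitable for illustrating the scoring rule on a toy schema.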
The configuration's score is the sum of the weights in the weight-matrix elements [i, j], where i is a keyword and j is the database term to which the keyword was mapped. Finally, once the best configurations have been computed, the interpretations of the keyword query, i.e. SQL queries, are generated in the fifth step, Generation of the Interpretations. The score of each SQL query is that of its respective configuration. A configuration is simply a mapping of the keywords to database terms. The presence of different join paths between these terms leads to multiple interpretations.

5. Contributions to Semantic Analysis

Keymantic is grounded on the assumption that every keyword plays a role in a query, i.e. each keyword represents a term within the database. As each keyword must represent a database term, it is not possible to interpret queries which suggest the use of aggregate functions or sorting. Even if a query contains, for instance, the keyword "higher", the system will map it to a database term instead of interpreting it as an indicator of the aggregate function max.

Additionally, Keymantic regards the mapping process as an injective function from the set of keywords (the domain) to the set of database terms (the image): no database term corresponds to more than one keyword. In other words, two keywords cannot be mapped to the same database term. However, as mentioned in Section 1, a value may be composed of more than one keyword. In that case, it is necessary to map all the keywords which compose this value to the same attribute domain.

Even though Keymantic computes weights for each interpretation, it does not provide a ranking of the obtained results. The paper which describes this tool suggests that a ranking based on each interpretation's score and on the size of the join path may be carried out.

To make it feasible to map query keywords to aggregate functions, sorting, or composite values, as well as to allow results to be ranked, the method proposed in this paper implements the functionalities of the modules Verification of Aggregate Function/Sorting, Query Segmentation, and Ranking of Results shown in Figure 1, all of which are external to Keymantic. In addition, some modifications within Keymantic, particularly during Mapping, were performed to improve the quality of the returned results. These modifications concern the stages Intrinsic Weight Computation of Schema Database Terms, Intrinsic Weight Computation of Value Database Terms, and the contextualization process.

Use of Synonyms as Keywords

During Intrinsic Weight Computation of Schema Database Terms, Keymantic employs a series of techniques that measure similarity between keywords and database terms, selecting the one which produces the best result. One of these techniques is string similarity, for which Keymantic employs a number of different similarity metrics such as Jaccard, Hamming, Levenshtein, etc. The tool also assesses the semantic relationship between a given keyword and a database schema term, and for that it uses ontologies and dictionaries. For each measuring technique (string similarity, semantic relationship, etc.), the similarity between a keyword and a database term is calculated in the interval from 0 to 1. The highest value returned is then multiplied by 100 and selected as the intrinsic weight. If none of the values returned is higher than a predefined threshold, then the weight is set to 0.

The method proposed here regards, at first, only the string similarity technique. Three similarity metrics are used for this technique: Jaccard, Levenshtein, and Cosine. Each metric takes two strings as input and returns the similarity between them in the interval from 0 to 1. The intrinsic weight of a keyword and a database term is based on the average of the similarities returned by the three metrics. If the average is greater than the threshold, it is multiplied by 100 and selected as the intrinsic weight. If not, the similarity between the synonyms of the keyword and the database term is calculated. This involves using the WordNet dictionary to obtain the synonyms of each query keyword. The average of the similarities returned by the three metrics is computed for each synonym found, and the greatest average is taken into account. If this value is greater than the threshold, it is multiplied by 90 and selected as the intrinsic weight. If it is lower than the threshold, the intrinsic weight is set to 0.

Synonyms are consulted only as a fallback because users normally expect the keywords of the query itself to appear in the results, so the original keywords must be given greater relevance. In other words, the existence of the keyword in the database must have greater weight than the existence of its synonym.
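
The modified intrinsic weight computation can be sketched as follows. Several choices are assumptions made only for illustration: Jaccard and Cosine are computed over character bigrams, Levenshtein distance is normalised by the longer string's length, the threshold defaults to 0.5, and synonyms come from a plain dictionary that stands in for the WordNet lookup. The text does not specify any of these, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    """Character bigrams of a lower-cased string (the string itself if too short)."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)] or [s]


def jaccard(a, b):
    x, y = set(bigrams(a)), set(bigrams(b))
    return len(x & y) / len(x | y)


def levenshtein_similarity(a, b):
    """1 - edit distance / longer length, so identical strings score 1."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - prev[-1] / longest


def cosine(a, b):
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(x[g] * y[g] for g in x)
    norms = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norms


def average_similarity(a, b):
    return (jaccard(a, b) + levenshtein_similarity(a, b) + cosine(a, b)) / 3


def intrinsic_weight(keyword, term, synonyms, threshold=0.5):
    """Average similarity x 100 for the keyword itself, x 90 for its best synonym."""
    direct = average_similarity(keyword, term)
    if direct > threshold:
        return direct * 100
    candidates = synonyms.get(keyword.lower(), [])
    best = max((average_similarity(s, term) for s in candidates), default=0.0)
    return best * 90 if best > threshold else 0.0


# Hypothetical synonym table standing in for a WordNet lookup.
synonyms = {"wage": ["salary", "pay"]}
print(intrinsic_weight("salary", "Salary", synonyms))
print(intrinsic_weight("wage", "Salary", synonyms))
```

The factor of 90 caps a synonym match below an equally good direct match, which is the preference for original keywords that the text argues for.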

Weight normalization for sub-matrix VW

The step Intrinsic Weight Computation of Value Database Terms was modified because Keymantic computes value term weights in binary form. Whereas the weights of sub-matrix SW, which refer to schema terms, lie in the interval from 0 to 100, the weights of sub-matrix VW are either 0 or 1. As a consequence, keywords mapped to value terms contribute very little to the total configuration score. The intrinsic weights of value terms must therefore be made comparable with those of schema terms. Hence, in the method proposed, after the intrinsic weights of the value terms have been calculated, weights that have value 1 are multiplied by a predefined constant.

Proximity between keywords

The last modification performed within Keymantic took place during contextualization. Keymantic takes into account the interdependence between query keywords, in which mapping a given keyword may increase or reduce the probability that another keyword, as yet unmapped, corresponds to a given database term. In addition to considering the interdependences between keywords, the method proposed takes into account their relative positions in the query, based on their proximity. The farther a mapped keyword is from an unmapped one, the smaller its influence on that keyword's mapping. This led to the following modification during contextualization: instead of adding a constant to the intrinsic weights, the value added depends on the distance between the keywords, decreasing as the distance grows.

6. Comparison with Keymantic

This section describes how the set of results presented to the user changes after the proposed modifications are applied to the Keymantic method. Internal modifications were performed to add further semantics to the mapping process, in order to generate a smaller but more relevant set of interpretations in view of user intent.
Precision and MRR metrics were used for this analysis. Using a company database, both systems were implemented and compared based on the results returned by each after running a set of keyword queries. Table 1 shows the set of keyword queries selected from a query pool, along with the semantics intended by the user.

Table 1. Set of keyword queries

#   Keyword query                            Intended semantics
1   department                               List of information regarding the company's departments
2   department employee                      List of each department's employees
3   department project                       List of each department's projects
4   department project newbenefits           Information of the department responsible for the Newbenefits project
5   employee dependent daughter              Information regarding employees who have a daughter as a dependent
6   project name employee John               Name of the project for which employee John works
7   employee address project 1               Address of employee who works in project 1
8   hours employee works project productz    Number of working hours of each employee in project ProductZ
9   project location employee salary         Location of project for which the employee who earns a salary works
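
The two metrics named at the start of this section can be sketched as below, using their standard definitions: precision is the fraction of returned interpretations that are relevant, and the reciprocal rank of a query is 1 divided by the position of its first relevant result. The function names and the relevance judgements in the example are hypothetical and are not the paper's data.

```python
def precision(relevance):
    """Fraction of returned interpretations judged relevant (0.0 if none returned)."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def reciprocal_rank(relevance):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1 / position
    return 0.0


def mean_reciprocal_rank(per_query_relevance):
    """Average reciprocal rank over a set of queries."""
    if not per_query_relevance:
        return 0.0
    return sum(reciprocal_rank(r) for r in per_query_relevance) / len(per_query_relevance)


# Hypothetical judgements: one list per query, in ranked order.
judgements = [[False, True], [True], [False, False]]
print(precision([True, False, True, False]))
print(mean_reciprocal_rank(judgements))
```

Each inner list holds one query's results in the order shown to the user, so MRR is sensitive to the ranking stage while precision is not.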

Figure 2 compares the number of configurations and interpretations (SQL queries) generated by both systems. Values are shown for each of the queries, which are identified as in Table 1. Internal modifications, performed during the first three stages of Keymantic's mapping process, influenced the results obtained by our method, leading to lower values compared with Keymantic as regards the number of configurations and interpretations. This reduction has a major impact on the cost of the process and allows the user to deal with fewer results for his/her query.

Figure 2. Number of configurations and interpretations generated by both systems: (a) number of configurations; (b) number of interpretations

The graphs in Figure 3 explore queries with three or more keywords and present each system's relative accuracy (precision). This metric gives the relevant fraction of the obtained interpretations, and is a quality indicator commonly described in the literature. As Subfigure 3(a) shows, the method proposed was more effective: despite the reduction in the number of interpretations generated, the resulting set is more relevant to the query. Subfigure 3(b) shows accuracy values by query size, i.e. the number of keywords comprising the query. Accuracy decreases in both systems as the number of query keywords increases. However, the method proposed once more achieves better accuracy values than the original method.

Figure 3. Accuracy metric for both systems, panels (a) and (b)

As regards the external stage Ranking of Results, Keymantic is known to show results in the order in which they are generated, with no guarantee that they will be presented in descending score order.

The Mean Reciprocal Rank (MRR) metric was used to assess the stage Ranking of Results. This metric measures how near the top the first relevant result ranks within the set of results. Figure 4 shows the MRR for both systems based on query length. In Keymantic, the longer the query, the farther from the top of the ranking the first relevant result appears. In the method proposed, even though the first relevant result moves away from the top as the query becomes longer, it remains one of the top-ranked results.

Figure 4. MRR for both systems (inspired by [Fakhraee and Fotouhi 2012])

Threats to Validity

Three limitations were identified with regard to the results obtained with the proposed method. The first limitation is the cardinality of the set of queries from which the results were obtained. A more extensive set of queries could lead to a higher confidence level when comparing results with Keymantic. The second limitation is how the queries were obtained: they were designed for the analysis of the proposed method rather than taken from real systems. The last limitation is that the proposed method was compared solely with Keymantic. A stronger argument could be made in favor of the proposed approach if its performance were compared with other existing state-of-the-art tools that allow searching relational databases using keyword queries.

7. Conclusion

This paper proposed a method for querying relational databases with keywords to simplify access to these data, given that such queries use natural language words. The method considered factors such as query segmentation, aggregate functions and sorting, as well as the user's intended semantics when creating the query. The semantic analysis of keyword queries was based on the approach provided by the Keymantic tool.
Other resources were added to Keymantic's original proposal, such as the possibility of dealing with queries using aggregate functions and values comprising more than one keyword. To offer new querying possibilities, the method proposed implemented new external functionalities and made internal improvements to the tool.

During the experiments, it became clear that the internal modifications produced a smaller and more significant set of results for the keyword query submitted, whereas the external modifications allowed the specification of queries that had until then not been considered, such as those that employ aggregate functions, sorting, and values comprising more than one keyword.

In the course of this research, we identified some aspects that could be extended to further the discussion on this topic. Among them is the use of ontologies. During the intrinsic weight computation of schema terms, Keymantic uses ontologies; adopting them in addition to the synonyms obtained from WordNet would add greater semantics to this stage. Another aspect to be considered is more effective query segmentation. In the method proposed, composite values are identified via the use of single inverted commas, which requires the user to know in advance how to construct the query in a suitable way.

References

Aditya, B., Bhalotia, G., Chakrabarti, S., Hulgeri, A., Nakhe, C., Parag, and Sudarshan, S. (2002). Banks: Browsing and keyword searching in relational databases. In VLDB '02: Proceedings of the 28th Intl. Conference on Very Large Databases. Morgan Kaufmann, San Francisco.

Agrawal, S., Chaudhuri, S., and Das, G. (2002). Dbxplorer: a system for keyword-based search over relational databases. In Data Engineering, Proceedings. 18th Intl. Conference on.

Bergamaschi, S., Domnori, E., Guerra, F., Orsini, M., Lado, R. T., and Velegrakis, Y. (2010). Keymantic: semantic keyword-based searching in data integration systems. Proc. VLDB Endow., 3.

Fakhraee, S. and Fotouhi, F. (2012). Dbsemsxplorer: semantic-based keyword search system over relational databases for knowledge discovery. In Proceedings of the Third Intl. Workshop on Keyword Search on Structured Data, KEYS '12, pages 54-62, New York, NY, USA. ACM.

Ganti, V., He, Y., and Xin, D. (2010). Keyword++: a framework to improve keyword search over entity databases. Proc. VLDB Endow., 3.

Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient ir-style keyword search over relational databases. In Proceedings of the 29th Intl. Conference on Very Large Data Bases - Volume 29, VLDB 2003.

Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In Bernstein, P. A., Ioannidis, Y. E., Ramakrishnan, R., and Papadias, D., editors, VLDB '02: Proceedings of the 28th Intl. Conference on Very Large Databases. Morgan Kaufmann, San Francisco.

Luo, Y., Wang, W., and Lin, X. (2008). Spark: A keyword search engine on relational databases. In Data Engineering, ICDE. IEEE 24th Intl. Conference on.

Mesquita, F., da Silva, A. S., de Moura, E. S., Calado, P., and Laender, A. H. F. (2007). Labrador: Efficiently publishing relational databases on the web by using keyword-based query interfaces. Inf. Process. Manage., 43(4).

Pu, K. and Yu, X. (2009). Frisk: Keyword query cleaning and processing in action. In Data Engineering, ICDE '09. IEEE 25th Intl. Conference on.