Constructing a Generic Natural Language Interface for an XML Database
Rohit Paravastu
Motivation
- The ability to communicate with a database in natural language is regarded as the ultimate goal for database query interfaces
- Challenges:
  - Automatically understanding natural language
  - Translating the parsed natural language query into a database query
NaLIX
- Addresses the challenge of translating a natural language query (NLQ) into XQuery
- Deals with:
  - Attribute name confusion
  - Query structure confusion, e.g. differentiating "Return the book with the lowest price" from "Return the lowest price of the book"
Background: Keyword Search
- Keywords that appear together in a query must match objects that are close together in the database
- Problem? Too blunt: "close together" is an abstract notion
Schema-Free XQuery
- A function called mqf (meaningful query focus) is used to retrieve the relation between two keywords in the search
- Example: "Return the director of Gone with the Wind" — the keyword "Gone with the Wind" is resolved to a movie element
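The intuition behind relating two keywords "close together" can be illustrated with a small sketch: two matches are meaningfully related when their lowest common ancestor is a specific element (here a movie) rather than the whole document. This is an illustration of the idea only, not the paper's mqf implementation; the XML data and helper names are invented.

```python
import xml.etree.ElementTree as ET

# Toy movie database, invented for illustration.
XML = """
<movies>
  <movie>
    <title>Gone with the Wind</title>
    <director>Victor Fleming</director>
  </movie>
  <movie>
    <title>Casablanca</title>
    <director>Michael Curtiz</director>
  </movie>
</movies>
"""

def lowest_common_ancestor(root, a, b):
    """Return the smallest subtree containing both elements a and b."""
    def path_to(node, target, path):
        path.append(node)
        if node is target:
            return True
        for child in node:
            if path_to(child, target, path):
                return True
        path.pop()
        return False
    pa, pb = [], []
    path_to(root, a, pa)
    path_to(root, b, pb)
    lca = None
    for x, y in zip(pa, pb):
        if x is y:
            lca = x
    return lca

root = ET.fromstring(XML)
title = next(t for t in root.iter("title") if t.text == "Gone with the Wind")
# The director meaningfully related to this title shares an ancestor
# more specific than the document root.
related = [d for d in root.iter("director")
           if lowest_common_ancestor(root, title, d) is not root]
print(related[0].text)  # Victor Fleming
```

Only Victor Fleming is returned: the other director's closest common ancestor with the title is the document root, so the keywords are not "close together" in a meaningful way.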
Query Translation
- Relations between the words must be translated into XQuery
- The NLQ is first converted to a parse tree
- Three main steps:
  1. Classification of terms in the parse tree of the NLQ
  2. Validation of the parse tree
  3. Translation of the parse tree into XQuery
Token Classification
- Tokens: words/phrases that match an XQuery construct or an attribute value
- Markers: words that occur neither in the database nor as an XQuery construct
Tokens and Markers
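The token/marker split can be sketched as a vocabulary lookup (the word lists below are invented for illustration; NaLIX classifies against the parser output, the database, and WordNet):

```python
# Hypothetical vocabularies, for illustration only.
XQUERY_CONSTRUCTS = {"return", "lowest", "highest", "number", "same", "each"}
DB_TERMS = {"book", "price", "movie", "director", "title"}

def classify(word):
    """Token: matches an XQuery construct or a database element/value.
    Marker: appears in neither."""
    w = word.lower()
    if w in XQUERY_CONSTRUCTS or w in DB_TERMS:
        return "token"
    return "marker"

query = "Return the book with the lowest price"
print([(w, classify(w)) for w in query.split()])
```

On the example query, "Return", "book", "lowest", and "price" come out as tokens, while the connective words "the" and "with" are markers.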
Query Translation
- Given a valid parse tree, identify the relations between the name tokens and translate them into XQuery syntax
- Not so straightforward
Example
Definitions
- Equivalent NTs: name tokens (NTs) with the same noun phrase and the same modifiers, e.g. movie (nodes 4 and 8) in the example
- Sub-parse tree: a subtree rooted at an operator token that has at least two children
- Core token: an NT in a sub-parse tree with no descendant NTs, or an NT equivalent to another core token, e.g. movie (nodes 4, 8) and book (node 11)
Definitions
- Directly related NTs: in a parent-child relation, e.g. title and movie
- Related by core tokens: related to the same or equivalent core tokens
- Related NTs: either of the above, or related to the same NT, e.g. the sets {2, 4, 6, 8} and {9, 11} in the example
- Each set of related NTs is grouped together in the same MQF
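The grouping of related NTs can be sketched as a union-find over pairwise relations; the node numbers and pairs below come from the example above, while the data structure itself is an illustration, not NaLIX's code:

```python
# Related-NT grouping sketched as a union-find.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Pairs encode the example's relations: (2,4) and (6,8) are
# parent-child (title/movie), (4,8) are equivalent core tokens,
# and (9,11) are directly related in the second clause.
relations = [(2, 4), (6, 8), (4, 8), (9, 11)]
for a, b in relations:
    union(a, b)

groups = {}
for node in (2, 4, 6, 8, 9, 11):
    groups.setdefault(find(node), set()).add(node)
print(sorted(map(sorted, groups.values())))  # [[2, 4, 6, 8], [9, 11]]
```

The two resulting groups match the example's MQFs: {2, 4, 6, 8} and {9, 11}.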
Variables
- Each set of equivalent name tokens is assigned a variable
- A variable can also be made up of a group of variables, called a composed variable
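Variable binding can be sketched as assigning one variable per equivalence class of name tokens (the variable names and node sets are illustrative, continuing the running example):

```python
def bind_variables(equiv_classes):
    """Assign one XQuery variable to each set of equivalent name tokens."""
    return {f"$var{i + 1}": sorted(cls) for i, cls in enumerate(equiv_classes)}

# "movie" appears as nodes 4 and 8, "title" as 2 and 6, "book" as 11.
bindings = bind_variables([{4, 8}, {2, 6}, {11}])
print(bindings)  # {'$var1': [4, 8], '$var2': [2, 6], '$var3': [11]}
```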
Template Matching
- A variable or a group of variables is matched against a given template
- The template gives the translation for that particular set of variables/phrases in the sentence
Templates
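Template matching can be sketched as pattern-to-fragment substitution: each token pattern maps to an XQuery fragment with variable slots. The templates below are simplified inventions for illustration, not the paper's actual template set:

```python
# Simplified, made-up templates: token pattern -> XQuery fragment.
TEMPLATES = {
    ("return", "NT"): "return {v1}",
    ("NT", "value"): 'where {v1}/text() = "{val}"',
    ("lowest", "NT"): "min({v1})",
}

def instantiate(pattern, **slots):
    """Fill a template's variable slots with bound variables/values."""
    return TEMPLATES[pattern].format(**slots)

print(instantiate(("return", "NT"), v1="$book"))
print(instantiate(("NT", "value"), v1="$director", val="Ron Howard"))
print(instantiate(("lowest", "NT"), v1="$price"))
```

The three calls produce `return $book`, `where $director/text() = "Ron Howard"`, and `min($price)`; the real system composes such fragments into a full FLWOR expression.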
Aggregator Nesting
- If the NT attached to an aggregate function is a core token, the entire sentence is considered part of the aggregation, e.g. "Return the number of movies, where the director of the movie is Ron Howard", "Return the lowest price for each book"
- If the NT attached to an aggregate function is not a core token, the scope of the aggregation is limited to the directly related NTs of the attached NT, e.g. "Return each book with lowest price"
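The scoping rule above can be sketched as a single branch on core-token membership (function and data names are illustrative):

```python
def aggregation_scope(attached_nt, core_tokens, directly_related):
    """Sketch of the nesting rule: a core token under an aggregate
    function scopes the aggregation over the whole sentence; a
    non-core token limits it to its directly related NTs."""
    if attached_nt in core_tokens:
        return "whole-sentence"
    return sorted(directly_related[attached_nt])

core = {"movie", "book"}
related = {"price": ["book"]}

# "Return the number of movies, where the director ... is Ron Howard"
print(aggregation_scope("movie", core, related))   # whole-sentence
# "Return each book with lowest price"
print(aggregation_scope("price", core, related))   # ['book']
```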
Translation Process
- Parse
- Variable binding
- Nesting scope
- Final XQuery
Example Output
- Query: "Return each director, where the number of movies directed by the director is the same as the number of movies directed by Ron Howard"
Interactive Query Formulation
- Users are asked to rephrase the question if there is no valid parse tree
- Suggestions are given for rephrasing the query
- Given the attribute-value tokens, the phrases that express the relation between the attributes can be rephrased
- Ambiguity in attribute values is resolved using WordNet
Experimental Evaluation
- Participants were asked to answer a given question using keyword search or NaLIX
- Comparison over:
  - Ease of use
  - Search quality
- Each participant reformulated the query iteratively until an acceptable threshold of precision and recall was reached
Experimental Evaluation
- Ease of use: time taken to come up with an acceptable NLQ
- Search quality: precision and recall of the resulting XQuery
- Book data from the DBLP database was used for evaluation
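The two quality metrics reduce to the usual set definitions; a small sketch with made-up result sets:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|
    Recall      = |retrieved ∩ relevant| / |relevant|"""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical query result: 3 books returned, 3 actually relevant,
# 2 in common.
p, r = precision_recall({"b1", "b2", "b3"}, {"b1", "b2", "b4"})
print(p, r)  # 0.666..., 0.666...
```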
Results: Ease of Use
- Average time of 90 seconds to form a query
- Fewer than 2 iterations per query on average
- For each question, at least one participant produced a correct NLQ in the first iteration
Results: Search Quality
- Average precision of 83% and recall of 90.1%
- Quality affected by:
  - Quality of the NLQ given by the user
  - Parser accuracy
- Average precision of 95.1% and recall of 97.6% for queries that were formulated and parsed correctly
Results
- [Charts: precision and recall of search results]
Discussion
- Positive points
- Drawbacks
- Is it useful for your project?
- Are you convinced of its usability across different datasets?
- Any suggestions/ideas on how to make this better?