International Journal of Electronics and Computer Science Engineering
Available Online at www.ijecse.org, ISSN 2277-1956

Providing Inferential Capability to Natural Language Database Interface

Harjit Singh
Assistant Professor, Department of Computer Science, Punjabi University Akali Phoola Singh Neighbourhood Campus, Dehla Seehan (Sangrur), Punjab, India
Email: hjit@live.com

Abstract- Not everybody is able to write SQL (Structured Query Language) queries, and many users are not aware of the structure of the database. Non-expert users therefore need a way to query relational databases in their natural language. The idea of using natural language instead of SQL has promoted the development of Natural Language Interface to Database (NLIDB) systems. Traditional information retrieval models were based on approximation and lexical mapping, which have their own deficiencies. The system is inadequate if the query uses hypernyms (broad category words). If a user uses a synonym of a word in the lexicon, the system is unable to access the database. A homonymous keyword in a query may give rise to ambiguity and produce an erroneous result, because the system is unable to distinguish the intended meaning of the homonym. In a query, lexemes may be related to each other and produce a combined meaning, which the classical system does not consider. To overcome these limitations, a knowledgebase can be provided to the NLIDB. The knowledgebase gives the system inferential capability using a collection of hypernyms, synonyms, homonyms, discourse and other information required to produce accurate results.

Keywords: NLIDB, NLI, hypernyms, synonyms, homonyms, discourse

I. INTRODUCTION

Asking questions of a database in natural language is a convenient and easy method of data access, especially for casual users who do not understand a complex database query language such as SQL.
Although researchers have made a number of efforts to provide intelligence to Natural Language Interface to Database (NLIDB) systems, none of them is complete. Some of these efforts are described below.

The LUNAR system was introduced in 1971. It uses an Augmented Transition Network (ATN) parser and Woods' procedural semantics. Its performance was quite impressive: it handled 78% of requests without any errors, and this ratio rose to 90% when dictionary errors were corrected. These figures may be misleading, however, because the system was not subjected to intensive use owing to the limits of its linguistic capabilities.

The LADDER system was designed as a natural language interface to a database of information about US Navy ships. It uses a semantic grammar technique that interleaves syntactic and semantic processing. The system was able to process a database equivalent to a relational database with 14 tables and 100 attributes.

The RENDEZVOUS system appeared in the late seventies. It let users access databases via relatively unrestricted natural language. Special emphasis is placed on query paraphrasing and on engaging users in clarification dialogues when the system has difficulty parsing their input.

The PLANES system was developed in the late seventies at the University of Illinois Coordinated Science Laboratory. PLANES includes an English language front end with the ability to understand and explicitly answer user requests. It carries out clarifying dialogues with the user and can answer vague or poorly defined questions.

The PHILIQA system (Philips Question Answering System) was developed in 1977. It uses a syntactic parser that runs as a separate pass from the semantic understanding passes. The system is mainly concerned with problems of semantics and has three separate layers of semantic understanding.

CHAT-80 is one of the most referenced NLP (Natural Language Processing) systems of the eighties.
The database of CHAT-80 consists of facts (i.e. oceans, major seas, major rivers and major cities) about 150 of the world's countries, together with a small English vocabulary that is sufficient for querying the database. CHAT-80 processes an English language question in three stages.

The TEAM system was developed in 1987. A large part of the research of that time was devoted to portability issues. TEAM was designed to be easily configurable by database administrators with no knowledge of NLIDBs.

The DATALOG system is an English database query system based on a Cascaded ATN grammar. By providing separate representation schemes for linguistic knowledge, general world knowledge and application domain knowledge, DATALOG achieves a high degree of portability and extensibility.

NALIX (Natural Language Interface for an XML Database) is an NLIDB system developed at the University of Michigan in 2006. Its database is an Extensible Markup Language (XML) database, with Schema-Free XQuery as the database query language. NALIX differs from the general syntax-based approaches in the way it was built: it applies a reverse-engineering technique, building the system from the query language toward the sentences.

All of these efforts are valuable, but none is complete. These systems fail to understand and execute queries containing hypernyms, synonyms, homonyms, discourse and so on. The next section of this paper describes a method for overcoming these inadequacies.

II. PROVIDING INFERENTIAL CAPABILITY TO NLIDB

A conventional NLI system is not capable of understanding hypernyms, synonyms, homonyms and discourse. This section discusses how the results are affected when such words are used in a query and proposes a method to overcome these deficiencies.

A. Hypernyms

A hypernym is a word that is more generic than a given word: a linguistic term for a word whose meaning includes the meanings of other words.
If a query contains a hypernym of a specific word, the classical NLI system is inadequate because it has no knowledge of hypernyms. For example, suppose the NLI query is:

Number of students doing graduation

The conventional NLI does not know that the word graduation is a hypernym of BA, BSc, BCom, BCA, BBA etc. (Figure 1), so it fails to produce the required result. To overcome this limitation, a database of hypernyms needs to be embedded in the NLI system.

Graduation
BA   BSc   BCom   BCA   BBA

Figure 1. Graduation is a hypernym for BA, BSc, BCom, BCA, BBA etc.

B. Synonyms

Synonyms are different words with identical or very similar meanings. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. If a synonym of a word in the lexicon is used in a query, classical information retrieval methods are unable to answer the query because they have no database of synonyms. For example:
If an NLI query is:

Number of employees whose salary is more than 25000

The equivalent SQL query will be something like:

SELECT COUNT(*) FROM EMPLOYEES WHERE SALARY > 25000

This query produces the correct result. But if the user uses a synonym and writes the NLI query as:

Number of employees whose pay is more than 25000

The equivalent SQL query will be something like:

SELECT COUNT(*) FROM EMPLOYEES WHERE PAY > 25000

IJECSE, Volume 1, Number 3, Harjit Singh et al.

In this case, the query does not produce the correct result, because the NLI does not know that Salary and Pay are two words with the same meaning and the EMPLOYEES table has no PAY column. To enable the NLI to answer queries that use synonyms, a database of synonyms needs to be embedded in the NLI system. Then, if a lexicon entry or keyword does not match, the system does not refuse to answer the query. Instead, it traverses all the synonyms of the unmatched keyword in the database of synonyms and replaces the unmatched keyword with the synonym that matches, so that the query can be answered successfully (Figure 2).

Pay: Salary, Income, Remuneration
SELECT COUNT(*) FROM EMPLOYEES WHERE SALARY > 25000

Figure 2. Replacing an unmatched keyword with its synonym

C. Homonyms

In linguistics, a homonym is, in the strict sense, one of a group of words that share the same spelling and the same pronunciation but have different meanings. The state of being a homonym is called homonymy. If a query contains a homonymous keyword, the meaning of the query becomes ambiguous. The result may be erroneous, because a classical NLI is unable to determine the intended meaning of the homonym in the query. For example, suppose the NLI query is:

How many singers like the rock?

The word rock has two meanings:
1) The solid mineral material forming part of the surface of the earth; a large stone.
2) A type of music (a shortened form of rock and roll).

The classical NLI does not know which meaning of rock is intended in the query.
To overcome this limitation, a knowledgebase of homonyms needs to be embedded in the NLI system. Since the word rock can be used in two different contexts, the knowledgebase provides the intended meaning of the word by relating it to the other words in the query (Figure 3 and Figure 4).
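As a rough illustration of the lookups described in Sections A to C, the sketch below models the knowledgebase as three small in-memory tables. The paper does not specify how the knowledgebase is stored or queried, so the table contents, the function names (expand_hypernym, resolve_synonym, disambiguate) and the word-overlap rule used to pick a homonym sense are all assumptions made for illustration.

```python
import re

# Hypothetical knowledgebase tables; every entry is an illustrative assumption.
HYPERNYMS = {"graduation": ["BA", "BSc", "BCom", "BCA", "BBA"]}
SYNONYMS = {"salary": ["pay", "income", "remuneration"]}  # column -> synonyms
HOMONYMS = {
    "rock": [
        {"sense": "type of music", "related": {"singer", "band", "song"}},
        {"sense": "large stone", "related": {"mineral", "earth", "mountain"}},
    ]
}


def tokenize(query):
    """Split a query into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())


def expand_hypernym(word):
    """Return the specific terms covered by a hypernym, or [word] if unknown."""
    return HYPERNYMS.get(word.lower(), [word])


def resolve_synonym(word, known_columns):
    """Map a keyword to a known column name, using the synonym table if needed."""
    w = word.lower()
    if w in known_columns:
        return w
    for column, alternatives in SYNONYMS.items():
        if column in known_columns and w in alternatives:
            return column
    return None  # no match even after trying synonyms


def disambiguate(word, query):
    """Pick the homonym sense whose related words overlap most with the query."""
    senses = HOMONYMS.get(word.lower())
    if not senses:
        return None
    tokens = set(tokenize(query))
    # Naive plural handling (assumption): also try each token without a final "s".
    tokens |= {t[:-1] for t in tokens if t.endswith("s")}
    best = max(senses, key=lambda s: len(s["related"] & tokens))
    if not best["related"] & tokens:
        return None  # nothing in the query supports any sense
    return best["sense"]
```

With these assumed tables, resolve_synonym("pay", {"salary"}) maps the unmatched keyword to the SALARY column, and disambiguate("rock", "How many singers like the rock?") selects the music sense because the query mentions singers. A real system would need a far larger lexical resource and a more robust matching rule than simple word overlap.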
Rock: Type of Music, Related to Singer

Figure 3. Rock is related to Singer

Rock: Is A Large Stone, Composed Of Solid Mineral Material

Figure 4. Rock is a large stone

D. Discourse

Discourse is the power of the mind to reason or infer by running, as it were, from one fact or reason to another, and deriving a conclusion; an exercise or act of this power; reasoning; range of reasoning faculty. If a query contains a keyword that refers to another keyword, the system may fail to produce the result, because a classical NLI does not have world knowledge. For example, suppose the NLI query is:

Show the age of John and his father?

A classical NLI cannot work out who is meant, because it has no reasoning capability. If a knowledgebase of pronouns, anaphora and noun phrases (NP) is embedded in the NLI system, it gives the system the intelligence to draw a conclusion in such situations, and the result of this type of query can be obtained. The Discourse Representation Structure (DRS) in Figure 5 shows the discourse in the above query.

({x, y}, {x = John, His(x), Robert(y), Father(y, x)})

x, y
x = John
His(x)
Robert(y)
Father(y, x)

Figure 5. DRS of John and his Father

So in this query, his refers to John, y is the father of x, and y is Robert. After this conclusion, the query becomes:

Show the age of John and Robert

which can be easily processed to produce the required result.

III. CONCLUSION

The conventional NLI model suffers from inadequacies that allow only queries in a static format to be executed by the system. This places the burden on the user to formulate queries that the system can answer successfully. To make the NLI system user friendly, a knowledgebase can be embedded in it to overcome its inability to handle hypernyms, synonyms, homonyms and discourse. This inferential capability makes the NLI system intelligent and enables it to execute open-domain queries. The result is an intelligent NLIDB.