A Survey of Natural Language Interfaces to Database Management Systems

B. Sujatha, Research Scholar, Dept. of CSE, J.N.T. University, Hyderabad, sujatha_tha2003@yahoo.com
Dr. S. Viswanadha Raju, Professor & Head, Dept. of CSE, J.N.T. University, Jagtial
Humera Shaziya, Lecturer in Computers, Dept. of M.C.A, Nizam College, Hyderabad, humera.shaziya@gmail.com

Abstract- Information plays an important role in our lives. One of the major sources of information is databases. Databases and database technology have a major impact on the growing use of computers and the Internet. Almost all information technology (IT) applications store and retrieve information from databases, and retrieving information from a database requires knowledge of a database language such as SQL. The Structured Query Language (SQL) standard is followed by almost all query languages for relational database management systems. However, not everybody is able to write SQL queries, as they may not be aware of the structure of the database. This has led to the development of natural language interfaces to databases. There is an overwhelming need for non-expert users to query relational databases in their own natural language instead of working with the syntax of SQL. As a result, many natural language interfaces to databases have been developed, each providing different options for formulating queries. The idea of using natural language instead of SQL has prompted the development of a new type of processing called the Natural Language Interface to Database (NLIDB). A natural language interface to a database is a step towards an intelligent database interface that eases the task of users in accessing databases and offers them a simplified way of interacting with database systems.

Keywords- Natural Language, interface, DBMS, NL Processing System, NLIDB systems

I. INTRODUCTION

A natural language interface to a database is a system that allows the user to access information stored in a database by typing requests expressed in some natural language (e.g., English or Telugu). The purpose of this paper is to serve as an introduction to research in the area of natural language interfaces to databases. It is by no means a complete discussion of all the issues relevant to such interfaces, and the reader is also referred to alternative surveys of the same field. The area of natural language research is still very experimental, and systems so far have been limited to small domains where only certain types of sentences can be used. When systems are scaled up to cover larger domains, natural language processing becomes very difficult due to the inherent ambiguity of sentences and the vast amount of information that needs to be incorporated in order to disambiguate them. For example, the sentence "The woman saw the man on the hill with the telescope." could have many different meanings. To understand what the intended meaning is, we have to take into account the current context (such as the woman being a witness) and any background information (such as there being a hill nearby with a telescope on it). Alternatively, the man could be on the hill, and the woman may be looking through the telescope.
All this information is very difficult to represent in a computer, so restricting the domain of a natural language system is the only practical way to obtain a manageable subset of English to work with. One area in which natural language systems are powerful enough to be effective is database query systems. Databases usually cover a small enough domain that an English question about the data within them can be easily analyzed by a natural language system. The database can then be consulted and an appropriate response generated. Much research is still going on in natural language database interfaces: error reporting when a sentence cannot be analyzed successfully, new approaches to natural language systems, and automatic generation of database natural language systems remain important unresolved issues. However, the standard approach to database natural language systems is well established. This approach creates a semantic grammar for each database and uses it to parse the English question. The semantic grammar creates a representation of the semantics, or meaning, of the sentence, which is then translated into a database query, as illustrated by the sketch below.
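To make this concrete, the following Python fragment is a minimal, illustrative sketch of a single pattern-style rule of the kind a semantic grammar encodes, mapping one restricted family of English questions onto an SQL template. The employees table, its columns, and the rule itself are hypothetical names introduced here for illustration only; real NLIDB systems use a much larger grammar and a proper parser.

import re

# One hand-written rule for a hypothetical employees(name, salary, dept)
# table. A real semantic grammar contains many such rules (or a full
# grammar); this only shows the general idea of the mapping.
RULE = re.compile(
    r"^(?:list|show|give me) the (?P<column>name|salary) of "
    r"(?:all )?employees in the (?P<dept>\w+) department\??$",
    re.IGNORECASE,
)

def to_sql(question):
    """Translate one restricted family of English questions into SQL."""
    match = RULE.match(question.strip())
    if match is None:
        return None  # the question falls outside the covered domain
    return "SELECT {0} FROM employees WHERE dept = '{1}';".format(
        match.group("column"), match.group("dept"))

print(to_sql("Show the salary of all employees in the sales department"))
# prints: SELECT salary FROM employees WHERE dept = 'sales';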
II. BACKGROUND

There are three areas of research associated with this study: natural language processing in general, natural language as an interface to databases, and information technology systems.

Natural Language Processing System

Natural language processing is the technology for dealing with our most ubiquitous product, human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, and from detecting people's opinions about products or services to extracting appointments from email. The fundamental algorithms and mathematical models of human language processing underlie all of these applications and can be applied to practical problems wherever language data is encountered. Research in natural language processing has been going on for several decades, dating back to the late 1940s. Machine translation (MT) was the first computer-based application related to natural language. While Weaver and Booth started one of the earliest MT projects in 1946, on computer translation based on expertise in breaking enemy codes during World War II, it is generally agreed that it was Weaver's memorandum of 1949 that brought the idea of MT to general notice and inspired many projects. He suggested using ideas from cryptography and information theory for language translation. Research began at various research institutions in the United States within a few years. Early work in MT took the simplistic view that the only differences between languages resided in their vocabularies and the permitted word orders. Systems developed from this perspective simply used dictionary lookup to find appropriate words for translation and reordered the words after translation to fit the word-order rules of the target language, without taking into account the lexical ambiguity inherent in natural language. This produced poor results. The apparent failure made researchers realize that the task was a lot harder than anticipated and that they needed a more adequate theory of language. However, it was not until 1957, when Chomsky published Syntactic Structures introducing the idea of generative grammar, that the field gained better insight into whether or how mainstream linguistics could help MT. During this period, other NLP application areas began to emerge, such as speech recognition. The language processing community and the speech community were then split into two camps, with the language processing community dominated by the theoretical perspective of generative grammar and hostile to statistical methods, and the speech community dominated by statistical information theory (5) and hostile to theoretical linguistics (6). Due to developments in the syntactic theory of language and in parsing algorithms, there was such overenthusiasm in the 1950s that people believed fully automatic high-quality translation systems (2) would be able to produce results indistinguishable from those of human translators, and that such systems would be in operation within a few years. This was not only unrealistic given the then-available linguistic knowledge and computer systems, but also impossible in principle (3). The inadequacies of then-existing systems, perhaps compounded by this overenthusiasm, led to the ALPAC (Automatic Language Processing Advisory Committee of the National Academy of Sciences - National Research Council) report of 1966 (7). The report concluded that MT was not immediately achievable and recommended that it not be funded.
This had the effect of halting MT and most work in other applications of NLP, at least within the United States. Although there was a substantial decrease in NLP work in the years after the ALPAC report, there were some significant developments, both in theoretical issues and in the construction of prototype systems. Theoretical work in the late 1960s and early 1970s focused on how to represent meaning and on developing computationally tractable solutions that the then-existing theories of grammar were not able to provide. In 1965, Chomsky (8) introduced the transformational model of linguistic competence. However, transformational generative grammars were too syntactically oriented to allow for semantic concerns, and they did not lend themselves easily to computational implementation. As a reaction to Chomsky's theories and the work of other transformational generativists, the case grammar of Fillmore (9), the semantic networks of Quillian (10), and the conceptual dependency theory of Schank (11) were developed to explain syntactic anomalies and to provide semantic representations. The augmented transition networks of Woods (12) extended the power of phrase-structure grammars by incorporating mechanisms from programming languages such as LISP. Other representation formalisms included Wilks' preference semantics (13) and Kay's functional grammar (14).

III. MOTIVATION

Research has been very active in developing interfaces for accessing structured knowledge, from faceted search, where knowledge is grouped and represented through taxonomies, to menu-guided and form-based interfaces such as those offered by the Knowledge and Information Management (KIM) platform [Popov et al., 2003]. These interfaces still require that the user be familiar with the structure of the queried knowledge. However, casual users need to be able to access the data even when their queries do not exactly match the queried data structures. According to interface evaluations, systems that support natural language interfaces are perceived as the most acceptable by end users. This conclusion is drawn from a usability study which compared four types of query language interfaces to knowledge bases and involved 52 users of general background. Web users are used to typing simple questions into the text box of a search engine, and search engines like Google are capable of answering simple questions such as "What is the capital of Serbia?". However, the power of Linked Data lies in the capability to answer more complex queries for which the answer
cannot be found through Google. In addition, much data on the Web is accessible only through applications based on relational databases.

IV. HISTORY OF NLIDB

The very first attempts at NLP database interfaces are just as old as any other NLP research. Asking questions of databases in natural language is a very convenient and easy method of data access, especially for casual users who do not understand complex database query languages such as SQL. The following are some examples of natural language interface to database systems.

A. Lunar System
LUNAR is a system that answers questions about samples of rock brought back from the moon. The system was informally introduced in 1971. To accomplish its function, LUNAR uses two databases: one for the chemical analyses and the other for literature references. The LUNAR system uses an Augmented Transition Network (ATN) parser and Woods' procedural semantics. The system's performance was quite impressive: it managed to handle 78% of requests without any errors, and this ratio rose to 90% when dictionary errors were corrected. However, these figures may be misleading, because the system was not subject to intensive use due to the limitations of its linguistic capabilities.

B. Ladder
The LADDER system was designed as a natural language interface to a database of information about US Navy ships. LADDER uses a semantic grammar to parse questions and query a distributed database, with a technique that interleaves syntactic and semantic processing. Question answering is done by parsing the input and mapping the parse tree to a database query. The LADDER system is based on a three-layered architecture. The first component is INLAND (Informal Natural Language Access to Navy Data), which accepts questions in natural language and produces a query to the database. The queries from INLAND are directed to the Intelligent Data Access (IDA) component, which is the second component of LADDER. The INLAND component builds a fragment of an IDA query for each lower-level syntactic unit in the English input, and these fragments are then combined as higher-level syntactic units are recognized. At the sentence level, the combined fragments are sent as a command to IDA. IDA composes an answer relevant to the user's original query, in addition to planning the correct sequence of file queries. The third component of the LADDER system is the File Access Manager (FAM). The task of FAM is to find the location of the generic files and to manage access to them in the distributed database. The LADDER system was implemented in LISP. At the time of its creation, LADDER was able to process a database equivalent to a relational database with 14 tables and 100 attributes.

C. Rendezvous System
This system appeared in the late seventies. In Codd's RENDEZVOUS system, users could access databases via relatively unrestricted natural language; special emphasis was placed on query paraphrasing and on engaging users in clarification dialogues when there was difficulty in parsing user input.

D. Planes
PLANES (Programmed LANguage-based Enquiry System) was developed in the late seventies at the University of Illinois Coordinated Science Laboratory. PLANES includes an English-language front end with the ability to understand and explicitly answer user requests. It carries out clarifying dialogues with the user and can answer vague or poorly defined questions.
The work was carried out using a database based on information from the U.S. Navy 3-M (Maintenance and Material Management) system, a database of aircraft maintenance and flight data, although the ideas can be applied directly to other non-hierarchic record-based databases.

E. Philiqa
PHILIQA (Philips Question Answering system), developed in 1977, uses a syntactic parser that runs as a separate pass from the semantic understanding passes. The system is mainly concerned with problems of semantics and has three separate layers of semantic understanding. The layers are called "English Formal Language", "World Model Language", and "Data Base Language", and appear to correspond roughly to the "external", "conceptual", and "internal" views of data.

F. Chat-80
CHAT-80 is one of the most referenced NLP systems of the eighties. The system was implemented in Prolog and was an impressive, efficient and sophisticated system. The database of CHAT-80 consists of facts (e.g., oceans, major seas, major rivers and major cities) about roughly 150 countries of the world, together with a small English vocabulary that is sufficient for querying the database. The CHAT-80 system processes an English question in three stages.

G. Team
TEAM was developed in 1987. A large part of the research of that time was devoted to portability issues, and TEAM was designed to be easily configurable by database administrators with no knowledge of NLIDBs.

H. Ask
ASK, developed in 1983, allowed end users to teach the system new words and concepts at any point during the interaction. ASK was actually a complete information management system, providing its own built-in database and the ability to interact with multiple
external databases, electronic mail programs and other computer applications. All the applications connected to ASK were accessible to the end user through natural language requests. The user stated his or her requests in English, and ASK transparently generated suitable requests to the appropriate underlying systems.

I. Janus
JANUS had similar abilities to interface to multiple underlying systems (databases, expert systems, graphics devices, etc.). All the underlying systems could participate in the evaluation of a natural language request, without the user ever becoming aware of the heterogeneity of the overall system. JANUS is also one of the few systems to support temporal questions.

J. Eufid
The EUFID system consists of three major modules, not counting the DBMS: an analyzer module, a mapper module, and a translator module.

K. Datalog
DATALOG is an English database query system based on a cascaded ATN grammar. By providing separate representation schemes for linguistic knowledge, general world knowledge, and application domain knowledge, DATALOG achieves a high degree of portability and extensibility.

V. RECENT DEVELOPMENTS IN NLIDB

This section provides a brief overview of three specific NLIDB systems developed recently at different universities.

A. Nalix
NALIX (Natural Language Interface for an XML Database) is an NLIDB system developed at the University of Michigan, Ann Arbor by Yunyao Li, Huahai Yang, and H. V. Jagadish (2006). The database used by this system is an extensible markup language (XML) database, with Schema-Free XQuery as the database query language. Schema-Free XQuery is a query language designed mainly for retrieving information from XML. The idea is to use keyword search over databases; however, pure keyword search cannot be applied directly, so richer query mechanisms are added. Given a collection of keywords, each keyword has several candidate XML elements to which it may relate. All of these candidates are added to an MQF (Meaningful Query Focus), which automatically finds all the relations between these elements. The main advantage of Schema-Free XQuery is that it is not necessary to map a query onto the exact database schema, since the relations are found automatically from the keywords. NALIX can be classified as a syntax-based system, since the transformation process is done in three steps: generating a parse tree, validating the parse tree, and translating the parse tree into an XQuery expression, as sketched below. NALIX differs from general syntax-based approaches in the way the system was built: it implements a reverse-engineering technique, building the system from the query language toward the sentences.
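The following Python fragment is a toy rendering of the three NALIX-style steps just described (parse, validate, translate). The tiny hand-built parse tree, the token classes, and the target XML document (a hypothetical bib.xml with book, year and title elements) are assumptions made purely for illustration; they are not taken from the NALIX system itself.

from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    kind: str                      # e.g. "command", "name", "value"
    children: list = field(default_factory=list)

def validate(root):
    # Step 2: accept only trees whose root is a command token.
    return root.kind == "command"

def to_xquery(root):
    # Step 3: translate a validated tree into a simple XQuery expression.
    name = next(c.word for c in root.children if c.kind == "name")
    value = next((c.word for c in root.children if c.kind == "value"), None)
    where = ' where $x/year = {0}'.format(value) if value else ""
    return 'for $x in doc("bib.xml")//{0}{1} return $x/title'.format(name, where)

# Step 1 (parsing) is assumed to be done by an off-the-shelf parser; here
# the tree for "show titles of books from 2005" is written out by hand.
tree = Node("show", "command", [Node("book", "name"), Node("2005", "value")])
if validate(tree):
    print(to_xquery(tree))
    # for $x in doc("bib.xml")//book where $x/year = 2005 return $x/title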
B. Precise
PRECISE is a system developed at the University of Washington by Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates (2004). The target database is a relational database, with SQL as the query language. PRECISE introduces the idea of semantically tractable sentences, which are sentences that can be translated into a unique semantic interpretation by analyzing lexicons and semantic constraints. PRECISE was evaluated on two database domains. The first is the ATIS domain, which consists of spoken questions about air travel, their written forms, and their correct translations into SQL. In the ATIS domain, 95.8% of the questions were semantically tractable, and on these questions PRECISE achieves 94% precision. The second domain is the GEOQUERY domain, which contains information about U.S. geography. In GEOQUERY, 77.5% of the questions are semantically tractable, and on these questions PRECISE achieves 100% accuracy. The strength of PRECISE lies in its ability to match keywords in a sentence to the corresponding database structures. This process is done in two stages: first by narrowing down the possibilities using a max-flow algorithm, and second by analyzing the syntactic structure of the sentence; an illustrative sketch of the matching step follows below. As a result, PRECISE is able to perform impressively on semantically tractable questions. Like other NLIDB systems, PRECISE has its own weaknesses. While it is able to achieve high accuracy on semantically tractable questions, it pays for the gain in precision with a loss of recall. Another problem is that, since PRECISE adopts a heuristic-based approach, it has difficulty handling nested structures.
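The max-flow stage can be pictured as finding a complete matching between question tokens and database elements. The short Python sketch below uses the bipartite matching routine from the networkx library to illustrate this idea; the token-to-element candidate lists and the toy GEOQUERY-style names are invented for illustration, and PRECISE's real graph construction and constraints are considerably richer.

import networkx as nx
from networkx.algorithms import bipartite

# Database elements each (stemmed) question token could refer to;
# these candidate lists are invented purely for illustration.
candidates = {
    "state":   ["table:state", "column:state.name"],
    "borders": ["relation:borders"],
    "texas":   ["value:state.name=texas"],
}

G = nx.Graph()
for token, elements in candidates.items():
    for element in elements:
        G.add_edge(("tok", token), ("db", element))

token_nodes = [("tok", t) for t in candidates]
matching = bipartite.maximum_matching(G, top_nodes=token_nodes)
token_to_element = {t: matching[("tok", t)][1]
                    for t in candidates if ("tok", t) in matching}

# A question is considered for translation only if every token is matched.
complete = len(token_to_element) == len(candidates)
print(complete, token_to_element)
# prints True and a mapping such as {'state': 'table:state', ...}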
C. Wasp
Word Alignment-based Semantic Parsing (WASP) is a system developed at the University of Texas, Austin by Yuk Wah Wong. While the system is designed to address the broader goal of constructing a complete, formal, symbolic, meaningful representation of a natural language sentence, it can also be applied to the NLIDB domain. Predicate logic (Prolog) was used as the formal query language. WASP learns to build a semantic parser from a corpus: a set of natural language sentences annotated with their correct formal queries. It requires no prior knowledge of the syntax, because the whole learning process is done using statistical machine translation techniques. WASP was evaluated on the GEOQUERY domain, the same domain as PRECISE. The GEOQUERY corpus consists of 880 questions in the training set and 250 questions in the test set, which are merged into one larger data set. The data set was divided into 10 equal-sized subsets, and standard 10-fold cross-validation was used to estimate the system's performance. WASP achieved 86.14% precision and 75.00% recall in the GEOQUERY domain. The system was also evaluated on a variety of natural languages: English, Spanish, Japanese and Turkish. No significant differences were observed between English and Spanish, but the Japanese corpus had the lowest precision and the Turkish corpus the lowest recall. The strength of WASP comes from its ability to build a semantic parser from annotated corpora. This approach is beneficial because it uses statistical machine translation with minimal supervision, so a grammar does not have to be developed manually for each new domain. Moreover, while most NLIDB systems use English as their natural language, WASP has been tested on several languages. In spite of these strengths, WASP also has two weaknesses. The first is that the system is based solely on the analysis of a sentence and its possible query translation; the database itself is left untouched. There is a lot of information that could be extracted from a database, such as its lexical notation, its structure, and the relations within it, and not using this knowledge prevents WASP from achieving better performance. The second is that the system requires a large amount of annotated data before it can be used, and building such corpora requires a large amount of work.
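For reference, the precision and recall figures quoted for PRECISE and WASP are conventionally computed over the set of questions a system attempts to answer (precision) and over all questions (recall). The following short Python sketch shows this bookkeeping; the sample data is invented, and a real evaluation would execute the generated queries and compare their answers against the gold standard.

def evaluate(results):
    # results: list of (produced_query_or_None, is_correct) pairs.
    attempted = [correct for query, correct in results if query is not None]
    n_correct = sum(attempted)
    precision = n_correct / len(attempted) if attempted else 0.0
    recall = n_correct / len(results) if results else 0.0
    return precision, recall

# Invented sample: four questions, one not translated at all, one wrong.
sample = [("Q1 -> SQL", True), ("Q2 -> SQL", True), (None, False), ("Q4 -> SQL", False)]
precision, recall = evaluate(sample)
print("precision = {:.2f}, recall = {:.2f}".format(precision, recall))
# precision = 0.67, recall = 0.50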
VI. CONCLUSION

This paper has attempted to serve two purposes: to introduce the reader to the field of NLIDBs by describing some of the central issues, and to indicate the current situation in this area by outlining the facilities and methods of typical implemented systems. The goal of surveying the field can be achieved only incompletely at any given moment. Research on natural language interfaces has been carried out over the last few decades, and with the advancement in hardware processing power, many of the NLIDBs mentioned in the historical background obtained promising results. Although several NLIDB systems have also been developed for commercial use, the use of NLIDB systems is not widespread, and they are not a standard option for interfacing to a database. This lack of acceptance is mainly due to the large number of deficiencies that remain in the ability of NLIDB systems to understand natural language.

REFERENCES

[1] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000.
[2] Edith Buchholz, Heiko Cyriaks, Antje Düsterhöft, Holger Mehlan, and Bernhard Thalheim, "Applying a Natural Language Dialogue Tool for Designing Databases."
[3] Paul S. Jacobs, "Generation in a Natural Language Interface," Division of Computer Science, Department of EECS, University of California, Berkeley, CA, USA.
[4] Frank Meng and Wesley W. Chu, "Database Query Formation from Natural Language using Semantic Modeling and Statistical Keyword Meaning Disambiguation," Computer Science Department, University of California, Los Angeles, CA, USA.
[5] Gary Adams and Philip Resnik, "A Language Identification Application Built on the Java Client/Server Platform," Sun Microsystems Laboratories, Two Elizabeth Drive, Chelmsford, MA, USA, and Department of Linguistics and UMIACS, University of Maryland, College Park, MD 20742, USA.
[6] Majdi Owda, Zuhair Bandar, and Keeley Crockett, "Conversation-Based Natural Language Interface to Relational Databases," Intelligent Systems Group, Department of Computing and Mathematics, Manchester Metropolitan University, Chester Street, Manchester M1 5GD, UK, 2007.
[7] Ana-Maria Popescu, Oren Etzioni, and Henry Kautz, "Towards a Theory of Natural Language Interfaces to Databases," Computer Science, University of Washington, Seattle, WA 98195, USA.
[8] S. E. Michos, N. Fakotakis, and G. Kokkinakis, "Towards an Adaptive Natural Language Interface to Command Languages," University of Patras, Greece (received 26 January 1996; revised 29 July 1996).
[9] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch, "Natural Language Interfaces to Databases - An Introduction," Department of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, and Department of Computer Science, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh EH9 3JZ, Scotland, UK (received 25 June 1994; revised 10 December 1994).
[10] Afzal Ballim and Vincenzo Pallotta, Department of Computer Science, Swiss Federal Institute of Technology, Lausanne, "Special Issue on Robust Methods in Analysis of Natural Language Data," Natural Language Engineering 7(2): 189-190, Cambridge University Press, 2001.
[11] A. V. Krishna Prasad and S. Rama Krishna, "Data Mining for Secure Software Engineering Source Code Management Tool: Case Study," International Journal of Engineering Science and Technology, vol. 2(7), 2010, pp. 2667-2667.
[12] http://hlt.fbk.ed/en/technology
[13] http://www.oracle.org/naaron/iipc/language-detectioninvestigation.html
[14] http://developer.cybozu.co.jp/oss/2010/10/language-detect.html