Blaise Bulamambo. BSc Computer Science 2007/2008


Concert Life Database for Natural Language

Blaise Bulamambo
BSc Computer Science 2007/2008

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of Student)

Summary

The aim of this project is to design a system that allows users to query a database using natural language. A system called concertdatabase was built to achieve this aim, and some extensions were also added to it.

Acknowledgements

I would like to thank the United Kingdom, and especially its government, for giving me the opportunity to continue my studies. I also thank Yvonne, my special friend, for all the support she has given me since the beginning of my academic studies. I also thank my supervisor Eric Atwell for all his advice and support during this project, and last but not least I thank my tutor Dr Martyn Clark for all the encouragement he has offered since the beginning of my university life.

Contents

1. Introduction
   Project Background
   Project Objectives and Minimum Requirements
   Project Challenges
   Project Schedule
Background Research
   Design Methodologies
   The Traditional Life Cycle
   Rapid Application Development (RAD)
   Rational Unified Process (RUP)
   Choice of Methodology
   Natural Language Processing
   MontyLingua
   Natural Language Toolkit (NLTK)
   General Architecture for Text Engineering (GATE)
   Choice of Natural Language Processing Toolkit
   Querying with Natural Language
   Building Natural Language System
   Syntax-based Systems
   Semantic Grammar Systems
   Database Model
   Relational Database
   Flat-File Database
Requirements Analysis
   Methods Used to Gather Requirements
   Gathering Requirements
   Functional Requirements
System Design
   System Architecture
   Database Design
   Flat-File Format
   Parser
   Keyword Extraction
   User Interface Design
Implementation and Testing
   Database Implementation
   Natural Language Toolkit
   Choice of Programming Language
   User Interface Implementation
   Parser Implementation
   Keyword Extraction
   Database Search
Evaluation
   Meeting the Functional Requirements
   Meeting Possible Extensions
Conclusion
References
Appendices
   Appendix A: Personal Reflection
   Appendix B: Project Schedule

1. Introduction

1.1 Project Background

The aim of this project is to contribute to the development of a corpus of texts and associated mark-up capturing expert analysis, documenting the rise of the music concert industry in nineteenth-century London. Researchers in the School of Music have already undertaken a pilot project to collect a database of records documenting concert life in nineteenth-century London. The source documents included posters, newspaper adverts and reviews, concert programmes, etc.; these were keyed or scanned into an Oracle database. This project focuses on porting the data from the Oracle database to an open-source corpus toolkit such as the Natural Language Toolkit (NLTK), and then building a natural language interface that allows users to query the database in a natural language such as English.

1.2 Project Objectives and Minimum Requirements

The source documents collected for the 19th-century concerts are stored in a full-fledged relational database. To access the data stored in this database, an artificial structured query language such as SQL is required. The main objective of this project is to create a natural language processing system that allows users to query the database using natural language. The system will accept a user query in natural language, process the query using natural language processing tools, then search the database and output the results.

To achieve this objective, the following minimum requirements need to be satisfied, together with possible extensions that should only be included if time permits:

1. Minimum requirements:

- To convert the data stored in the Oracle Concert Database into a different format and to access the converted data using the open-source corpus toolkit.
- To search the converted data using natural language.

- To implement an input/output interface that allows the user to enter natural language queries and displays the results.

2. Possible Extensions:

1. An extension to the User Interface to include features such as:
   - When the user enters words that are not covered by the grammar, a feedback message is displayed showing the invalid words.
   - A Help command which, when selected, displays all the available help commands, such as: type %display_grammar to view the grammar, type %exit to close the system, etc.
2. A Graphical User Interface (GUI) that includes all the features above.

1.3 Project Challenges

The main challenge encountered during this project was that access to the Oracle database was not permitted, in order to protect the integrity and security of the stored data. Instead, some files containing data from the database were provided in Excel format. The files contained many fields holding numbers, alphanumeric data and compound names, as well as many empty cells. A significant amount of time was needed to understand the data in each file and then to modify the files manually so that they could be processed properly.

1.4 Project Schedule

The Gantt chart in Appendix B shows the details of how the project was managed. The project was divided into small, manageable tasks. In addition, milestones were set to help control the progress of each phase. Most of the first semester was spent on understanding the project requirements and doing background research. In the second semester some tasks had to be rescheduled because of the time taken to design the system. Designing and implementing the system took a large amount of time owing to the author's inexperience with the Python programming language. As a result, the write-up and draft chapter were postponed until the design and implementation phase was at least half completed.

2. Background Research

2.1 Design Methodologies

A design methodology is an organised collection of procedures, techniques, tools and documentation aids that must be followed in order to control the process of developing an information system. Various design methodologies have been created, each with its own strengths and weaknesses. This section describes several methodologies that could be applied to this project and selects the most appropriate one.

2.1.1 The Traditional Life Cycle

The Traditional Life Cycle, often referred to as the Waterfall model, outlines the series of steps that should occur when building an information system. These steps usually occur in a predefined order, with a review at the end of each stage before the next can be started. Although there are many variants, the basic structure of the waterfall model is illustrated in Figure 1 [1].

Figure 1: Traditional waterfall life cycle model.

Dividing the development of a system into an orderly sequence of phases, each subdivided into more manageable tasks, gives control over the application development process. Criticisms of the Traditional Life Cycle are that it is inflexible, slow, costly and cumbersome because of its significant structure and tight control. The outputs that the system is meant to produce are usually decided very early in the development process, meaning that changes in the users' requirements can only be made at the end of the project [2].

2.1.2 Rapid Application Development (RAD)

The underlying objective of RAD is to produce high-quality systems quickly through the use of iterative prototyping. Prototyping is an iterative process in which users suggest modifications to each prototype before further prototypes, and eventually the final information system, are built. RAD is seen as a possible solution to the problems and pressures of the Traditional Life Cycle. It is mostly applied to projects in which requirements are not fully known. Despite the proposed advantages of the RAD approach, its speedy processes and lower cost may lead to lower overall system quality [1, 2].

2.1.3 Rational Unified Process (RUP)

The Rational Unified Process is an iterative process which advocates an increasing understanding of the problem through successive refinements and an incremental growth of an effective solution over multiple cycles. It incorporates the flexibility to accommodate new requirements or tactical changes to business objectives. Risks are usually identified and resolved sooner rather than later [4]. The Rational Unified Process consists of the following four phases:

1. Inception: identify the scope and initial plan of the project.
2. Elaboration: capture the requirements and design the system architecture.
3. Construction: build the first operational version of the system.
4. Transition: deliver an almost risk-free system to the end users.

Within each phase there is a series of iterations. An iteration represents a complete development cycle that results in an executable release, which grows from iteration to iteration to become the final system [4]. The Rational Unified Process is most appropriate for large projects where requirements are not well understood or are changing because of external changes, changing expectations, budget changes or rapidly changing technology [5].

2.1.4 Choice of Methodology

To identify which methodology is appropriate for this project, certain characteristics had to be considered. The characteristics of this project are as follows:

- The School of Music has provided documents which will be used to gather the requirements for this project [5]. Once gathered, these requirements are likely to stay stable during the system development life cycle. There will therefore be no need for further investigation or iterations to discover new requirements.
- The project objectives and deliverables are stated from the start of the project.
- The system developer will arrange meetings with the users for further inquiries about the project if need be.
- The project involves only one person, who is not fully experienced and has other commitments.

Since human resources are restricted to one person, and the project requirements and objectives are known from the start, the Waterfall Model with some flexibility appears to be the appropriate option for this project. The Rational Unified Process and Rapid Application Development methodologies are mostly used when requirements are not well understood from the start and are likely to change during the system development life cycle, so that more iterations are needed as users' requirements come to light. The orderly sequence of the Waterfall Model phases allows strict control of the project and ensures progress of the system development process.

2.2 Natural Language Processing

Natural language processing (NLP) can be described as an attempt to automatically manipulate and analyse natural (human) languages using computers. It encompasses tasks such as natural text processing, speech processing and many more. Many natural language processing tools exist that can be used to process natural languages automatically, each with its own strengths and weaknesses.
Many of these tools inherit techniques largely from Linguistics and Artificial Intelligence, and are also influenced by newer areas such as Machine Learning, Computational Statistics and Cognitive Science [13]. The next sections explore some of the tools used in natural language processing tasks.

2.2.1 MontyLingua

MontyLingua is a freeware natural language processing tool developed in the Python programming language. It is an entire suite of individual tools that can be used to process natural language text, ranging from raw text to the semantic interpretation of that text. It is more of an end-to-end natural language processor, and works straight out of the box. When an English sentence is entered into MontyLingua, the output is a semantic interpretation of that sentence, as shown in Figure 2.

Figure 2: MontyLingua interface [19]

Some of the tasks that MontyLingua can perform are outlined below:

- MontyTokenizer: splits a raw English sentence into its constituent tokens.
- MontyTagger: part-of-speech tagging enriched with common sense.
- MontyChunker: lightning-fast regular expression chunker.
- MontyExtractor: extracts phrases and subject/verb/object triplets from sentences [19].

2.2.2 Natural Language Toolkit

Like MontyLingua, the Natural Language Toolkit (NLTK) is open-source software written in the Python programming language. It is a collection of modules and corpora that allow students to learn and conduct research in Natural Language Processing. NLTK includes programming libraries that can be used to write programs that perform natural language tasks. Some tasks that can be performed using the NLTK libraries are listed below:

- Tokenization: the process of splitting a sentence into its constituent tokens. This can be difficult for languages such as Chinese and Arabic, since these languages do not have explicit boundaries.
- Part-of-speech (POS) tagging: the task of labelling each word in a sentence with its appropriate POS tag.
- Parsing: the process of constructing a parse tree for a given sentence.

NLTK modules contain source code, written in Python, that can be used or modified to accomplish natural language tasks [13].

2.2.3 General Architecture for Text Engineering (GATE)

GATE is a general Java-based toolkit that provides an infrastructure for building language engineering systems. It provides resources for performing all sorts of natural language processing tasks, an architecture which describes how language processing components connect to each other, and a graphical environment. GATE is free open-source software and comes with a set of modules that are used for information extraction and text mining [20]. Figure 3 shows GATE's environment.

Figure 3: GATE's development environment [20].

2.2.4 Choosing a Natural Language Processing Tool

This section compares the natural language processing tools described above and then selects the most appropriate one for this project. The main requirement for this work is to port the data from the Oracle database to an open-source natural language processing toolkit. The ultimate goal is to allow users to query the database using natural language. GATE is analysed first, followed by MontyLingua and finally NLTK.

The GATE framework is more of a template for building language engineering systems than a natural language processing tool. It provides a graphical environment that allows users to manipulate collections and documents easily. Figure 3 shows GATE's development environment with its Information Extraction components loaded. The use of GATE is limited to the Java programming language [20]. Python is the language chosen to build the system that will fulfil the requirements of this project. Python was chosen because of its ease of use compared to other programming languages. Python programs are easy to port and can be readily modified to run on different platforms. For these reasons the GATE framework was not suited to the requirements of this project.

Both MontyLingua and NLTK were developed to process English text using Python, but NLTK was written with the aim of teaching, allowing students to learn and to extend the existing components in their own linguistics tasks. Unlike MontyLingua, NLTK comes with large programming libraries that can be used and modified to accomplish natural language tasks [13]. For these reasons, the Natural Language Toolkit was selected as the tool best suited to the requirements of this work. NLTK is free open-source software licensed under the GNU General Public License.
The current version of NLTK and its installation guide can be downloaded from the project's website. Below are the main NLTK modules used to build the concertdatabase system:

- WordTokenizer: for splitting text into tokens.
- cfg (Context Free Grammar): terminal and nonterminal symbols dictating how constituents are expanded into other constituents.

- trees: for representing hierarchical structures according to some context free grammar, including drawing trees graphically.
- Parsers: for implementing parsing algorithms [13].

2.3 Querying with Natural Language

A natural language query system allows a user to access stored data by entering queries in a natural language such as English. This method of accessing data stored in a database is easy and convenient, because the user does not need to learn a complicated query language such as SQL in order to access the data. On the other hand, natural language systems present some disadvantages that users have to cope with: most of them can only understand the grammar that is stored in the system domain. If the query contains words that are not covered by the system domain grammar, interpretation of the query fails. Users' queries are therefore limited by the system domain, where only certain types of queries can be interpreted [11].

Building Natural Language Systems

One of the main components to consider when building a natural language system is the parser. A parser breaks a sentence down into its component parts of speech (POS) using a provided grammar, and then constructs a parse tree. For example, the sentence "The children ate the cake" would be parsed as shown in Figure 2. The meaning of the node labels (i.e. parts of speech) is given in the key below the parse tree [12].

(S
  (NP (AT The) (NNS children))
  (VP (VBD ate)
      (NP (AT the) (NN cake))))

S   ---> sentence
AT  ---> article
NN  ---> noun
NNS ---> noun, plural
VBD ---> verb, past tense
VP  ---> verb phrase
NP  ---> noun phrase

Figure 2: Each node is assumed to be a constituent [12].
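As an illustration of the parse tree in Figure 2, the sketch below stores the tree as nested Python tuples and walks it to recover the tagged words. It is only a minimal sketch: the tuple layout and the helper names `tagged_leaves` and `bracketed` are assumptions made for this example and are not part of NLTK or of the concertdatabase system.

```python
# Parse tree for "The children ate the cake" as nested tuples.
# Inner nodes are (label, child, child, ...); leaves are (tag, word).
# This representation is an assumption for illustration only.
TREE = (
    "S",
    ("NP", ("AT", "The"), ("NNS", "children")),
    ("VP", ("VBD", "ate"), ("NP", ("AT", "the"), ("NN", "cake"))),
)


def is_leaf(node):
    """A leaf holds exactly one child, and that child is the word itself."""
    return len(node) == 2 and isinstance(node[1], str)


def tagged_leaves(node):
    """Return (word, tag) pairs in left-to-right order."""
    if is_leaf(node):
        return [(node[1], node[0])]
    pairs = []
    for child in node[1:]:
        pairs.extend(tagged_leaves(child))
    return pairs


def bracketed(node):
    """Render the tree in bracket notation, e.g. (NP (AT the) (NN cake))."""
    if is_leaf(node):
        return "(%s %s)" % node
    return "(%s %s)" % (node[0], " ".join(bracketed(c) for c in node[1:]))


if __name__ == "__main__":
    print(tagged_leaves(TREE))
    print(bracketed(TREE))
```

Reading the leaves from left to right gives back the original sentence together with the part-of-speech tag of each word, which is the information a later keyword-extraction step would rely on.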

Most parsers require a grammar to construct a parse tree. There are two main types of grammar used for parsing: syntactic grammar and semantic grammar. A syntax-based system is built using a syntactic grammar, whereas a semantic grammar based system is built using a semantic grammar [11]. This section analyses the differences between these two types of grammar. One of them may be applied in the implementation of this project's concertdatabase system.

Syntax-based System

In syntax-based systems the user's query is parsed using a grammar which describes the possible syntactic structures of the entered query. If the sentence is valid according to the syntactic grammar, a parse tree of the sentence is constructed; otherwise parsing fails. An example of a syntactic grammar is shown below.

S   ---> NP VP
NP  ---> Det N
VP  ---> V N
Det ---> which | what | when
N   ---> piano | artist | guitar | violin
V   ---> play | plays | sing | find

The grammar above describes the possible structures of a valid sentence. It says that a sentence (S) consists of a noun phrase (NP) followed by a verb phrase (VP), that a determiner (Det) can be "which", "what" or "when", and so on. A syntax-based system uses this grammar to construct the possible structure of the entered sentence. If the sentence contains words that are not covered by the grammar in the system domain, the system will not be able to parse it. For example, the sentence "which artist plays piano" will be parsed as shown in Figure 3, whereas the sentence "which artists play piano" will not be parsed. This is because the noun "artists" is not covered by the system grammar. Even though "artist", the singular of "artists", is in the domain, to the system these are two distinct words [11].
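The behaviour described above can be sketched in a few lines of plain Python. This is a hedged illustration rather than the project's NLTK-based parser: the `LEXICON` dictionary and the `parse` function are names invented for this example. Because the grammar's only productions are S -> NP VP, NP -> Det N and VP -> V N, every valid sentence has exactly the shape Det N V N, so the sketch checks that shape directly.

```python
# Word lists taken from the example syntactic grammar above.
LEXICON = {
    "Det": {"which", "what", "when"},
    "N": {"piano", "artist", "guitar", "violin"},
    "V": {"play", "plays", "sing", "find"},
}


def parse(sentence):
    """Return a nested-tuple parse tree, or None if the sentence is not valid.

    The grammar only generates sentences of the form Det N V N, so a
    fixed-shape check is sufficient for this particular grammar.
    """
    words = sentence.lower().split()
    categories = ("Det", "N", "V", "N")
    if len(words) != len(categories):
        return None
    for word, category in zip(words, categories):
        if word not in LEXICON[category]:
            return None  # word not covered by the grammar
    det, noun, verb, obj = words
    return (
        "S",
        ("NP", ("Det", det), ("N", noun)),
        ("VP", ("V", verb), ("N", obj)),
    )


if __name__ == "__main__":
    print(parse("which artist plays piano"))   # a parse tree
    print(parse("which artists play piano"))   # None: "artists" is unknown
```

The second call fails for the reason given in the text: "artists" is not listed under N, and the system has no notion that it is the plural of "artist".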

(S
  (NP (Det Which) (N artist))
  (VP (V plays) (N piano)))

Figure 3: A syntax-based parse tree

Semantic Grammar Systems

Semantic grammar systems rely mainly on semantic concepts rather than on the syntactic classification of words. These semantic concepts, which often consist of more than one word, are combined to build larger concepts until a specific, predefined sentence form is reached. Semantic grammars are built to contain knowledge that is specific to a particular domain. If the knowledge base changes, the semantic grammar has to be changed as well to reflect these changes; otherwise the grammar would be useless for the new knowledge domain. For example, the question "which rock contains magnesium" would be parsed as shown in figure 5 using the semantic grammar in figure 4.

S                   ---> Specimen_question | Spacecraft_question
Specimen_question   ---> Specimen_spec Emits_info | Specimen_spec Contains_info
Specimen_spec       ---> which rock | which specimen
Emits_info          ---> emits Radiation
Radiation           ---> radiation | light
Contains_info       ---> contains Substance
Substance           ---> magnesium | calcium
Spacecraft_question ---> Spacecraft Depart_info | Spacecraft Arrive_info
Spacecraft          ---> which vessel | which spacecraft
Depart_info         ---> was launched on Date | departed on Date
Arrive_info         ---> returns on Date | arrives on Date

Figure 4: A semantic grammar [11].
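To show how such a grammar drives parsing, the following sketch implements a small top-down backtracking parser over the specimen half of the semantic grammar in Figure 4. It is an illustrative sketch under stated assumptions: the `GRAMMAR` dictionary, and the functions `expand`, `match_sequence` and `parse`, are invented for this example; the nonterminal on the right-hand side of Specimen_question is assumed to be the same symbol as Specimen_spec; and the spacecraft rules are left out because Date is not defined in the figure.

```python
# Specimen rules of the semantic grammar; each nonterminal maps to a list
# of alternative productions. Symbols without an entry are terminal words.
GRAMMAR = {
    "S": [["Specimen_question"]],
    "Specimen_question": [
        ["Specimen_spec", "Emits_info"],
        ["Specimen_spec", "Contains_info"],
    ],
    "Specimen_spec": [["which", "rock"], ["which", "specimen"]],
    "Emits_info": [["emits", "Radiation"]],
    "Radiation": [["radiation"], ["light"]],
    "Contains_info": [["contains", "Substance"]],
    "Substance": [["magnesium"], ["calcium"]],
}


def expand(symbol, words, pos):
    """Yield (tree, next_pos) for every way symbol can match words[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next word
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:
        yield from match_sequence(symbol, production, words, pos)


def match_sequence(label, production, words, pos):
    """Match each part of a production in turn, keeping all partial matches."""
    partial = [((label,), pos)]
    for part in production:
        extended = []
        for tree, p in partial:
            for subtree, q in expand(part, words, p):
                extended.append((tree + (subtree,), q))
        partial = extended
    yield from partial


def parse(question):
    """Return the first parse tree covering the whole question, else None."""
    words = question.lower().split()
    for tree, end in expand("S", words, 0):
        if end == len(words):
            return tree
    return None


if __name__ == "__main__":
    print(parse("which rock contains magnesium"))
    print(parse("which rock contains iron"))  # None: "iron" is not a Substance
```

Note how the domain knowledge lives in the grammar itself: supporting a new substance or a new kind of question means editing `GRAMMAR`, which mirrors the maintenance cost of semantic grammars described above.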

(S
  (Specimen_question
    (Specimen_spec Which rock)
    (Contains_info contains
      (Substance magnesium))))

Figure 4: Parse tree built with semantic grammar [11]

2.4 Database Models

This section describes relational and flat-file databases and outlines the strengths and weaknesses of these two database models. Other models, such as the hierarchical model and the network model, also exist. Relational and flat-file databases are considered here because the system being built needs to port the data stored in an Oracle database, which is a relational database. A flat-file database is considered as an alternative method for storing the data from the Oracle database.

Relational Database

A database can be described as a collection of records or data stored on a computer system. A relational database is a structured collection of data stored in tables, also called relations. Each table or relation has a unique name. In a relational database, tables can be joined together during a search so that the search is performed, and the results displayed, across the joined tables. Software called a Database Management System (DBMS) controls the organisation, storage and retrieval of data in a relational database [17]. Relational databases present some advantages and disadvantages compared to other models.

Advantages:

- Security: a relational database provides better protection of stored data against outside intrusion.
- Concurrency control: a mechanism that ensures that multiple users can access data concurrently in a controlled manner without compromising the integrity of the stored data.
- Accessibility of data: users can retrieve, store and update data in the database in an organised manner [17].

Disadvantages:

A relational database system requires a large amount of storage space for the data and for the software that controls the retrieval and manipulation of data. It also requires users to learn an artificial structured language such as SQL in order to retrieve and manage stored data. Many natural language processing queries cannot be expressed in SQL, making it difficult to manipulate data stored in a relational database using a natural language processing toolkit such as NLTK [17].

Flat-file database

A flat-file database can be described as a simple database system designed around a single table. In contrast to relational databases, in which data can be stored in multiple tables, flat-file databases store everything in one single table or list. The fields in a record can be delimited by whitespace, fixed widths, tabs, commas (CSV) or other characters [18]. Data stored in flat-file databases are prone to corruption because there is no control mechanism to manage access to, and modification of, the data. Searching a flat-file database requires going through every record until the required record is found. As the amount of data in the database grows, accessing it becomes very challenging. There is no mechanism to prevent duplicate data or records. These disadvantages are not present in relational databases. Despite these disadvantages, flat-file databases present some advantages over relational databases.

Advantages over relational databases:

- Available and versatile: a flat-file database can be implemented on any operating system. There is no need to install additional software to manage the flat files stored in the database. Flat files can be read by various programs.
- Smaller and easier: flat files use less space than a relational database, are easy to create, and are particularly useful for making data available and accessible to other programs [17].

The Python programming language comes with a csv module for reading flat files saved in CSV format, which facilitates the manipulation of the data stored in the flat-file database. Many natural language queries can be formulated to access data stored in CSV format, making CSV a suitable candidate format for storing the flat files of this project.
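A minimal sketch of reading a CSV flat file with Python's standard csv module, and of the record-by-record search described above, is given below. The sample data, the column names and the `search` function are hypothetical and were made up for this illustration; they do not come from the concert life files.

```python
import csv
import io

# Hypothetical flat-file contents. The second line shows a field that
# contains a comma and is therefore enclosed in double quotes.
SAMPLE_TEXT = (
    "id,name,instrument\n"
    '1,"Example, Ann",piano\n'
    "2,Performer B,piano\n"
    "3,Performer C,violin\n"
)


def search(csv_file, keyword):
    """Scan every record and return those with a field containing keyword.

    This is a linear search: each record is visited once, which is the
    behaviour (and the cost) of searching a flat-file database.
    """
    reader = csv.reader(csv_file)
    header = next(reader)  # first line holds the column names
    keyword = keyword.lower()
    matches = []
    for row in reader:
        if any(keyword in field.lower() for field in row):
            matches.append(dict(zip(header, row)))
    return matches


if __name__ == "__main__":
    for record in search(io.StringIO(SAMPLE_TEXT), "piano"):
        print(record)
```

For a real file, `io.StringIO(SAMPLE_TEXT)` would be replaced by an open file object, for example `open("people.csv", newline="")`.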

3. Requirements Analysis

This section covers the first phase of the Waterfall Model methodology used for this project. It involves methods for gathering and defining the user requirements for the system. This phase is crucial to ensure that user requirements are not misunderstood from the beginning of the project. It looks at the requirements for the new system and at why the existing system cannot meet them.

3.1 Method used to gather requirements

A number of techniques exist for gathering system requirements, including observation, interviewing, questionnaires and studying documentation [2]. Studying documentation was the method used in this project to gather the requirements for the new system. This method was chosen because the project was assigned by an external client. Researchers from Leeds School of Music had already undertaken a pilot project to collect a database of records documenting concert life in nineteenth-century London and had designed a website that allows users to access the information stored in the database. A couple of meetings were arranged with Rachel Cowgill from the School of Music, who is responsible for running the entire project. She provided some documentation and advised us to visit the pilot website to gain more insight into how the system works.

3.2 Gathering Requirements

From the provided documentation and the website it was established that the existing system was designed to store data collected from journals, newspaper adverts and reviews documenting concert life in nineteenth-century London, and to allow users access to the stored data. The collected data were stored in an Oracle relational database. To access the stored data, users are required to learn SQL. The aim of this project is to contribute to the development of the project by allowing users to access the collected data using Natural Language Processing methods.
Since the existing database system was designed using the relational model, accessing data using natural language processing methods would be impossible unless an interface that translates natural language queries into SQL queries was built. Building such a system would be very difficult and computationally expensive.

3.3 Functional Requirements

This section lists the functional requirements gathered from reading the documentation provided by Rachel Cowgill and from observing the pilot concertlife project website. They are categorised as essential or desirable. Those classified as essential need to be achieved in order to meet the minimum requirements of this project; those classified as desirable are the possible extensions for this project, which add more capabilities to the system.

No.  Requirement                                                        Type
1    A different format concertlife database                            Essential
2    User interface that links to stored database                       Essential
3    Allows users to query database using natural language (e.g.        Essential
     English)
4    Display results from the database                                  Essential
5    Display syntax errors from query                                   Desirable
6    Display welcome message                                            Desirable
7    Option to display help commands                                    Desirable
8    Option to display parse tree for the query                         Desirable
9    Graphical User Interface (GUI)                                     Desirable

Table 1: Functional Requirements [15]

4. System Design

This section describes the design of the system that I propose to build in order to meet the system requirements discussed in the previous section. The system architecture is shown in figure 5. As mentioned in the previous sections, the purpose of this project is to build a system that accepts users' queries in natural language and processes them to search the database. Building a system that fully understands natural language queries is very difficult because of the syntactic ambiguities that a sentence may have. For example, in "old men and women", does "old" have a wider scope than "and", or is it the other way around? Another example is "I saw the man with a telescope": who has the telescope? Building a system that represents all the semantic meanings of such sentences is very complex because of the vast amount of information that needs to be recorded in order to parse them [15, 16].

In this project I propose to build a system that restricts user queries to a small domain. In this approach I have created a syntactic grammar that is used to parse users' queries. If parsing succeeds, a parse tree is generated according to the grammar specification, and a keyword is then extracted from the parse tree to search the database. If parsing fails, an error message is displayed informing the user that the system does not cover some of the words in the query. The description of the proposed system is structured as follows:

- System Architecture: gives an overview of the concertdatabase system.
- Database Design: describes the database model used to store the data converted from the Oracle database.
- Parser: describes how queries are parsed using a syntactic grammar.
- Keyword Extraction: describes how the keyword used to search the database is extracted from the parse tree.
- User Interface Design: describes the design of the interface that allows users to enter their queries and displays output from the database.
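The flow just described (check the query against the grammar, report uncovered words, extract a keyword, search the flat file) can be sketched end to end as follows. This is only a sketch under several assumptions, not the implemented concertdatabase program: the word lists reuse the example syntactic grammar from the background research chapter, the records in `RECORDS_CSV` are hypothetical, and the rule "use the final noun as the keyword" is an assumed placeholder because the actual extraction rule depends on the query type.

```python
import csv
import io

# Word lists from the example syntactic grammar (S -> NP VP, NP -> Det N,
# VP -> V N); every valid query therefore has the shape Det N V N.
LEXICON = {
    "Det": {"which", "what", "when"},
    "N": {"piano", "artist", "guitar", "violin"},
    "V": {"play", "plays", "sing", "find"},
}
PATTERN = ("Det", "N", "V", "N")

# Hypothetical flat-file contents; not the real concert data.
RECORDS_CSV = "name,instrument\nPerformer A,piano\nPerformer B,violin\n"


def parse(words):
    """Return [(category, word), ...] if the words fit PATTERN, else None."""
    if len(words) != len(PATTERN):
        return None
    if all(w in LEXICON[c] for w, c in zip(words, PATTERN)):
        return list(zip(PATTERN, words))
    return None


def extract_keyword(tagged):
    """Assumed rule: use the final noun of the query as the search keyword."""
    return [word for category, word in tagged if category == "N"][-1]


def search(keyword, csv_text):
    """Return every record that has a field equal to the keyword."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip the header line
    return [row for row in rows if keyword in (f.lower() for f in row)]


def answer(query, csv_text=RECORDS_CSV):
    """Run the whole pipeline and return either records or a message."""
    words = query.lower().split()
    vocabulary = set().union(*LEXICON.values())
    unknown = [w for w in words if w not in vocabulary]
    if unknown:
        return "Not covered by the grammar: " + ", ".join(unknown)
    tagged = parse(words)
    if tagged is None:
        return "Query does not match the grammar."
    matches = search(extract_keyword(tagged), csv_text)
    return matches if matches else "No match found."


if __name__ == "__main__":
    print(answer("which artist plays piano"))
    print(answer("which artists play piano"))
```

Keeping each stage in its own function mirrors the structure listed above, so that the grammar, the extraction rule or the storage format can be changed independently.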

Query -> Parser -> Keyword Extraction -> Search Database -> Query Results

Figure 5: ConcertDatabase System Design

4.1 System Architecture

The system architecture shown in figure 5 is designed to accept a user's query, parse the query, extract the keyword from the query and use it to search the database, and then display the results to the user.

4.2 Database Design

Database models were discussed in a previous section. This section describes the database model chosen for this system. As mentioned previously, data stored in the Oracle database can only be accessed using SQL. It was established that the data stored in the Oracle database had to be converted into a different format to allow users to access the data using natural language (e.g. English). The candidate database model to consider is the flat-file database. Relational databases have functions that allow access to multiple tables in the database through a single connection. A flat-file database can combine tables together to emulate such behaviour. The problem with combining tables in this way is that the database will contain different kinds of data stored in one big table, making the search process very slow. Another possibility would be to save each table from the Oracle database separately in a flat-file format. This approach results in a new file-open operation for each table [17]. Despite these disadvantages, a flat-file database offers a cheap way to build the ConcertLifeDatabase system.

Flat-File Format

Compared to other flat-file formats such as tab-delimited or fixed-width, the Comma Separated Value (CSV) format is the most common format for transferring data between database applications. CSV files are easy to create and maintain. The Python programming language comes with a csv module that provides classes to read and write CSV files. Since Python is the programming language chosen to implement the concertdatabase system, and for the reasons mentioned above, it seems reasonable to use the CSV file format to create the flat-file database. In its simplest form a CSV file consists of records and fields. Each field is delimited by a comma, and records are separated by a suitable end-of-line character such as a carriage return. Below is an example of a CSV file [18].

Artists, concert life database, , 182, 222 Instruments like piano, 3033, 7890, , 1, blaise, UK, 2, 4

The CSV example above has three records and five fields. If a field contains commas, leading spaces or carriage-return characters, the field is enclosed in double quotes. Oracle databases provide functionality to import or export data in CSV format.
The Oracle export facility was not used in this project, since Rachel Cowgill provided some files in Excel format, which were then converted to CSV format by saving each file with the .csv extension, as in people.csv. Below is a figure of one of the files provided by Rachel Cowgill, which was converted to CSV format.

Figure 6: people.csv

The flat-file database is created on the server where the Python interpreter resides. If the database is created in a different location from the Python interpreter, a path to the database must be specified so that the Python program can locate the files. It is therefore much easier to create the flat-file database in the same location as the Python interpreter.

4.3 Parser

This section describes the method used to parse users' queries in order to extract a keyword with which to search the database. Syntactic and semantic grammars were described in a previous section. Semantic grammars are mostly used to parse queries so as to map the semantic meaning of the query to some predefined representation of the query in the system domain. To extract a keyword from the user's query, a syntax-based grammar is the most appropriate, as it classifies each word in a query into its part-of-speech category and then builds a parse tree using the syntactic grammar. From this tree a keyword can then be extracted.
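The idea of classifying each word and then building a tree can be sketched without any toolkit. The example below is only an illustration in Python 3: it is not a chart parser and does not use NLTK. The lexicon, the single query pattern and the names tag, build_tree and extract_keyword are all hypothetical, and the choice of the adjective as the keyword is an assumption made for this one query shape.

```python
# Hypothetical lexicon: word -> part-of-speech category.
LEXICON = {
    "find": "VB", "show": "VB", "retrieve": "VB",
    "all": "Prep", "the": "Det",
    "russian": "ADJ", "french": "ADJ",
    "artist": "N", "artists": "N", "composer": "N",
}

# One hypothetical query shape: verb, preposition, determiner, adjective, noun.
FIND_PATTERN = ["VB", "Prep", "Det", "ADJ", "N"]

def tag(query):
    """Pair each lower-cased word with its category (None if unknown)."""
    return [(word, LEXICON.get(word)) for word in query.lower().split()]

def build_tree(tagged):
    """Return a nested-tuple parse tree, or None if the pattern does not match."""
    if [category for _, category in tagged] != FIND_PATTERN:
        return None
    vb, prep, det, adj, noun = [(category, word) for word, category in tagged]
    return ("S", ("FindQ", vb, ("PP", prep, det), adj, noun))

def extract_keyword(tree):
    """Pick the adjective of a FindQ tree as the search keyword."""
    find_q = tree[1]   # the FindQ node below the root S
    adj = find_q[3]    # children follow the label: VB, PP, ADJ, N
    return adj[1]      # the word stored under the ADJ node

tree = build_tree(tag("Find all the Russian artist"))
print(tree)
print(extract_keyword(tree))
```

A general parser replaces the fixed pattern with grammar rules, but the two stages, tagging words and then reading a keyword off the tree, stay the same.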

4.4 Keyword Extraction

To search the database, a keyword is extracted from the parse tree. How the keyword is extracted depends on the type of query entered by the user: the keyword extracted from a "which" query is expected to be different from the one extracted from a "find" or "retrieve" query.

4.5 User Interface Design

The user interface allows users to enter queries in natural language and displays the results from the database. It displays a welcome message and a prompt inviting the user to enter a query or another command, such as the help command or the exit command. Figure 7 shows what the implemented user interface looks like.

Figure 7: User Interface for the concertdatabase system.

The concertdatabase has been tested at the Windows command prompt and in a UNIX shell. To run the program, enter python concertdatabase at the command prompt, then press Enter. A screen similar to the one shown in figure 7 should appear with the welcome messages. Below is a list of the concertdatabase features:

- When a user enters a valid query, the system should parse the query, extract a keyword, search the database, then display the results if there is a match or a "no match" message if there is not.
- When a user enters an invalid query, the system should display a message containing the invalid word(s), followed by an active prompt so that the user can enter a new query.

- When a user enters the %show_grammar command, the syntactic grammar used by the system should be displayed.
- When a user enters the %help command, a list of available commands should be displayed.
- When a user enters the %exit command, the concertdatabase should close.
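The features listed above amount to a small command loop. The following Python 3 sketch shows one way such a loop could be organised. It is not the project's code: the names handle_input and run, the message texts, the shortened grammar text and the case-insensitive treatment of commands are all assumptions made for illustration.

```python
# Hypothetical texts, used only for this sketch.
GRAMMAR_TEXT = "S -> FindQ | WHQ | SHOW"
HELP_TEXT = "Commands: %help, %show_grammar, %exit"

def handle_input(line, answer_query):
    """Return (output, keep_running) for one line typed at the prompt."""
    command = line.strip().lower()
    if command == "%exit":
        return "Goodbye.", False
    if command == "%help":
        return HELP_TEXT, True
    if command == "%show_grammar":
        return GRAMMAR_TEXT, True
    # Anything else is treated as a natural language query.
    return answer_query(line), True

def run(read_line=input, write=print, answer_query=lambda query: "no match"):
    """Prompt repeatedly until the user enters %exit."""
    write("Welcome to the concert database.")
    keep_running = True
    while keep_running:
        output, keep_running = handle_input(read_line("> "), answer_query)
        write(output)
```

Calling run() starts the prompt. Passing the input, output and query functions as parameters keeps the loop easy to test with scripted input.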

5. Implementation and Testing

This section discusses the implementation and testing of the system, including the software needed for the concertdatabase system to run properly. The first requirement of this project was to reproduce the data stored in the Oracle relational database in a different format; the reproduced database was then to be ported to an open-source natural language toolkit to allow users to query the database using natural language. This section includes:

- Database Implementation
- Natural Language Toolkit
- Choice of Programming Language
- User Interface Implementation
- Parser Implementation
- Keyword Extraction Implementation
- Database Search

5.1 Database Implementation

As explained in previous sections, a flat-file database was the appropriate model for storing the data from the Oracle database. The flat-file database contains files saved in comma separated value (CSV) format, for the reasons given in section 4. The files provided by Rachel Cowgill were concert_items.xls, people.xls, venues.xls and works.xls. These files were converted into CSV format and saved on the server where the Python interpreter and the Natural Language Toolkit (discussed below) reside. The four files contain many empty fields and a large amount of alphanumeric data. A screenshot of one of the files is shown below in figure 8. Since a keyword is used to search the database, a sequential search is performed across the entire database until a match is found. If a match is found, the system outputs the entire matching row(s) in CSV format. If there is no match, the system outputs a message informing the user that there is no match.

Figure 8: Flat-file in CSV format.

5.2 Natural Language Toolkit

An earlier section discussed the reasons why NLTK was selected as the Natural Language Processing toolkit for this project, and they are not repeated here. NLTK was downloaded from the NLTK website. It is released under an open-source licence and is available for students to modify and use in their Natural Language Processing tasks. NLTK must be saved in the same location as the Python interpreter to make it easier for a Python program to import the modules needed to perform NLP tasks. An NLTK module must be imported before it can be used.

5.3 Choice of Programming Language

Python was chosen to implement the concertdatabase partly because it is a more relaxed programming language than, for example, Java. In Java the type of a variable must always be declared before the variable is used, whereas in Python it need not be. Python programs are also much easier to use and modify for natural language processing tasks. The main reason for using Python, however, was that NLTK is written in Python, so it made perfect sense to choose it.

5.4 User Interface Implementation

The concertdatabase was designed to run at a Windows command prompt or in a UNIX shell. To run the program, enter python concertdatabase.py at the command prompt, as shown in figure 9.

Figure 9: User Interface for the concertdatabase.

If a user enters %help to view the available commands, the output is as shown in figure 10.

Figure 10: Output of %help command.

If a user enters %show_grammar, the result is as shown in figure 11.

Figure 11: Result of %show_grammar command.

If a user enters %exit, the result is as shown in figure 12.

Figure 12: Result of %exit command.

5.5 Parser Implementation

Before a keyword is extracted, a syntactic parser needs to process the query and build the parse tree. NLTK comes with a module that contains parsers; this module has to be imported into the concertdatabase before the parser can be used. In addition, a syntactic grammar needs to be created for the parser to use. Figure 13 shows the syntactic grammar that was created for the concertdatabase system.

    # parse_fcfg comes from NLTK; its import is not shown in the original listing.
    grammar = parse_fcfg('''
    # sentences
    S -> FindQ
    S -> WHQ
    S -> SHOW
    # A which-question
    WHQ -> WH N Vbd NP
    WHQ -> WH N Vbd N
    # A find, show, retrieve question
    FindQ -> VB PP ADJ N
    SHOW -> VB PP N
    VB -> "find" | "show" | "retrieve"
    NP -> Det N
    WH -> "which"
    PP -> Prep Det
    ADJ -> "russian" | "french" | "english" | "italian" | "swedish" | "france" | "german"
    Det -> "the" | "a" | "an"
    Prep -> "all" | "to" | "for" | "by" | "before" | "after" | "during"
    Vbd -> "played" | "performed" | "sang"
    N -> "artist" | "artists" | "performers" | "piano" | "guitar" | "clarinet" | "brown" | "violin" | "violinist"
    N -> "composer" | "france" | "germany" | "publisher" | "singer"
    ''')

Figure 13: Syntactic grammar used to parse users' queries.

The following piece of code creates a function parse(sentence) that takes a sentence as its parameter, splits the sentence into tokens, then returns the parse tree(s) of the sentence.

    import nltk

    def parse(sentence):
        tokens = sentence.split()  # split the query on whitespace
        # bottom-up chart parser over the grammar of figure 13
        parser = nltk.ChartParser(grammar, nltk.parse.BU_STRATEGY)
        return parser.nbest_parse(tokens)

For example, suppose a user enters the query "Find all the Russian artist".

This query would be parsed as shown in figure 14.

Figure 14: A parse tree for the query "Find all the Russian artist".

If a user enters words that are not covered by the syntactic grammar, the system displays an error message informing the user of the invalid words, as shown in figure 15. In this example the user enters the query "find all the frenchs artis", which contains the invalid words "frenchs" and "artis", and the system displays the error message "Grammar does not cover some of the input words: frenchs, artis". This means that the parser fails to parse the user's query because some of the input words are not in the grammar.

Figure 15: Output displaying the error message.
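The coverage check behind this error message can be illustrated by hand. The Python 3 sketch below does not show how the project's program or NLTK produces the message. It assumes a hypothetical vocabulary (a subset of the terminals in figure 13), hypothetical function names, and that the query is lower-cased before it is checked.

```python
# Hypothetical vocabulary: a subset of the terminals in the grammar of figure 13.
VOCABULARY = {
    "find", "show", "retrieve", "which", "all", "the", "a", "an",
    "russian", "french", "english", "artist", "artists", "composer",
}

def uncovered_words(query, vocabulary=VOCABULARY):
    """Return the words of the query that are not in the vocabulary, in order."""
    return [word for word in query.lower().split() if word not in vocabulary]

def check_query(query):
    """Return an error message for an invalid query, or None if every word is covered."""
    missing = uncovered_words(query)
    if missing:
        return "Grammar does not cover some of the input words: " + ", ".join(missing)
    return None

print(check_query("find all the frenchs artis"))
```

Reporting the offending words, rather than only rejecting the query, lets the user correct a misspelling and try again at the prompt.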

5.6 Keyword Extraction Implementation

The following code extracts a keyword from the parse tree, depending on the type of query entered by the user. The function interpret takes a sentence, parses it, and calls the function eval_s(x) on each resulting parse tree (there may be more than one). eval_s examines the node directly below the root S (tree[0]) in order to work out what type of query the user has entered. Once the type of query is established, the appropriate function is called to extract a keyword with which to search the database.

    def interpret(sentence):
        # evaluate every parse tree returned for the sentence
        return map(lambda x: eval_s(x), parse(sentence))

    def eval_s(tree):
        # which question
        if tree[0].node == 'WHQ':
            output = eval_whq(tree[0])
        # find question
        elif tree[0].node == 'FindQ':
            output = eval_findq(tree[0])
        # show question (eval_show is not included in this listing)
        elif tree[0].node == 'SHOW':
            output = eval_show(tree[0])
        else:
            output = None
        return output

    # return a noun
    def eval_whq(tree):
        noun = eval_n(tree[3])
        return noun

    def eval_n(tree):
        return tree[0]

    # return an adjective
    def eval_findq(tree):
        adj = eval_j(tree[2])
        return adj

    def eval_j(tree):
        return tree[0]

5.7 Database Search

To search the database, a keyword must first be extracted from the parse tree. The database is then searched for the keyword. If a match is found, the row(s) in which the keyword appears is/are displayed in the shell, as shown in figure 16.

Figure 16: Result from database

If there is no match in the database, a message reflecting this is displayed, as shown below.

Figure 17: No match in the database.
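A sequential keyword search over CSV data of the kind described in this section might look like the following Python 3 sketch. It is an illustration only, not the project's search code. The sample rows, the names search and report, the case-insensitive substring test and the decision to collect every matching row are assumptions.

```python
import csv
import io

# Hypothetical table standing in for a file such as people.csv.
PEOPLE_CSV = (
    "1,Anna Petrova,russian,singer\n"
    "2,Jean Martin,french,violinist\n"
    "3,Ivan Orlov,russian,composer\n"
)

def search(keyword, tables):
    """Scan every row of every table and collect the rows containing the keyword."""
    keyword = keyword.lower()
    matches = []
    for text in tables:
        for row in csv.reader(io.StringIO(text)):
            # Case-insensitive substring test against each field (an assumption).
            if any(keyword in field.lower() for field in row):
                matches.append(row)
    return matches

def report(keyword, tables):
    """Return the matching rows as comma-joined lines, or a no-match message."""
    rows = search(keyword, tables)
    if not rows:
        return "No match found for '%s'." % keyword
    return "\n".join(",".join(row) for row in rows)

print(report("russian", [PEOPLE_CSV]))
print(report("swedish", [PEOPLE_CSV]))
```

Because every row of every table is examined, the cost grows with the total size of the database, which is the slowness noted in the design discussion of flat-file databases.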

6. Evaluation

This section evaluates the implementation of the concertdatabase system to establish whether it has met the minimum requirements set at the outset of this project. The most obvious way to evaluate the built system is against those minimum requirements. This section also evaluates the system against the possible extensions that were proposed to be built, time permitting. The aim of this project was to port the data from the Oracle database to an open-source toolkit such as the Natural Language Toolkit.

6.1 Meeting the Functional Requirements

The first objective was to reproduce the data in the Oracle database in a different format in order to allow users to query the database using natural language. Since it was not possible to access the Oracle database, Rachel Cowgill from the School of Music provided the author with some files in Excel format containing the data from the Oracle database. A flat-file database was created, with the files saved in Comma Separated Value format to make access from the Python programming language easier. The Natural Language Toolkit was the open-source Natural Language Processing toolkit of choice for this project. It was downloaded and used to build the concertdatabase system. I believe that I have achieved this first objective. The second objective was to search the reproduced database using natural language. Again, I believe that I achieved this objective. The third was to implement a user interface that allows users to enter their queries and displays the results. This objective was also achieved.

6.2 Meeting Possible Extensions

Following the feedback that I received on my mid-term report, which suggested that the possible extensions might not be sufficient, I decided to modify them to include the following:

An extension to the User Interface to include features such as:

- When a user enters words that are not covered by the grammar, a message is displayed showing the invalid words.
- A help command which, when selected, should display all the available commands, such as %display_grammar, %display_tree, %exit, etc.

I believe that the system that I have built features these extensions. The GUI extension was not developed due to lack of time.

6.3 Conclusion

Building systems that can query databases using natural language is a very difficult task, due to the fact that languages contain many ambiguous words. The approach used in this project was to restrict the types of query that the system can accept. If a query is not covered by the system domain, it is rejected as invalid. Using this approach, the author feels a sense of pride at having built a system that uses natural language to query the database. The built system is somewhat simplistic, but it can be extended by enlarging the system domain so that it accepts more queries. Future work could include developing a Graphical User Interface and perhaps a semantic grammar as well.

References

[1] P. Bocij, D. Chaffey, A. Greasley, and S. Hickie. Business Information Systems. Prentice Hall, third edition.
[2] D.E. Avison and G. Fitzgerald. Information Systems Development: Methodologies, Techniques and Tools. McGraw-Hill, second edition.
[3] NLTK. Introduction to Natural Language Processing. Last accessed 5/12/2007.
[4] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modelling Language User Guide. Addison Wesley, second edition.
[5] CMS. Selecting a Development Approach. proach.pdf. Last accessed 5/12/2007.
[6] R. Dale, H. Moisl, and H. Somers. Handbook of Natural Language Processing. Marcel Dekker, Inc.
[7] J. Dibble and B. Zon. Nineteenth-Century British Music Studies. Ashgate, volume 2.
[8] Concert Life in Nineteenth-Century London Database. School of Art, Publishing and Music, Oxford Brookes University.
[9] Concertlifeproject. Concert Life in 19th-Century London Database and Research Project. Last accessed 7/12/2007.
[10] T. Connolly and C. Begg. Database Systems: A Practical Approach to Design, Implementation, and Management. Addison Wesley, fourth edition.
[11] I. Androutsopoulos, G.D. Ritchie, and P. Thanisch. Natural Language Interfaces to Databases: An Introduction. Last accessed 12/03/

[12] C.D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. Massachusetts Institute of Technology.
[13] Crossroads. Getting Started on Natural Language Processing with Python. Last accessed 20/04/2008.
[14] A. Zamel. Structured Sentences Text Editor. Last accessed 10/03/2008.
[15] Natural Language Toolkit. Last accessed 20/04/2008.
[16] ELF Software Documentation Series.
[17] Web Site Owner. Database Types. Last accessed 21/04/2008.
[18] Paul Bourke. CSV Comma Separated Value.
[19] MontyLingua. A free commonsense-enriched Natural Language Understander for English. Last accessed 20/04/2008.
[20] Advanced Knowledge Technologies. General Architecture for Text Engineering. Last accessed 19/04/2008.
[21] Pendar. Last accessed 15/03/2008.
[22] The University of Edinburgh School of Informatics. Last accessed 7/03/

Appendix A

Personal Reflection

At the beginning of this project I did not understand very well what I was supposed to build. I spent a significant amount of time trying to understand what the project was all about. Then "Eureka!" I shouted, "I've got it now, I understand what the project is all about." How wrong I was. I spent a long time searching for information that did not have anything to do with my project, and it took me quite some time to realise this. I only realised that I did not understand the problem when I wrote a program in Python to perform what I thought was part of my project. The results were so minimal that I asked myself, "Is that it?" The program contained about 30 or 40 lines of code. Following this I decided to really sit down and go through the project description, and that is when I realised that I had completely misunderstood the project requirements. With the mid-term report deadline approaching, I had to spend sleepless nights doing my background research and writing up my mid-term report as quickly as possible. To add fuel to the fire, I fell ill, and I had to struggle to complete the report. During the second semester I spent a lot of time building the system. I am not particularly good at programming, so it took me ages to build the system. I have learned a lot from this experience, particularly about time management, research methods and discipline. If I had to restart the project, I would definitely start early, choose a project methodology as early as possible and design a timetable with milestones. Sticking to the schedule and following the project methodology are the most important parts of the project. I feel a sense of pride at having done this project. I would strongly advise future students to spend more time understanding the project description and to start their background research as soon as possible. This has taught me a lesson, though I have benefited from it.