
A Virtual Database Management System For The Internet

Alberto Pan, Lucía Ardao, Manuel Álvarez, Juan Raposo and Ángel Viña
University of A Coruña, Spain
{alberto,lucia,mad,jrs,avc}@gris.des.fi.udc.es
Address: Dpto. Electrónica y Sistemas. Campus de Elviña S/N, Universidad de A Coruña. Spain
Tlf: Ext Fax:

Abstract

Virtual Databases (VDBs) differ from standard databases in that the data are not actually stored in the database. Instead, the data can be stored remotely in several heterogeneous semi-structured sources. Virtual databases offer a uniform way to query and integrate this information. We present a VDB system that focuses on reusing the public information available on the World Wide Web, providing programmers with an easy and quick way to use that information.

1. INTRODUCTION

The World Wide Web has become a huge repository for all kinds of information. Many applications would benefit substantially if they could query this repository easily and efficiently. However, WWW information is written in HTML pages, which are human-readable through a web browser but not machine-readable in a straightforward manner. This is due to the lack of semantic capabilities in HTML, and because machine readability is not usually a concern for HTML authors when they create pages.

Nevertheless, much of the WWW information is not completely unstructured. Many web sites provide information in a semi-structured way. Typical examples include HTML tables, outputs from on-line search forms, etc. Virtual Databases [1] provide a way to benefit from this huge repository. We present a Virtual Database Management System (VDBMS) that has proven useful in many real-world applications [2] [3] [4].
Application developers can easily create a Virtual Database in our VDBMS by specifying the table structure of the database as they would in a standard DBMS. They then specify the web sources from which the data will be extracted, along with a simple description of each source. The VDBMS automatically generates wrappers that provide transparent access to these sources, so that programmers can write standard database queries to access the data.

Key issues in building a VDBMS are performance and how easily programmers can automatically generate wrappers for the desired information sources.

To improve system performance, we rely on asynchronous multi-threaded operation and on a cache system. Besides normal cache operation, our cache system can transparently pre-load the most frequently requested data and answer queries by filtering the results of previously cached ones.

For easy and quick wrapper generation, we have developed an innovative tool that lets users write specifications describing the sources in a simple language. The entire process of adding a new information source often takes no more than 5-10 minutes.

As noted above, our VDBMS has already been used successfully in a number of real-world applications, including the first comparative shopping tool in the Spanish Internet [2] and several projects to provide web content to Internet enterprises and audience sites in domains such as traffic, flights, tourism and financial products comparison [3] [4] [11] [12].

Section 2 of this paper gives an overview of the system architecture. Section 3 shows how to create a table in the VDBMS, including the process needed to fill the table with data from remote web sources. Section 4 focuses on the wrapper generation tool. Section 5 focuses on the cache system. Section 6 shows some real-world examples using the VDBMS. Section 7 lists the conclusions of this work and outlines future improvements.

2. SYSTEM ARCHITECTURE

Figure 1 shows a schema of the architectural components of the VDBMS. Section 2.1 shows how the components of the architecture interact to answer a query against a table in the VDBMS.

[Figure 1: VDBMS architecture. Labels shown: Application Program, Query Interpreter, Query Language, Data Dictionary, Query Engine, Cache, Filter Engine, Wrapper 1, Wrapper 2, ..., Wrapper n, Specification Language, Source 1, Source 2, ..., Source n.]

2.1. Answering queries

The application program can make queries against the VDBMS in a specific language. Our query language is currently quite simple. Queries are restricted to a single table, and a typical query looks like this:

(field operator {value1,...,valuen}) relational-operator (field operator {value1,...,valuen}) sort/group-operator {field1,...,fieldn}

For instance, given a table named BOOK with fields TITLE (String), AUTHOR (String) and PRICE (Money), we could write a query like this:

(title contains {"java", "xml"}) and (author contains {"Rick Smith"}) and (price lessthan {(30,EURO)}) sortby asc {price}

to retrieve the rows of the table representing books with the words java and xml in their title, written by Rick Smith and priced under 30 Euros, sorted by ascending price.

The query interpreter is in charge of parsing the query and transforming it into an internal format. The Data Dictionary of the VDBMS is accessed at this point to ensure consistency between the query and the table schema.

The query engine then starts to resolve the query. If the cache is activated for the queried table, the query is sent to the cache system. If the cache system is able to resolve the query, it returns the results to the query engine, which can return them to the invoking application. See section 5 for a more detailed explanation of this process.

If the cache is not activated or cannot answer the query, the query engine looks in the Data Dictionary for the sources relevant to this query. Usually the relevant sources for a query against a certain table are all the sources associated with that table, but it is possible to choose alternative sources depending on the fields of the table involved in the query. The query engine then dynamically creates one wrapper for each relevant source.
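To make the meaning of the example BOOK query concrete, the following minimal Python sketch evaluates the same three conditions and the ascending sort over a handful of in-memory rows. It is only an illustration under stated assumptions: the names Condition, matches and run_query, the (amount, currency) representation of Money values, and the sample rows are all hypothetical and are not part of the VDBMS, whose internal query format is not described here.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Condition:
    field: str      # column name, e.g. "title"
    operator: str   # "contains" or "lessthan" (the only two handled here)
    values: tuple   # the {value1,...,valuen} part of the condition


def matches(row, cond):
    """Return True if one row satisfies one condition."""
    value = row[cond.field]
    if cond.operator == "contains":
        # Every listed word must occur in the field (case-insensitive).
        return all(word.lower() in value.lower() for word in cond.values)
    if cond.operator == "lessthan":
        # Assumption: Money values are modelled as (amount, currency) pairs.
        limit, currency = cond.values[0]
        amount, row_currency = value
        return row_currency == currency and amount < limit
    raise ValueError(f"unsupported operator: {cond.operator}")


def run_query(rows, conditions, sort_field=None, ascending=True):
    """AND all conditions together, then optionally sort the result."""
    selected = [row for row in rows if all(matches(row, c) for c in conditions)]
    if sort_field is not None:
        selected.sort(key=lambda row: row[sort_field], reverse=not ascending)
    return selected


# Made-up sample rows standing in for the BOOK table.
BOOKS = [
    {"title": "Java and XML", "author": "Rick Smith", "price": (Decimal("25.00"), "EURO")},
    {"title": "Java and XML Unleashed", "author": "Rick Smith", "price": (Decimal("19.50"), "EURO")},
    {"title": "Java in Depth", "author": "Rick Smith", "price": (Decimal("10.00"), "EURO")},
    {"title": "XML with Java", "author": "Ann Jones", "price": (Decimal("12.00"), "EURO")},
    {"title": "Advanced Java and XML", "author": "Rick Smith", "price": (Decimal("45.00"), "EURO")},
]

# The example query from the text, written as data.
QUERY = [
    Condition("title", "contains", ("java", "xml")),
    Condition("author", "contains", ("Rick Smith",)),
    Condition("price", "lessthan", ((Decimal("30"), "EURO"),)),
]

for book in run_query(BOOKS, QUERY, sort_field="price"):
    print(book["title"], book["price"][0])
```

Only the two books that satisfy all three conditions are printed, cheapest first.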
To create the wrapper, the query engine looks in the Data Dictionary for the specification associated with the source (written by the table creator when the source was added). This specification is the input to the wrapper generation tool, which automatically generates the wrapper for the source (see section 4 for details and an example of a specification for adding a web source).

The wrapper is then in charge of obtaining the partial results of the query provided by its source. Therefore, the wrapper must be able to reformulate the query in terms of the remote source (so that the remote source understands it), and it must also be able to understand the output format of the source in order to map that output to the table structure. When the remote sources are web sources (the usual case), making a query against the source means automatically filling in some kind of web form and executing an HTTP GET or POST operation against the web server of the source (in that sense, a VDBMS can be seen as a sophisticated type of metasearch engine). Understanding the source results means parsing the HTML (and sometimes XML or Javascript) returned by the search to extract the items found (for instance, parsing the books returned by the search form of an online bookshop).

Obviously, wrapper generation is a key point. If a VDBMS is to be a powerful tool, it is very important that wrappers can be created easily. That is, writing the specification for adding a given web source should be easy and quick. Section 4 explains our wrapper generation system in more detail, but we want to remark that our wrapper generation tool has proven powerful and easy to use. So far, we have extracted information from more than 250 different web sources in many application domains (see section 6 for some remarkable examples). The specifications are usually written by non-programmers (only some knowledge of HTML and HTTP concepts is needed to add new sources to our system), and the typical time for adding a source is between 10 and 20 minutes.

As the query engine receives each result from the wrappers, the filter engine ensures that the results are consistent with the query and with the table schema. This is needed because wrappers are not forced to return completely valid results in certain cases. For instance, to run the previous query about books on java and xml written by Rick Smith and priced under 30 euros directly against a remote source, that source would need a web form that lets users search the books in its database by title, author and price. But many online bookshops do not have such detailed search interfaces. For instance, in many bookshops users can search only by title or by author but not by both, or perhaps it is possible to search by both title and author but not by price.
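The interplay between a source with a limited search form and the filter engine can be sketched as follows. This is a simplified illustration, not the system's implementation: split_conditions, satisfies, answer, the tuple encoding of conditions, the plain numeric prices and the in-memory CATALOGUE standing in for a remote bookshop are all assumptions made for the example.

```python
def split_conditions(conditions, searchable_fields):
    """Separate the conditions the source's form can express from the rest."""
    pushed = [c for c in conditions if c[0] in searchable_fields]
    residual = [c for c in conditions if c[0] not in searchable_fields]
    return pushed, residual


def satisfies(row, condition):
    """Check one (field, operator, value) condition against one row."""
    field, operator, value = condition
    if operator == "contains":
        return all(word.lower() in row[field].lower() for word in value)
    if operator == "lessthan":
        return row[field] < value
    raise ValueError(f"unsupported operator: {operator}")


def answer(conditions, searchable_fields, search_source):
    """Send the source only what it supports, then filter locally."""
    pushed, _residual = split_conditions(conditions, searchable_fields)
    candidates = search_source(pushed)  # wrapper: a more general query
    # Filter step: re-check every condition so the final rows match the query.
    return [row for row in candidates if all(satisfies(row, c) for c in conditions)]


# Made-up catalogue standing in for a remote bookshop.
CATALOGUE = [
    {"title": "Java and XML", "author": "Rick Smith", "price": 25.0},
    {"title": "XML for Java Developers", "author": "Ann Jones", "price": 20.0},
    {"title": "Java, XML and the Web", "author": "Rick Smith", "price": 42.0},
    {"title": "Learning Python", "author": "Rick Smith", "price": 15.0},
]


def title_only_source(pushed):
    """Hypothetical source whose search form only has a title box."""
    return [row for row in CATALOGUE if all(satisfies(row, c) for c in pushed)]


QUERY = [
    ("title", "contains", ["java", "xml"]),
    ("author", "contains", ["Rick Smith"]),
    ("price", "lessthan", 30),
]

for row in answer(QUERY, {"title"}, title_only_source):
    print(row["title"], row["author"], row["price"])
```

In this sketch the source returns three candidate books for the title-only search, and the local filter keeps the single one that also matches the author and price conditions.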
In such situations, the wrapper chooses to issue a more general query and lets the filter engine remove the unwanted results. For instance, the wrapper could search only by title, knowing that the filter engine will remove the results that do not match the author and price search criteria.

With all the results returned by the wrappers, the query engine constructs a result set. If the query contains sort or group-by operations over the data, the query engine uses the filter engine to execute them over the result set. It is important to note that the VDBMS can also operate asynchronously (see section 5 for details). That means that the application can access the result set before it is complete (so the application need not wait for all the results before processing those already received). Finally, the result set is returned to the application program. If the cache is activated, the result set is also stored in the cache.

3. CREATING A TABLE IN THE VDBMS

The process of creating a table in our VDBMS involves two steps: (1) define the table schema and (2) configure wrappers to search and extract data from web sources.

Step (1) is not very different from the process of creating a table in a standard DBMS. A table schema consists of a list of fields. Each field has a data type and can have associated constraints. Some data types currently available in our VDBMS are strings, integers, long integers, money, date, URL, etc. Some constraints currently available are: uniqueness, field not null, field not searchable, must exist and be accessible (only applicable to URL fields), etc.

Step (2) consists of configuring wrappers for each data source. For each source a wrapper must know: 1) how to search in the source and 2) how to understand the results of a search. Both (1) and (2) are carried out using a graphical web administration tool, and no code at all is required.

4. WRAPPER GENERATION TOOL

In this section we describe the wrapper generation tool, which is able to generate wrappers around semi-structured web sites without using any domain-specific heuristics (we have conducted several successful tests of the tool with many web information sources in different domains). A wrapper for a web source must be able to do two different things with the source: 1) search in it and 2) understand the results of a search.

1) requires the ability to automatically generate and submit HTTP web forms. When a user of the VDBMS needs to add a new source, he/she provides our wrapper generation tool with the URL of the page where the desired web form is located. The tool automatically finds the web forms in the page and presents them to the user. The user can then associate the fields of the web form with searchable fields of the table (for instance, the user would associate the field for searching by title in an on-line bookshop web form with the field title of the table BOOK). The tool then does the rest of the work and generates a URL pattern that will be used to search in the source. The user can also associate fields in the form with operators in the VDB. For instance, he/she can associate a "search by keyword" checkbox in an HTTP form with the containskeyword operator of the VDB.
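As an illustration of what filling in a URL pattern for a GET form might look like, the sketch below maps table fields to form parameter names and encodes a search URL with Python's standard urllib.parse.urlencode. Everything specific in it is an assumption made for the example: the function build_search_url, the host bookshop.example, the parameter names q_title and q_author, and the fixed mode=keyword parameter (standing in for a checkbox associated with a VDB operator) are hypothetical and do not come from the tool itself.

```python
from urllib.parse import urlencode


def build_search_url(action_url, field_to_param, fixed_params, query_values):
    """Build a GET search URL from table-field values.

    field_to_param: mapping from table field name to form parameter name.
    fixed_params:   form parameters that always take the same value.
    query_values:   mapping from table field name to the value searched for.
    """
    params = dict(fixed_params)
    for table_field, value in query_values.items():
        # Fields the form cannot search by are skipped here; the results
        # would have to be filtered locally on those fields afterwards.
        if table_field in field_to_param:
            params[field_to_param[table_field]] = value
    return action_url + "?" + urlencode(params)


# Hypothetical mapping for an on-line bookshop search form.
url = build_search_url(
    "http://bookshop.example/search",
    field_to_param={"title": "q_title", "author": "q_author"},
    fixed_params={"mode": "keyword"},
    query_values={"title": "java xml", "author": "Rick Smith", "price": "30"},
)
print(url)
```

For a form submitted with POST, the same encoded parameters would be sent in the request body instead of being appended to the URL.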
2) requires parsing the result of one or more HTTP requests (usually HTML pages) and extracting the obtained results from them. Our parser generation tool can be used, even by users without technical skills, to write specifications describing the external appearance in a web browser of a pattern of information to be extracted from a set of HTML pages. The tool is then able to generate a wrapper that automatically extracts information according to this pattern from the specified pages. Therefore, the tool avoids the need to write a specific parser for extracting the desired information.

We will consider a simple example to illustrate the use of the extraction tool. As our example we chose the Internet bookshop Amazon [5]. Figure 2 shows a snapshot of the answer of the Amazon Internet bookshop to the query "books which contain the word java in their title". The specification for extracting information from this page is shown in Figure 3.

[Figure 2: Amazon snapshot]

ANCHOR (TITLE) ~ IRRELEVANT? EOL
AUTHOR / IRRELEVANT EOL
Our Price : $ PRICE[CURRENCY=DOLLAR] ~ IRRELEVANT? EOL

Figure 3: AMAZON specification

With this specification and the example page shown, our tool will find an instance of the pattern for each book in the results page. The idea is that the user writing the specification tries to reproduce the visual appearance of the pattern to be matched. In the current state of the tool, the reserved word ANCHOR indicates an HTML link and EOL indicates an end of line. Names such as TITLE or PRICE are character strings naming the attributes that we want to obtain from the occurrences of the pattern in the page. We call these reference-names. For each instance of the pattern found, the tool produces a sequence of tuples matching each reference-name in the specified pattern with the actual value found in the page. For instance, the first book in the results page would make the tool match the pattern with the following tuples: { (TITLE, "Abstract data types in Java"), (AUTHOR, "Michael S. Jenkins"), (PRICE, 40.46, CURRENCY=DOLLAR) }.

Some other features of this example are worth pointing out:

1) It is possible to embed application-specific meta-information in the specifications by enclosing it between [ and ] immediately after a reference-name. For instance, we write PRICE[CURRENCY=DOLLAR] in the last line of our specification. For the PRICE reference-name the tool generates a 3-tuple of the form (PRICE, the-extracted-price, CURRENCY=DOLLAR). It is the responsibility of the application to use the context information of a reference-name correctly.

2) IRRELEVANT is a reserved name used to represent attributes inside the pattern that are not relevant for our purposes and so should not generate a pair in the output. For instance, here we suppose that the availability information provided in the first line of the pattern is not relevant for our purposes and we do not want it to be returned.

3) We can use string separators to divide the text items inside the pattern. For instance, in the second line of the pattern we use the separator / to separate the author information from the other text on the same line, which we suppose to be irrelevant for our purposes.

4) It is usual to find patterns with optional parts. For instance, in Amazon the availability information appears only in some of the results. We can enclose these optional parts between the ~ and ? characters.

There are many more features that we do not describe in detail here for reasons of simplicity and space. We mention some of them:

1) We can extract information from multi-page outputs by traversing HTML anchors, with hierarchical capabilities. For this purpose we can define sub-specifications inside the main specification.

2) Support for multi-valued items when the number of values is variable. For instance, the main actors in a movie or the authors of a book (note that this feature was not used in the Amazon example).

3) We can apply operators to the reference-names to transform the values assigned to the attribute or to filter out certain matched patterns.

4) We can write alternate specifications for information sources that use different answer formats for the same query depending on the number and kind of results obtained.

5) It is possible to assign default values to attributes.

6) Etc.

Design and implementation overview

Internally, our tool is divided into two modules. Both use the tool JFlex [6] to generate scanners. For parsing, we have built our own parser tool. The first module parses the specification written by the user and generates an internal representation of it.
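As a rough illustration of the kind of output the Figure 3 specification yields, the sketch below hand-translates that pattern into a Python regular expression and applies it to a small text fragment. This is not how the tool works internally (it uses generated scanners and its own parser, and operates on what the browser would display); the regular expression, the function extract, the simplified page layout and the second sample book are assumptions made purely for the example.

```python
import re

# Hand-written approximation of the three-line pattern in Figure 3.
PATTERN = re.compile(
    # Line 1: a link whose text is the TITLE, then optional irrelevant text.
    r'<a\s[^>]*>(?P<TITLE>[^<]+)</a>[^\n]*\n'
    # Line 2: AUTHOR, the "/" separator, then irrelevant text.
    r'(?P<AUTHOR>[^/\n]+?)\s*/[^\n]*\n'
    # Line 3: the literal "Our Price : $", the PRICE, optional irrelevant text.
    r'Our Price\s*:\s*\$\s*(?P<PRICE>\d+(?:\.\d+)?)[^\n]*'
)


def extract(page):
    """Return one list of tuples per pattern instance found in the page."""
    results = []
    for match in PATTERN.finditer(page):
        results.append([
            ("TITLE", match.group("TITLE").strip()),
            ("AUTHOR", match.group("AUTHOR").strip()),
            # The meta-information travels with the value, as in the 3-tuple.
            ("PRICE", match.group("PRICE"), "CURRENCY=DOLLAR"),
        ])
    return results


# Simplified, made-up results page; only the first book comes from the text.
PAGE = """<a href="/book/1">Abstract data types in Java</a> Usually ships in 24 hours
Michael S. Jenkins / Paperback
Our Price : $40.46
<a href="/book/2">Example Java Book</a>
A. N. Author / Hardcover
Our Price : $10.00 You save 20%
"""

for item in extract(PAGE):
    print(item)
```

A hand-written expression like this has to be rewritten for every source, which is the effort the specification language is meant to avoid.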
The second module uses the internal representation of the specification to actually extract the information from the source. The scanner divides the source into tokens. The parser looks for patterns and also checks that the text items are correctly structured according to the specification. The parser also has to deal with optional parts of the specification, multi-valued items, etc. A higher layer governs more advanced behaviours such as traversing links in multi-page outputs.

5. PERFORMANCE AND CACHE

Performance is one of the key issues involved in building a VDBMS. Very often the data accessible through a VDBMS are extracted from remote sources, and therefore performance is a major concern. In this section we outline our main strategies for improving performance in our VDBMS.

When data are extracted from remote sources, it is often desirable not to wait for the entire collection of data to be extracted before returning some results to the application.

For this reason, our system can operate in an asynchronous, multi-threaded manner when extracting information from remote web sources. The system starts one thread for each information source. Each thread has a maximum lifetime, and a thread that exceeds it is suspended. Results are available to the application as soon as they are extracted. That means that the result set of the query can be accessed before the query has actually finished. Applications can also perform asynchronous filtering and ordering operations. For instance, an application can execute a sort operation over the incomplete result set of an unfinished query. The results available at the moment the operation is executed will be sorted. New results will be added at the end of the result set as they arrive. The application could execute a new sort operation to sort the result set again when the query is complete.

Another key element for improving the performance of the VDBMS is the cache system. If the cache is activated for a table of the VDBMS, the result of a previous query can be used to answer a later one. Starting from a more general query that has an entry in the cache, the cache is able to apply filtering to answer less general queries without extracting the data from the remote sources again. For instance, we can have a table BOOK filled with data extracted from the main Internet bookshops. Suppose the VDBMS receives the query (TITLE Contains "java") AND (TITLE Contains "xml"). If there is an entry in the cache corresponding to the query (TITLE Contains "java"), the cache will answer the query by applying a filter to this cache entry to obtain the results that also have the word xml in their title, thereby avoiding the need to extract the data from the remote sources.

The cache entries have a configurable lifetime. To achieve a higher cache hit rate, the system pre-loads frequently requested data whose cache entries have timed out.
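The idea of answering a query from the cached result of a more general one can be sketched in a few lines of Python. This is a minimal illustration, not the system's cache: the class QueryCache, the representation of a query as a set of (field, word) pairs that are all "Contains" conditions joined by AND, and the sample rows are assumptions, and entry lifetimes and pre-loading are deliberately left out.

```python
class QueryCache:
    """Cache that can answer a query by filtering a more general entry."""

    def __init__(self):
        # Maps a frozenset of (field, word) conditions to the cached rows.
        self._entries = {}

    def store(self, conditions, rows):
        self._entries[frozenset(conditions)] = list(rows)

    def lookup(self, conditions):
        """Return matching rows, or None if no cached entry is general enough."""
        wanted = frozenset(conditions)
        for cached_conditions, rows in self._entries.items():
            # A cached query is usable if all of its conditions also appear
            # in the new query, i.e. it is at least as general.
            if cached_conditions <= wanted:
                extra = wanted - cached_conditions
                return [
                    row for row in rows
                    if all(word.lower() in row[field].lower() for field, word in extra)
                ]
        return None


cache = QueryCache()

# Made-up result of an earlier query: (TITLE Contains "java").
cache.store(
    {("title", "java")},
    [
        {"title": "Java and XML"},
        {"title": "Java in Depth"},
        {"title": "XML with Java"},
    ],
)

# Later query: (TITLE Contains "java") AND (TITLE Contains "xml").
hit = cache.lookup({("title", "java"), ("title", "xml")})
print([row["title"] for row in hit])

# A query that no cached entry covers would have to go to the remote sources.
print(cache.lookup({("title", "python")}))
```

The subset test is what makes the cached entry safe to reuse: every row the narrower query could return is already among the rows of the more general one.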
Data are also pre-loaded to obtain data from sources that failed in the past because of a network error or source-server congestion. Pre-loads can be scheduled by the system administrator so that they are executed when the system workload is low. The cache system can operate over a single table only or over the entire Virtual Database. Multiple servers can share the cache by persisting the cache entries in a shared storage space. This also serves as a second-level cache (the first level of the cache is stored in the memory of each server in order to achieve lower response times).

6. REAL WORLD EXAMPLES

The VDBMS described in this paper has already been used successfully in several real-world applications. Some examples are:

- The first comparative shopping tool in the Spanish Internet [2]
- An MP3 search engine [3]
- A web service for comparison of financial products [4]

The comparative shopping tool defines a table for each type of product that can be searched. To fill the table of a product (e.g. books) with data, we extract information from the main Internet shops around the world that sell that product. It is then possible to make queries against a table to obtain the products satisfying certain conditions. Results can be filtered and sorted according to criteria such as price, shipping fees, delivery times, etc. This application was developed for the Spanish search engine Biwe and is currently accessible on its website. It will also be included soon in other well-known Spanish audience sites.

The MP3 search engine acts as a metasearch engine over the main MP3 crawlers on the Internet. In this case the VDB has a single table containing MP3 files. A special filter was added to this application: if required, the system can check that the MP3 files really exist on the server, since missing files are a very common problem when downloading this kind of file.

Following the idea of comparing products of the same domain, we have also built a comparison tool for financial products. This tool looks for financial products such as deposits and mortgages and compares them across the main banks and financial entities with a presence in Spain. This application was developed for the Spanish bank ebankinter and is currently accessible on its website.

Besides these examples, our VDBMS has been used to provide web content to Internet enterprises and audience sites from many different domains and web information sources, such as traffic, flights, employment, auctions, entertainment, travel, financial information and so on.
Some examples are DGT [7], the Spanish Directorate-General for Traffic; AENA [8], flight information for all the Spanish airports; InfoJobs.Net [9], a complete Spanish employment exchange; eBay auctions [10]; the Miami Herald newspaper [11] (entertainment information such as movies, theater, concerts, nightlife, restaurants, etc.); and the NASDAQ Stock Market [12].

7. CONCLUSIONS AND FUTURE WORK

We have presented a Virtual Database Management System that focuses on reusing the public information available on the World Wide Web, providing programmers with an easy and quick way to use that information in their application programs. In this way, programs can benefit from the huge amounts of useful semi-structured information available on the World Wide Web.

Our VDBMS lets programmers define the tables of a Virtual Database and fill them with data extracted from remote web sources. To add a web source, a simple specification describing the source is needed. With our innovative wrapper generation tool, this process can be carried out by non-programmers in minutes for average sources.

To improve performance, our VDBMS includes a cache system that is able to pre-load useful queries and to answer new queries by filtering the results of more general previous ones.

Our future work includes improving our query language with more complex structures such as joins between different tables, making it more similar to object database query languages such as OQL. We are also improving our wrapper generation tool to include support for sources with complex Javascript.

REFERENCES

[2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]