WEBVIEW: An SQL Extension for Joining Corporate Data to Data Derived from the World Wide Web

Charles A. Wood and Terence T. Ow
Mendoza College of Business
University of Notre Dame
Notre Dame, IN 46556-5646
cwood1@nd.edu
ow1@nd.edu

ABSTRACT

Researchers point out that the World Wide Web is a rich source of data that can be used to generate more knowledge. In this research, we extend SQL with a new Webview construct that allows ad hoc joins from a database to data found on the Web using ANSI-standard SQL. We also develop a tool that implements this language and, using this tool, show how the proposed Webview construct can join data from Web pages and databases. The tool can dynamically gather data from the Web for use within corporate databases, research data sets, and knowledge management repositories.

Keywords: Agents, Data Mining, Databases, SQL, Web Data Retrieval
INTRODUCTION

Knowledge management (KM) within an organization is often considered a way to increase competitive ability (Nonaka 1994). However, KM has lately not been well received within many corporations. A Bain & Company report (Rigby 2001) evaluated 25 different types of tools. Of these 25 tools, KM tools ranked 24th in satisfaction. The report also shows that KM software has a relatively high defection rate of 13%. The primary reasons are the expense (Horwitch and Armacost 2002) and the difficulty of acquiring and disseminating new knowledge (Davenport and Prusak 1998). Consequently, many researchers have advocated data mining of external data sources to supplement organizational knowledge (e.g., Chung and Gray 1999).

It has been established that programs can be written to retrieve and store data from the Web (e.g., Kauffman, March, and Wood 2000). However, developing and running these programs is quite complicated. Large programming efforts and high maintenance costs are duplicated across corporations to achieve similar or identical results. Also, data retrieved by such techniques is static. Figure 1 contrasts the two approaches. In the static representation, a programmer collects data from the Web and stores the data, as it existed at that particular time, in the corporate database. In the dynamic representation, ad hoc queries inside the database retrieve Web information in different formats depending upon the user's needs. Therefore, new information that becomes available on the
Web is returned by these ad hoc queries instead of the static copies that were stored. In addition, the external information is not stored explicitly in the database, so current information is always available when queried. However, as with traditional database views, SQL commands can transfer this information to permanent storage.

[Figure 1: Static versus Dynamic Representation of Web data. Static representation: Web page (HTML, XML), static retrieval of Web data, corporate database. Dynamic representation: Web page (HTML, XML), Webview (relational database), integrated view with the corporate database, organization user views (relational database).]
In this paper, we develop a Structured Query Language (SQL) extension that allows corporate databases to be joined to explicit information contained on any corporate or external Web site. Because the extension builds on existing SQL and database technology, implementation costs are minimal and users can seamlessly retrieve information from database/Web joins (see Figure 1). We seek answers to the following questions:

    Can we represent a Web page so that it is accessible to a corporate database through SQL language extensions, and if so, how?
    Can a tool be developed that implements these SQL language extensions, allowing easy data manipulation of Web pages?

We undertake three tasks here. The first task is to design a new principled extension to the SQL language, called a Webview, that allows transparent joins between database and Web data. The second task is to show that the Webview is robust, in that it can capture Web data of interest and identical uses of the extension yield identical results. The third task is to develop a tool that implements the extension as a proof of concept that the Webview is practical for use.

LITERATURE REVIEW

In this literature review, we examine two literature bases, Information Systems (IS) and Computer Science (CS). The review covers research on knowledge management and data mining, SQL access to HTML, and multi-database systems (MDBSs).

Knowledge Management and Data Mining

Most knowledge management literature centers on identifying sources of knowledge within a company, capturing tacit knowledge known only by one or a few employees, and converting that knowledge to explicit knowledge inside a knowledge repository of some sort (Nonaka 1994). Software tools that aid
knowledge management have been reported to be expensive and of questionable value (Horwitch and Armacost 2002). Mobasher, Cooley, and Srivastava (2000) describe how pattern matching alone is not sufficient for data mining: useful, high-quality information must be identified from these patterns. We build upon their research by creating database constructs that allow ad hoc queries of patterns, thus allowing dynamic retrieval of the data patterns that are deemed useful. Chung and Gray (1999) explain how knowledge management, data warehousing, and data mining all work in conjunction with each other, and how the Web has added a new dimension to knowledge management by facilitating the acquisition of new knowledge from external sources. We add to this literature by developing a language and tool that facilitate data collection and join the collected data to existing database information.

SQL and HTML

Structured Query Language (SQL) is the language used by most databases and has been advocated as a means to access specific Web data (e.g., Deutsch et al. 1999). SQL is said to be relationally complete in that it can express any query supported by predicate (or relational) calculus (Codd 1972). By tightly coupling Web data to SQL through SQL extensions, we retain the benefit of relational completeness (since SQL itself is relationally complete) and are left with the simpler task of ensuring that our SQL extension is robust, in that it is sufficient to capture all Web data, including hierarchical representations (e.g., XML) and relational representations (e.g., links). An SQL extension also ensures that users can access Web data transparently, so that Web data is accessible to any SQL-based tool [1]. Thus far, no single proposed tool for data mining has addressed the challenges of SQL transparency and robustness.

[1] The transparency condition requires that any SQL statement, such as SELECT, remain unaltered when accessing the new Webview construct.

MDBS

Many articles discuss SQL extensions, mainly in the area of MDBSs that can access disjoint relational SQL databases (e.g., Krishnan et al. 2001). Lakshmanan, Sadri, and Subramanian (2001) advocate five required features for SQL extensions: (1) the language must have expressive power that is independent of the schema with which the database is structured, (2) the language must allow restructuring of one database to conform to the schema of another, (3) the language must be easy to use yet sufficiently expressive, (4) the language must provide full capabilities that are downward compatible with SQL, so that existing SQL will function properly in the presence of the MDBS, and (5) the language must be efficiently implementable. We build upon Lakshmanan, Sadri, and Subramanian's work by proposing AgentSQL, which incorporates these five requirements into a Webview: (1) it must have expressive power that is independent of HTML, XML, or other Web-based markup languages, (2) it must allow the restructuring of Web data to conform to a database schema, (3) it must be shown to be sufficient to capture any Web data, including XML or HTML, (4) it must function like existing database constructs to allow transparency for the database developer, and (5) it must be efficiently implemented.

SQL WEBVIEW EXTENSION FOR AGENTSQL

The CREATE WEBVIEW command for creating ad hoc queries is displayed below. Table 1 summarizes the CREATE WEBVIEW clauses, which can be used in any order, except that the COLUMN clause must follow the applicable ROW or NESTED ROW clause and the CREATE WEBVIEW clause must occur first.
To test the viability of CREATE WEBVIEW, we piggy-back our engine on top of an existing Open Database Connectivity (ODBC) database manager. The engine utilizes virtual tables, and the corresponding SQL statements are then sent to the database engine through the ODBC manager. Thus, CREATE WEBVIEW can be tested with any database that supports ODBC or has third-party ODBC support (e.g., Oracle, Sybase, SQL Server, Access, etc.) [2]. The following is the skeleton for the Webview scheme. Square brackets mark optional elements, and braces enclose alternatives separated by vertical bars.

    CREATE WEBVIEW schemaname [AS]
    USING { (URLExpression) | (SELECT statement) }
    [VARYING var1 [FROM start] [BY increment] TO finish,
           [var2 [FROM start] [BY increment] TO finish, ] ]
    [REPLACE[S] ( findhtml, replacehtml ),
              [ ( findhtml, replacehtml ), ] ]
    [KEY ( htmlbegin, htmlend ) ]
    [TRIM ( htmlbegin, htmlend ) ]
    [LINK [INCLUDE { HOST | PATH | LEFT | RIGHT | BOTH } ] ( htmlbegin, htmlend ),
        [ [INCLUDE { HOST | PATH | LEFT | RIGHT | BOTH } ] ( htmlbegin, htmlend ), ] ]
    ROW { PAGE | ( htmlbegin, htmlend ) }
    COLUMN[S] { Colname Datatype ( htmlbegin, htmlend )
              | Colname PAGE
              | Colname ROW
              | Colname URL
              | Colname KEY
              | Colname RETRIEVETIME
              | Colname ROWNUM
              | Colname EXISTS ( htmlexists ) }, Colname2
    [NESTED [ROW] ( htmlbegin, htmlend )
        [NESTED [ROW] ( htmlbegin, htmlend ) ] ] ;

[2] Thus far, only Access and SQL Server have been tested.
Table 1. CREATE WEBVIEW Command Clauses

Clause            Description

CREATE WEBVIEW    Indicates the start of the Webview definition.

USING             Defines the Web pages that will be accessed, either via a string literal or a SELECT statement.

LINK              Defines URLs contained in one Web page that can be used to access another (identically formatted) Web page, allowing relational joins across linked pages. The INCLUDE sub-clause allows you to include parts of the current path in the link in case the retrieved link uses a relative path.

ROW               Defines each row as the text between each occurrence of a beginning and an ending text or HTML string. Within each ROW, COLUMNs are defined.

COLUMN            Defines a column within a row. The column name is listed first, followed by the data type and then the HTML text that precedes and follows the column value. Special data types include URL, PAGE, and KEY. EXISTS returns a Boolean TRUE/FALSE indicating whether the given text appears within a row.

NESTED            The NESTED (or NESTED ROW) clause indicates that hierarchical data exists that is subordinate to the preceding ROW or NESTED clause. XML fits this model, as does some HTML. Hence, NESTED does not indicate multiple row definitions within the same page, but rather a single row definition whose rows of data are arranged in a hierarchical fashion.

VARYING           Allows a loop within the URLExpression or SELECT statement of the USING clause.

REPLACE           Allows a replacement of HTML or text before processing begins, which can facilitate processing.

TRIM              Removes all text outside the boundaries defined by two strings.

KEY               Finds the first occurrence of a string within a page. (Can be used to find a Web page identifier.)

The tool shown below in Figure 2 takes SQL statements, including the new CREATE WEBVIEW extension, and passes these statements to an ODBC database engine. The AgentSQL tool provides proof of concept of the usability of the CREATE WEBVIEW statement and of its use in combination with existing SQL syntax.

Figure 2. AgentSQL Testing Tool
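The ROW and COLUMN clauses in Table 1 both identify data by the text that precedes and follows it. The Python sketch below illustrates that delimiter-based idea only; it is not the AgentSQL implementation, and the names between and extract_rows are hypothetical helpers introduced here. It assumes plain substring matching and returns None (standing in for NULL) when a column's delimiters are absent from a row.

```python
from typing import Dict, List, Optional, Tuple


def between(text: str, begin: str, end: str, start: int = 0) -> Optional[Tuple[str, int]]:
    """Return (substring, index just past `end`) for the first begin...end pair at or after `start`."""
    i = text.find(begin, start)
    if i < 0:
        return None
    i += len(begin)
    j = text.find(end, i)
    if j < 0:
        return None
    return text[i:j], j + len(end)


def extract_rows(
    page: str,
    row: Tuple[str, str],
    columns: Dict[str, Tuple[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Split `page` into rows by the row delimiters, then pull each column out of each row."""
    rows: List[Dict[str, Optional[str]]] = []
    pos = 0
    while True:
        hit = between(page, row[0], row[1], pos)
        if hit is None:
            break
        chunk, pos = hit  # pos now points past the row's closing delimiter
        record: Dict[str, Optional[str]] = {}
        for name, (begin, end) in columns.items():
            cell = between(chunk, begin, end)
            record[name] = cell[0] if cell else None  # None plays the role of NULL
        rows.append(record)
    return rows


if __name__ == "__main__":
    # Made-up HTML fragment, shaped like a list of search results.
    sample = (
        '<ul><LI><a href="http://a.example/x">A</a><BR>first<BR></LI>'
        '<LI><a href="http://b.example/y">B</a><BR>second<BR></LI></ul>'
    )
    for record in extract_rows(
        sample,
        ("<LI>", "</LI>"),
        {"Link": ('href="', '"'), "Description": ("<BR>", "<BR>")},
    ):
        print(record)
```

Each dictionary corresponds to one row of the virtual table. Data type conversion and the special column types (URL, PAGE, KEY, EXISTS) are left out of this sketch.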
Create WEBVIEW that Captures Data Sets that Span Several Web Pages

The following code shows how we can use the CREATE WEBVIEW AgentSQL statement to retrieve the results of an Excite search.

    CREATE WEBVIEW excite
    USING ("http://srch.excite.com/d/search/p/excite/index.jhtml?s=%22oledb+and+odbc%22")
    TRIM ("table width=760", "target.gif")
    ROW ("<LI>", "</LI>")
    LINK INCLUDE LEFT ("http://srch.excite.com", ">")
    COLUMN
        Link VARCHAR ("href=\"", "\""),
        Description MEMO ("<BR>", "<BR>"),
        WebPage URL,
        Host VARCHAR ("class=size8>", "<");

    SELECT * FROM excite;

The above code shows how a search string ("OLEDB and ODBC") can be used to retrieve the results shown in Figure 3. (The result could be longer with different searches. The search was made specific to limit the time spent on the site.) The resulting dataset spans four Excite Web pages and contains a total of 74 results.

Figure 3. Virtual Table Created From Spanning Excite Pages by the Above Code
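A dataset can span several pages because the LINK clause locates the URL of the next page inside the current one. The Python sketch below shows that page-following loop under stated assumptions: follow_links and next_link are hypothetical names, pages are supplied by a caller-provided fetch function rather than downloaded, and the standard library's urljoin stands in for the INCLUDE sub-clause when a link is relative. It is an illustration, not the AgentSQL engine.

```python
from typing import Callable, List, Optional
from urllib.parse import urljoin


def next_link(page: str, begin: str, end: str) -> Optional[str]:
    """Return the text between the first begin...end pair, or None if no link is found."""
    i = page.find(begin)
    if i < 0:
        return None
    i += len(begin)
    j = page.find(end, i)
    return page[i:j] if j >= 0 else None


def follow_links(
    start_url: str,
    fetch: Callable[[str], str],
    begin: str,
    end: str,
    limit: int = 50,
) -> List[str]:
    """Visit start_url, then keep following the delimited link until none remains."""
    visited: List[str] = []
    url: Optional[str] = start_url
    while url is not None and url not in visited and len(visited) < limit:
        visited.append(url)
        target = next_link(fetch(url), begin, end)
        # Resolve relative links against the current page, as INCLUDE is described to do.
        url = urljoin(url, target) if target else None
    return visited


if __name__ == "__main__":
    # Made-up pages standing in for a multi-page search result.
    pages = {
        "http://example.com/search?p=1": 'results 1 <a class="next" href="/search?p=2">Next</a>',
        "http://example.com/search?p=2": 'results 2 <a class="next" href="/search?p=3">Next</a>',
        "http://example.com/search?p=3": "results 3 (last page)",
    }
    for url in follow_links(
        "http://example.com/search?p=1",
        lambda u: pages.get(u, ""),
        'class="next" href="',
        '"',
    ):
        print(url)
```

The visited list and the limit parameter guard against link cycles and runaway crawls; whether the AgentSQL tool applies similar safeguards is not stated in the paper.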
Create WEBVIEW that Captures Hierarchical Data Sets (e.g., XML)

To be sufficient for the data-collecting task, the CREATE WEBVIEW statement needs to be able to retrieve hierarchical data from a Web page. The code below shows the XML used for instruction in an XML and B2B class at a midwestern university.

    <rentals>
      <rental custnum="12345" name="joe Teacher">
        <movie name="fast and Furious" due="2002-03-04"/>
        <movie name="scoobie Doo and the Witches Ghost" due="2002-03-06"/>
      </rental>
      <rental name="joe Student">
        <movie name="slapshot" due="2002-03-04"/>
        <movie name="blair Witch" due="2002-03-02"/>
      </rental>
    </rentals>

The following code shows how we can use the CREATE WEBVIEW AgentSQL statement to retrieve the results of XML similar to that shown in the code above.

    CREATE WEBVIEW movie
    USING ("http://www.nd.edu/movie.xml")
    ROW ("<rental ", "</rental>")
    COLUMN
        CustNum INT ("custnum=\"", "\""),
        CustName VARCHAR ("name=\"", "\"")
    NESTED ROW ("<movie", "/>")
    COLUMN
        MovieName VARCHAR ("name=\"", "\""),
        Due DATE ("due=\"", "\"");

The above code shows how the hierarchical nature of XML can be captured in a relational format by using the CREATE WEBVIEW statement with a NESTED clause. Notice that, in the XML above, the second rental (Joe Student) does not have a customer number. The AgentSQL tool sets this field to NULL.
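The effect of a NESTED ROW clause on the rental XML can be sketched in Python as a flattening step: every nested movie row is emitted together with the columns of its parent rental. The helper names cut, chunks, and flatten are hypothetical, values are left as strings rather than converted to INT or DATE, and restricting the parent columns to the text before the first nested row is an assumption of this sketch, not documented AgentSQL behavior.

```python
from typing import Dict, List, Optional

XML = """<rentals>
<rental custnum="12345" name="joe Teacher">
<movie name="fast and Furious" due="2002-03-04"/>
<movie name="scoobie Doo and the Witches Ghost" due="2002-03-06"/>
</rental>
<rental name="joe Student">
<movie name="slapshot" due="2002-03-04"/>
<movie name="blair Witch" due="2002-03-02"/>
</rental>
</rentals>"""


def cut(text: str, begin: str, end: str) -> Optional[str]:
    """Return the text between the first begin...end pair, or None if absent."""
    i = text.find(begin)
    if i < 0:
        return None
    i += len(begin)
    j = text.find(end, i)
    return text[i:j] if j >= 0 else None


def chunks(text: str, begin: str, end: str) -> List[str]:
    """Return every substring that lies between successive begin...end pairs."""
    out: List[str] = []
    pos = 0
    while True:
        i = text.find(begin, pos)
        if i < 0:
            break
        i += len(begin)
        j = text.find(end, i)
        if j < 0:
            break
        out.append(text[i:j])
        pos = j + len(end)
    return out


def flatten(xml: str) -> List[Dict[str, Optional[str]]]:
    """Emit one relational row per movie, repeating the parent rental's columns."""
    table: List[Dict[str, Optional[str]]] = []
    # The trailing space in "<rental " keeps the enclosing <rentals> tag from matching.
    for rental in chunks(xml, "<rental ", "</rental>"):
        head = rental.split("<movie", 1)[0]  # parent attributes only (sketch assumption)
        parent = {
            "CustNum": cut(head, 'custnum="', '"'),  # None when the attribute is missing
            "CustName": cut(head, 'name="', '"'),
        }
        for movie in chunks(rental, "<movie", "/>"):
            table.append({
                **parent,
                "MovieName": cut(movie, 'name="', '"'),
                "Due": cut(movie, 'due="', '"'),
            })
    return table


if __name__ == "__main__":
    for row in flatten(XML):
        print(row)
```

The two rentals yield four rows in total, and the rows for the second rental carry None for CustNum, mirroring the NULL described above.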
WEBVIEWS Created via Joins to Database Tables

On some data retrievals, complex behavior is required to get to the proper page. The following code and relational mapping (Figure 4) handle pages whose URLs are numbered from 1 to 31, indicating the day on which they were created, and also contain category identifiers that may exist in a database. We combine the power of a SELECT statement inside the USING clause, which retrieves a list of categories from a database, with the iteration ability of the VARYING clause and the recursive nature of the LINK clause, leading to a very powerful routine. The code below retrieved four categories from a database and used them to build a dataset containing 18,086 auctions from over 439 Web pages in 5 minutes on a high-speed line [3].

    CREATE WEBVIEW auct
    USING (SELECT 'http://cayman.ebay.com/aw/listings/completed/category'+catid+'/day'+daynum+'page1.html' FROM category)
    VARYING daynum FROM 1 BY 1 TO 31
    REPLACE ("<td align=center width=\"6%\">-</td>", "<td align=center width=\"6%\">0</td>")
    TRIM ("<strong>item", "completed/day")
    LINK INCLUDE HOST ("]</a> <a href=\"", "\"")
    ROW ("ebayisapi.dll?", "</tr>")
    COLUMN
        AuctionID VARCHAR ("ViewItem&item=", "&"),
        ItemText VARCHAR (">", "</a>"),
        Pix EXISTS ("pic.gif"),
        URL URL,
        SellingPrice NUMBER ("<b>$", "<"),
        Bids NUMBER ("<td align=center width=\"6%\">", "<");

Figure 4. Relational Mapping Created

[3] WEBVIEW joins to other WEBVIEWs were also tested. Since a WEBVIEW mimics a read-only table, these joins were successful.
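The combination of a SELECT statement in the USING clause with a VARYING loop amounts to generating one starting URL per category and per day. The Python sketch below illustrates that expansion with an in-memory SQLite database. The host example.com, the category values, and the function names varying, expand_urls, and demo_connection are all hypothetical; SQLite's || operator replaces the + concatenation used in the AgentSQL example. No pages are fetched.

```python
import sqlite3
from typing import Iterable, List


def varying(start: int, finish: int, increment: int = 1) -> range:
    """Mimic VARYING var FROM start BY increment TO finish (finish is inclusive)."""
    return range(start, finish + 1, increment)


def expand_urls(conn: sqlite3.Connection, days: Iterable[int]) -> List[str]:
    """Run the URL-building SELECT once per loop value and collect every resulting URL."""
    query = (
        "SELECT 'http://example.com/completed/category' || catid "
        "|| '/day' || ? || 'page1.html' "
        "FROM category ORDER BY catid"
    )
    urls: List[str] = []
    for daynum in days:
        urls.extend(row[0] for row in conn.execute(query, (daynum,)))
    return urls


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway category table with four made-up category ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE category (catid INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO category (catid) VALUES (?)", [(10,), (20,), (30,), (40,)])
    return conn


if __name__ == "__main__":
    seeds = expand_urls(demo_connection(), varying(1, 31))
    print(len(seeds), "starting URLs")
    print(seeds[0])
    print(seeds[-1])
```

Four categories and 31 days give 124 starting URLs in this sketch. Any further pages would be reached by following links from those starting pages, which is the role the LINK clause plays in the statement above.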
CONCLUSION

In this research, we introduce the Webview, an SQL language extension that can collect external Web data and disseminate it to a corporate database based on the varied information needs of the organization. The tool and the SQL extension allow us to manipulate the data from Web pages. The Webview can download an enormous amount of data from a large number of Web pages (see Figure 4). Because the derived data is not explicitly stored, it is not static: up-to-date information is made available whenever the query is made. Also, because Web data is not stored in the corporate databases in various formats, redundancy and duplication of data are avoided. Tools developed using this extension have the potential to impact corporate competitive strategies, supplier and client relations, and corporate research. For researchers, this language and tool allow the building of relatively cost-free databases of actual transaction, economic, and market data that exist on the Web.

REFERENCES

Chung, H. M., Gray, P., Summer 1999, Special Section: Data Mining, Journal of Management Information Systems 16 (1), 11.

Codd, E. F., 1972, Further normalization of the data base relational model, Data Base Systems (New York), Prentice-Hall, Englewood Cliffs, NJ, pp. 33-64.

Davenport, T. H., Prusak, L., 1998, Working Knowledge: How Organizations Manage What They Know, Harvard Business Press (Cambridge, MA).

Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D., May 17, 1999, A query language for XML, Computer Networks 31 (11), 1155-1169.

Horwitch, M., Armacost, R., May/Jun 2002, Helping Knowledge Management Be All It Can Be, The Journal of Business Strategy 23 (3), 26-31.

Kauffman, R. J., March, S. T., Wood, C. A., December 2000, Mapping Out Design Aspects for Data-Collecting Agents, International Journal of Intelligent Systems in Accounting, Finance, and Management 9 (4), 217-236.

Krishnan, R., Li, X., Steier, D., Zhao, L., September 2001, On Heterogeneous Database Retrieval: A Cognitively-guided Approach, Information Systems Research 12 (3), 286-303.

Lakshmanan, L. V. S., Sadri, F., Subramanian, S. N., 2001, SchemaSQL: An extension to SQL for multidatabase interoperability, ACM Transactions on Database Systems 26 (4), 476-519.

Mobasher, B., Cooley, R., Srivastava, J., August 2000, Automatic Personalization Based on Web Usage Mining, Communications of the ACM 43 (8), 142-151.

Nonaka, I., February 1994, Dynamic Theory of Organizational Knowledge Creation, Organization Science 5 (1), 14-37.

Rigby, D., 2001, 2001: Management Tools: Annual Survey of Senior Executives, available at http://www.bain.com/bainweb/expertise/tools/overview.asp