A Layered Architecture for Querying Dynamic Web Content

Hasan Davulcu, University at Stony Brook, davulcu@cs.sunysb.edu
Juliana Freire, Bell Laboratories, juliana@research.bell-labs.com
Michael Kifer, University at Stony Brook, kifer@cs.sunysb.edu
I.V. Ramakrishnan, University at Stony Brook, ram@cs.sunysb.edu

Abstract

The design of webbases, database systems for supporting Web-based applications, is currently an active area of research. In this paper, we propose a 3-layer architecture for designing and implementing webbases for querying dynamic Web content (i.e., data that can only be extracted by filling out multiple forms). The lowest layer, the virtual physical layer, provides navigation independence by shielding the user from the complexities associated with retrieving data from raw Web sources. Next, the traditional logical layer supports site independence. The top layer is analogous to the external schema layer in traditional databases. Within this architectural framework we address two problems unique to webbases: retrieving dynamic Web content in the virtual physical layer and querying of the external schema by the end user. The layered architecture makes it possible to automate data extraction to a much greater degree than in existing proposals. Wrappers for the virtual physical schema can be created semi-automatically, by asking the webbase designer to navigate through the sites of interest; we call this approach mapping by example. Thus, the webbase designer need not have expertise in the language that maps the physical schema to the raw Web (this should be contrasted to other approaches, which require expertise in various Web-enabled flavors of SQL). For the external schema layer, we propose a semantic extension of the universal relation interface. This interface provides powerful, yet reasonably simple, ad hoc querying capabilities for the end user compared to the currently prevailing canned form-based interfaces on the one hand or complex Web-enabling extensions of SQL on the other.
Finally, we discuss the implementation of the proposed architecture.

[Author footnotes: This work was done while the author was at Bell Laboratories. Supported in part by NSF grant IRI-9404629. Supported in part by NSF grants CCR-9705998, 9711386.]
[Appears in SIGMOD 99.]

1 Introduction

The trend of using the World Wide Web as the medium for electronic commerce continues to grow. Web users need to obtain information in ways that cannot be directly accomplished by the current generation of Web search engines. It is typical for a user to obtain information by filling out HTML forms (e.g., to retrieve product information at a vendor's site or classified ads in newspaper sites). This process can become rather tedious when users need to make complex queries against information at multiple sites, e.g., "make a list of used Jaguars advertised in New York City area, such that each car is a 1993 or later model, has good safety ratings, and its selling price is less than its Blue Book value." Answering such complex queries is quite involved, requiring the user to visit several related sites, follow a number of links and fill out several HTML forms. Thus the problem of developing tools and techniques for creating Web-based applications that allow end users to shop around for products and services on the Web, without having to tediously fill out multiple forms manually, is both interesting and challenging. It is also of considerable importance in view of a recent survey contending that a large share of the data on the Web can only be accessed via forms [18].

Not surprisingly, the design of database systems for managing and querying data on the Web, called webbases (e.g., in [25]), is an active area of current database research. A significant body of research covering a broad spectrum of topics, including modeling and querying the Web, information extraction and integration, continues to be developed (see [8] for a survey).
Nevertheless, research on the design of tools and techniques for managing and querying the dynamic Web content (i.e., data that can only be extracted by filling out one or more forms) is still in a nascent stage. There are several problems in designing webbases for dealing with dynamic Web content. Firstly, there is the problem of navigation complexity. For instance, while there have been a number of works that propose query languages for Web navigation [27, 17, 16, 4], they are only beginning to address the difficult problem of querying sites in which most of the information is dynamically generated. Navigating such complex sites requires repeated filling out of forms, many of which are themselves dynamically generated by CGI scripts as a result of previous user inputs. Furthermore, the decision regarding which form to fill out next and how, or which link to follow, might depend on the contents of a dynamically generated page. Secondly, given the dynamic nature of the Web, in order to build a practical tool to retrieve dynamic content from Web sites, one needs to devise automatic ways to extract and maintain navigation processes from the site structure. Lastly, once navigation processes have been
derived, one needs to query the information they represent. Although traditional databases also provide sophisticated query languages, such as SQL or QBE, these interfaces are rarely exposed to the casual user, since they are still considered too complex. Naive users are usually given canned queries needed to perform a set of specific tasks. These canned interfaces served well in the case of fairly structured corporate environments, but they are too limiting for the wide audience of Web users. A webbase would certainly benefit from a query language that is flexible enough to support interesting types of ad hoc querying and yet is simple and natural to use.

To address these problems, we propose a layered architecture, analogous to the traditional layering of database systems, for designing and implementing webbases for querying dynamic Web content. In our architecture, the lowest layer, which we call the virtual physical layer, provides navigation independence because it shields the user from the complexities associated with retrieving data from raw Web sources. Next up, the logical layer, which is akin to the traditional logical database layer, provides site independence. Finally, the external schema layer is functionally analogous to the corresponding layer in traditional databases. This analogy in terms of layering allows us to focus on developing techniques for problems that are unique to webbases; for problems that are common to both webbases and traditional databases, we can directly use the already known techniques. Based on the database analogy, we can readily identify that the problem of mapping the logical to the physical layer in traditional databases is similar to what needs to be done in webbases with respect to the corresponding logical and virtual physical layers. Thus the techniques developed in traditional databases for this mapping, such as schema integration and mediators, can all be directly applied to webbases.
On the other hand, retrieving the dynamic Web content in the virtual physical layer is a problem unique to webbases. Unlike the physical layer in traditional databases, we have no control over the data sources in the Web. Automating retrieval of data from such sources, especially those generated by forms, is difficult. Similarly, there are important differences at the external schema layer. Indeed, Web users form a far larger audience, generally with a much wider variation of skill levels than corporate database users. For them, traditional query languages such as SQL are too complex. At the same time, the diverse nature of the audience makes it difficult to prepare satisfactory canned queries in many areas. Also, preparing canned interfaces for each domain can be expensive. Thus, it is desirable to have a query interface that both permits ad hoc querying and is simple to use.

In brief, our approach to both of the above problems is as follows. Mapping the relational schema onto the raw Web requires a calculus or algebra of some sort to specify navigation expressions that populate the schema with data. This part is not new, as other projects attempted the same (see, e.g., [5]). However, these approaches have shortcomings. The webbase designer is required to have expertise in the underlying calculus, which is usually some Web-enabling extension of SQL or relational algebra. Reported experiments [26] suggest that users resist this idea, because the underlying navigation languages are hard to master. In addition, given that Web sites change frequently, maintaining manually generated navigation expressions can be an arduous task. What is different in our approach is that by separating the virtual physical layer from the logical layer we can create navigation expressions semi-automatically, through an interactive process that does not require the user to have any expertise in the formalism underlying the navigation calculus, and the webbase designer does not even need to see what the navigation expressions look like.
To support such a degree of automation and be able to represent complex navigation processes, the underlying formalism must have these properties:

- It must be high-level and declarative, as it is much easier to create high-level specifications of navigation processes.
- It must be compatible with the formalism that underlies database query languages (i.e., with relational calculus), so that it is possible to compose user queries with navigation expressions in order to create a single expression that would ultimately fetch the desired answer to the query. This is akin to the process of answering queries against views, where the view definition is substituted into the query. If the resulting expression is still part of some declarative formalism, then the entire query can be optimized using techniques that are akin to relational algebra transformations (but we do not discuss such techniques here).
- Due to the nature of the processes being modeled, the navigation calculus must support procedural and declarative specifications in the same formalism. For instance, at a high level, the calculus should support statements such as "do this after doing that" or "do this provided that".
- The high-level specification formalism must be object-oriented. Web navigation has to deal with complex structures such as Web pages and forms in a declarative environment, and these structures are best represented as objects.
- Navigation calculus expressions should be executable specifications themselves.

In our system, we chose a subset of Transaction F-logic [12], which, to the best of our knowledge, is the only language that supports all the above features in a uniform fashion. Transaction F-logic is an amalgamation of two other well-known formalisms: F-logic [14] and Transaction Logic [6]. Although our navigation calculus is much more powerful (and complex) than other proposed languages for Web navigation, the Web designer does not need to know anything about it.
Our approach makes it possible to create all necessary wrappers for the virtual physical schema semi-automatically, by simply asking the webbase designer to
navigate through the sites of interest. We call this approach mapping by example. The virtual physical layer and the navigation calculus are described in Sections 3 and 4. For the external schema layer, we propose a semantic extension of the universal relation interface [24, 23], which we call structured universal relation. We argue that this interface provides powerful, yet reasonably simple ad hoc querying capabilities for the end user (e.g., a Web shopper) compared to the currently prevailing canned, form-based interfaces on the one hand and complex Web-enabled extensions of SQL on the other. The external schema layer is described in Section 6. Apart from the aforesaid sections, Section 2 introduces our layered architecture; Section 5 discusses the problems associated with the logical layer of a webbase; our implementation effort of the proposed architecture is described in Section 7; related work appears in Section 8; and concluding remarks in Section 9.

[Figure 1: Traditional database architecture vs. webbase architecture]

2 Architecture for the WebBase

The most significant difference between a webbase and a database is the absence of the physical level in the traditional sense. Indeed, actual data is the exclusive domain of the Web server, and the only way the webbase can access the data is through filing requests to the server by following links or by filling out forms. Therefore, we introduce the notion of the virtual physical database schema (VPS), which represents all the data there is to see by filing requests to the server.
In many cases, the VPS layer cannot be constructed completely (or we might never know whether the known part of the VPS is complete). While the role of the physical layer in databases is to describe data storage, the role of VPS in webbases is to specify how to navigate to the various sources of information in the Web. In this way, VPS provides navigation independence for webbase systems and presents a database view of the Web to the upper layers of the webbase. In this paper, we use the relational model to represent data in webbases. More details on the VPS layer appear in Sections 3 and 4. We remark that, since the main focus in this paper is querying and navigation, we do not discuss updates and methods for data extraction from HTML pages. To the best of our knowledge, the former issue has not received much attention, while the latter has been researched extensively.

At the VPS layer, data collected from different sources resides in different relations, thus semantic and representational discrepancies are likely to exist between these relations. For instance, prices could be represented using different currencies and semantically identical attributes can have different names. These differences are smoothed out at the logical layer of the webbase architecture, which provides site independence, i.e., independence from the specifics of the data sources that supply data to the webbase. Further details on the logical schema are presented in Section 5. We should note that resolution of semantic and representational differences between sites is not the subject of this paper. There is a vast body of research dedicated to this topic, and we could use the techniques developed there.

The top level in the webbase architecture is the external schema layer, which targets specific application domains (e.g., used car ads, computer equipment, etc.) and is supported by a user interface that permits a high degree of ad hoc querying by naive Web users.
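The smoothing of representational differences at the logical layer can be illustrated with a minimal sketch. It is not the paper's mechanism: the attribute names, the two sample sites, and the exchange rates below are all made up for illustration, and the only idea taken from the text is that differently named attributes and different currencies are mapped to one vocabulary.

```python
from typing import Callable

# Hypothetical exchange rates, chosen only to make the arithmetic easy to follow.
USD_PER_UNIT = {"USD": 1.0, "CAD": 0.5}


def make_mapper(renames: dict[str, str], currency: str) -> Callable[[dict], dict]:
    """Build a function that maps one VPS tuple to the logical-layer vocabulary."""
    def mapper(row: dict) -> dict:
        # Rename site-specific attribute names to the logical names.
        out = {renames.get(name, name): value for name, value in row.items()}
        # Express every price in one currency.
        out["price"] = round(out["price"] * USD_PER_UNIT[currency], 2)
        return out
    return mapper


def logical_relation(sources):
    """Union the mapped tuples of all sources; the originating site is not exposed."""
    return [mapper(row) for rows, mapper in sources for row in rows]


# Two hypothetical VPS relations with different attribute names and currencies.
site_a = [{"make": "Jaguar", "model": "XJ6", "price": 20000.0}]
site_b = [{"manufacturer": "Jaguar", "model": "XJS", "cost": 30000.0}]

used_cars = logical_relation([
    (site_a, make_mapper({}, "USD")),
    (site_b, make_mapper({"manufacturer": "make", "cost": "price"}, "CAD")),
])
```

After the mapping, both tuples in `used_cars` carry the same attribute names and a price in the same currency, which is the kind of uniformity a query against the logical layer can rely on.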
As mentioned earlier, for such users traditional query interfaces are either too complex (SQL, QBE) or too rigid (canned and form-based). Thus, we need a query language that is flexible enough to support interesting types of ad hoc querying and yet is simple to use. In search of such a language, we resurrected the Universal Relation (UR) query interface. The details of our implementation of the UR are presented in Section 6. The following example illustrates the distinctions among different levels of abstraction in a dynamic webbase.

Example 2.1 (Used Cars) A webbase for used car shopping in the metropolitan New York area might access several sites, such as newspapers (Newsday and New York Times), new car buying services (Car Point and Auto Web), blue book price references (Kelly's), reliability information (Car and Driver) and finance (Car Finance). We present a possible set of VPS relations that can be extracted from these sites. To make the tables more compact, we use a shorthand for a group of attributes [shorthand symbol and attribute list garbled in the source]. Table 1 shows examples of VPS relations for various Web sites. The first line in the table illustrates that data for the Newsday site might be presented in multiple hyper-linked pages, and depending on the user's request, data extraction might require navigating multiple pages, e.g., newsday and newsdaycarfeatures.

[Footnote: Newsday is a regional newspaper with circulation in Long Island and New York City. Although this example describes these sites fairly accurately, for illustration purposes we introduce simplifications as well as bring in features found in other sites.]

[Table 1: VPS Level Relations. Columns: Description; VPS Level Relations. Rows: Used Car Ads, Dealer Cars, Blue Book Prices, Reliability, Interest Rates. The relation schemas are garbled in the source.]

The logical level relations for our webbase and their associated relational schemas are presented in Table 2, along with the corresponding mappings to the VPS layer. The external schema layer is represented by the following universal relation, UsedCarUR, which contains the union of all the attributes of the logical layer:

[Schema of UsedCarUR; the attribute list is garbled in the source]

The mapping between the external and the logical layer in the Universal Relation model is a rather subtle issue. In Section 6, we show that the known approaches (e.g., [23]) are not suitable for Web applications and discuss a possible solution.

Now, the query posed in Section 1, "make a list of used Jaguars advertised in New York City area sites such that each car is a 1993 or later model, has good safety ratings, and its selling price is less than its Blue Book value," can be expressed against our webbase as follows:

[Query against UsedCarUR; the listing is garbled in the source]
3 Virtual Physical Schema

An important difference between webbases and traditional databases is that webbases do not control the physical data and there are limited ways in which this data can be retrieved. Given a virtual physical schema (VPS) for a relation, the corresponding data can usually be obtained only by filling out a form, which requires that the user specify values for a certain selection of attributes, some of which might be mandatory and some optional. In fact, there might be several alternative sets of optional/mandatory attributes per relation that limit the scope of data to be retrieved. In addition, we must specify the navigation process that needs to be executed in order to get the data. This process is represented using Navigation Calculus, which is described in the next section. Therefore, for each relation schema $R$ in the VPS layer, there is a quadruple, called a handle, represented as follows:

$H = \langle \textit{mandatory-attrs}, \textit{selection-attrs}, R, \textit{expression} \rangle$

The set of mandatory attributes specifies the minimum information that the handle needs in order to invoke the navigation calculus expression (the fourth component) and retrieve the requisite data. The set of selection attributes specifies the additional attributes that might also be specified. These additional attributes are used by the expression and are eventually passed to the various Web servers, which, presumably, use these attributes to return more specific answers. For convenience, we assume that $\textit{mandatory-attrs} \subseteq \textit{selection-attrs}$.

There can be several handles for the same relation. Different handles for the same relation must use different sets of mandatory attributes. However, different handles can have the same sets of selection attributes and the same navigation expression (for instance, the same HTML form might have two alternative sets of attributes; at least one of them must be filled in order to get a result).
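The handle quadruple described above can be sketched as a small data structure. This is an illustrative reading of the definition, not the paper's implementation: the class name `Handle`, the helper `pick_handle`, the attribute names (`make`, `dealer`, and so on) and the string standing in for the navigation expression are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Handle:
    mandatory: frozenset   # attributes that must be bound before invocation
    selection: frozenset   # all attributes that may be bound (includes mandatory)
    relation: str          # name of the VPS relation R
    expression: str        # stand-in for the navigation calculus expression

    def __post_init__(self):
        # The text assumes mandatory-attrs is a subset of selection-attrs.
        if not self.mandatory <= self.selection:
            raise ValueError("mandatory-attrs must be a subset of selection-attrs")

    def accepts(self, bound: set) -> bool:
        """A handle can be invoked only if every mandatory attribute is bound."""
        return self.mandatory <= bound

    def usable_bindings(self, bound: set) -> set:
        """Bound attributes that the handle can actually pass to the Web server."""
        return bound & self.selection


def pick_handle(handles, bound: set) -> Optional[Handle]:
    """Return the first handle whose mandatory attributes are all bound, if any."""
    for handle in handles:
        if handle.accepts(bound):
            return handle
    return None


# Two hypothetical handles for one relation: same selection set and expression,
# different mandatory sets (as with a form offering two alternative required fields).
fields = frozenset({"make", "dealer", "model", "year"})
by_make = Handle(frozenset({"make"}), fields, "newsday", "<navigation expression>")
by_dealer = Handle(frozenset({"dealer"}), fields, "newsday", "<navigation expression>")
```

For instance, `pick_handle([by_make, by_dealer], {"make", "model"})` selects `by_make`, while a query that binds only `year` matches neither handle and cannot be answered through this relation.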
We assme that all handles for the same relation agree with each other: if À Á ½4{ $ ¾ and À ÄÁ ½W{ - ¾ are two handles for the same relation and we specify concrete ales for a set of attribtes  sch that ÆÅ ÇÂÈ ÇÂ, then handles ÊÉ À and À retrn the same reslt. Table 3 shows the sets of mandatory and selection attribtes for some relations in the VPS of Example 2.1. The first colmn in the table lists relation schemas, the second colmn shows mandatory attribtes for each schema, and the third shows the optional attribtes (= selection-attrs mandatory-attrs). 4 Naigation Calcls Naigation maps. The basic data strctre that enables atomated access to irtal relations residing in the VPS of a webbase are the naigation maps for the participating sites. Intitiely, a naigation map codifies all possible access paths that a site presents for poplating a irtal relation. A naigation map is a labeled directed graph (see Figre 2) where the nodes represent the strctre of static or dynamic Web pages, and the labeled edges represent possible actions (i.e., following a link or filling ot a form) that can be exected from a dynamic page. Or naigation maps are closely related to the Web schemes of the Aranes project [25, 5], bt or modeling of the Web is process-oriented, which facilitates creation of the naigation expressions from
navigation maps.

[Table 2: Logical Level Relations. Columns: Logical Level Relations; Definitions (mappings to the VPS layer). The table entries are garbled in the source.]

[Table 3: Virtual Physical Schema. Columns: VPS; Mandatory; Optional. The table entries are garbled in the source.]

Mapping the virtual physical schema onto the raw Web requires a calculus of some sort. One obvious candidate would be the relational calculus or algebra, extended with Web-specific primitives (and some other known extensions, like the unnesting operator of Ulixes [5]). The Araneus and the Ariadne projects [5, 15] take this approach. However, these formalisms are not powerful enough to express complex navigation processes on the Web. For instance, as shown in Figure 2, a navigation process to access the used car ads in the classified section at the site www.newsday.com requires following a link (link(auto)), filling out a form (form f1(make)), then making an if-then-else choice depending on the resulting page: if the page is not a data page, another form (form f2(model, featrs)) will have to be filled out. The length of the sequence is not fixed. It is usually one or two, depending on the number of answers that match the initial query.
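The idea that a navigation map is a labeled directed graph codifying access paths can be made concrete with a short sketch. The node and edge labels below are taken from the Newsday example in the text, but the exact topology is an assumption (Figure 2 is not reproduced here), and `access_paths` is a hypothetical helper, not part of the paper's system.

```python
# Simplified, assumed navigation map: node -> list of (action label, target node).
NAV_MAP = {
    "newsday":   [("link(auto)", "UsedCarPg")],
    "UsedCarPg": [("form f1(make)", "carPg"), ("form f1(make)", "carData")],
    "carPg":     [("form f2(model, featrs)", "carData")],
    "carData":   [("link(more)", "carData")],  # "More" button stays on data pages
}


def access_paths(nav_map, start, target):
    """Enumerate action sequences leading from start to target, ignoring cycles."""
    paths = []

    def walk(node, actions, seen):
        if node == target:
            paths.append(actions)
            return
        for label, nxt in nav_map.get(node, []):
            if nxt not in seen:
                walk(nxt, actions + [label], seen | {nxt})

    walk(start, [], {start})
    return paths


paths = access_paths(NAV_MAP, "newsday", "carData")
```

With this map, `paths` contains two action sequences: one that reaches the data page after form f1 alone, and one that needs form f2 as well, mirroring the one-or-two-form behavior described above.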
Once the final data page is reached, an iteration to collect data is needed (repeatedly hitting the More button). Examples like this and our experience with other, more complex, sites show that navigation processes are best represented using a calculus that allows recursion and has the notion of ordering of events. In addition, the calculus must deal with complex structures, such as Web pages, forms, etc., which are best represented as objects.

[Footnote: Observe that the user-level view of the database is represented using the relational model. However, the underlying navigation process (which is invisible to the end user) is based on the object model, since it has to deal with Web pages and other objects, which are not part of the user view.]

Unlike other projects that deal with navigation processes on the Web, we do not invent yet another, new navigation algebra or calculus. The calculus that satisfies all the requirements stated above is actually well-known: it is a subset of serial-Horn Transaction F-logic [12], a natural cross between Transaction Logic [6] and F-logic [14]. In fact, the Florid system [9], based on F-logic, has proved to be very successful for Web applications. Because Florid lacks the Transaction Logic component, it is not suitable to be used as a calculus for encoding navigation processes.

[Figure 2: Navigation map for Newsday Classified Car Ads. Labels recoverable from the figure: newsday, UsedCarPg, carPg, new_car_dealer, collectible_cars, sport_utility, cardata(make, model, year, ...), newsdaycarfeatures(features, picture), link(auto), link(l1), link(l3), link(l4), link(listing), link(more), link(car_features), form f1(make), form f2(model, featrs).]

The object model. F-logic extends classical logic by making it possible to represent complex objects on a par with traditional flat relations. A navigation map is a collection of F-logic objects, such as the following object that represents one of the forms to be filled out at the Newsday site:

[F-logic definition of the object submit_form; the listing is garbled in the source]

In the first line, submit_form : action says that the object submit_form belongs to class action. It has attributes form and source. The attribute form has the value form01, which represents the form to be filled out. form01 is itself a complex object with four attributes: cgi and method are single-valued, and mandatory and optional are multi-valued. The attribute source of the object submit_form represents the page to which the action belongs. In addition, the object has a method, doit (defined below), whose purpose is to execute the action.

Figure 3 presents the schemas (called signatures) of some of the objects we use to model navigation maps. The double-shafted arrows, => and =>> (as opposed to -> and ->> in the previous example), signify that these expressions declare the types of the attributes and methods rather than their states.

[Figure 3: Common WWW Data Structures Represented in Navigation Calculus. The attribute names in the signatures are garbled in the source; the accompanying annotations are recoverable:
  Class Action: the object the action applies to (a form or a link); the page where the action belongs; where the action could lead; a method to execute the action. Form fillout is an action; following a link is an action.
  Class WebPage: URL of the page; title of the page; HTML contents of the page; list of actions found in the page. The class of data Web pages is a subclass of WebPage, and data pages have a data extraction method.
  Class Link: name of the link; URL of the link.
  Class Form: URL of the CGI script associated with the form; CGI invocation method; mandatory attributes of the form; optional attributes of the form; state of the form (a set of attribute-value pairs).
  Class AttrValPair: name of the attribute part; type (checkbox, select, radio, text, etc.); default value of the attribute; the value part.
  The figure also declares the current URL of the browsing process PID.]

[Figure 4: The Navigation Process of Retrieving Used Car Advertisements from Newsday Site. The Transaction F-logic rule is garbled in the source; its annotations read: find car ads at Newsday: follow link(auto) to used car ads; fill form(f1) using Make; either extract data, or fill form(f2) using Model, then extract data.]

Navigation expressions. F-logic provides a declarative calculus for representing complex objects on the Web, but to model navigation processes one needs a formalism for the representation and sequencing of actions. These facilities are provided by Transaction Logic [6], a conservative extension of classical logic, which is suitable for representing complex declarative processes that both query the underlying database state and update it. The following subset of serial-Horn Transaction Logic complements the subset of F-logic described above and provides an expressive navigation calculus for enabling access to raw Web data. For the purpose of this paper, it suffices to explain the informal, procedural reading of some of the connectives of
Transaction Logic. Underlying the logic and its semantics is a set of database states and a collection of paths. A path is a finite sequence of database states. For instance, if $s_1, \ldots, s_n$ are database states, then $\langle s_1, \ldots, s_n \rangle$ is a path of length $n$. Just as in classical logic, Transaction Logic formulas assume truth values. However, unlike classical logic, the truth of these formulas is determined over paths, not at states. If a formula, $\phi$, is true over a path $\langle s_1, \ldots, s_n \rangle$, it means that $\phi$ can execute starting at state $s_1$. During the execution, the current state will change to $s_2$, $s_3$, ..., etc., and the execution terminates at state $s_n$. Procedurally, a Transaction Logic formula can be understood as a transaction or a query (depending on whether it changes the database state or not). Semantically (and procedurally) these formulas have several common attributes of database transactions, such as atomicity and isolation. With this in mind, the intended meaning of the new connectives of Transaction Logic can be summarized as follows:

- $\phi \otimes \psi$ means: execute $\phi$, then execute $\psi$.
- $\phi \vee \psi$ means: execute $\phi$ or execute $\psi$ non-deterministically. This connective is useful for specifying alternative execution branches in a navigation process.

Figure 4 shows the navigation process to extract car ads from the Newsday site. Here we use path expressions as shortcuts for longer F-logic expressions, as described in [14, 13]. For the benefit of the reader who is not fluent in Transaction F-logic, we annotated each clause in Figure 4, so the meaning of the navigation expression should be self-explanatory.

Navigation expressions do not always need to be that complex. For instance, navigating to the NewsdayCarFeatures page, which is part of our VPS, can be achieved using the much simpler expression:

[Navigation expression for newsdayCarFeatures; the listing is garbled in the source]

This expression says that to get to a car features page, we must first get to a data page (DataPg) that has a link called Car Features, follow this link, and then extract the features from the page. Of course, in a bigger system, we would have to qualify this initial page even further to avoid mis-navigation. The interesting point here is that the page denoted as DataPg is not an entry point to any navigation process that a regular Web user might perform. Indeed, as seen in Figure 4, one has to fill out one or two forms to reach this page. However, this is not a concern at the VPS layer. It is a job of the logical layer, described in Section 5, to order joins in such a way that the relation newsday of Figure 4 is computed first (to retrieve the desired page, DataPg).

Even if the above Transaction F-logic expressions look a bit complex to the reader, the most important aspect of our webbase architecture is that nobody, except the system builder, ever needs to see these expressions. It is easy to see that the above expressions closely mimic the structure of the navigation map in Figure 2 and, in fact, they can be derived automatically from that map in time linear in the size of the map. Due to space limitations, we do not present the translation algorithm, because this would require that we spell out the structure of the navigation maps in much greater detail. Details are given in the full version of this paper.

It is important to realize that the translation from the map to the calculus expression has been greatly facilitated by: our process-oriented object model, whose objects correspond to nodes and links of the navigation map; the fact that the F-logic component of our navigation calculus naturally supports this object model; and the Transaction Logic component that represents the process structure encoded in the navigation map.
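The informal, procedural reading of the two connectives ("execute this, then that" and "execute this or that") can be mimicked with a toy interpreter. This is only a sketch of that reading over a simulated site: it is not a Transaction F-logic interpreter, it does not model atomicity or isolation, and the names `serial`, `choice`, `action`, `extract`, the `SITE` table and the sample tuple are all assumptions introduced for illustration.

```python
# A state is (current page, tuples extracted so far). A transaction maps a state
# to an iterator over the states in which it can terminate (none means failure).

def serial(*steps):
    """Serial conjunction: run the steps one after another."""
    def run(state):
        if not steps:
            yield state
            return
        first, rest = steps[0], serial(*steps[1:])
        for mid in first(state):
            yield from rest(mid)
    return run


def choice(*alternatives):
    """Non-deterministic choice: any alternative that succeeds may be taken."""
    def run(state):
        for alternative in alternatives:
            yield from alternative(state)
    return run


def action(label, site):
    """Follow an edge of the simulated site; fails if the page has no such edge."""
    def run(state):
        page, data = state
        target = site.get((page, label))
        if target is not None:
            yield (target, data)
    return run


def extract(page_data):
    """Extract tuples from the current page; fails on pages that carry no data."""
    def run(state):
        page, data = state
        if page in page_data:
            yield (page, data + tuple(page_data[page]))
    return run


# Simulated site in which form f1 alone does not reach a data page.
SITE = {
    ("newsday", "link(auto)"): "UsedCarPg",
    ("UsedCarPg", "form f1"): "carPg",
    ("carPg", "form f2"): "carData",
}
PAGE_DATA = {"carData": [("Jaguar", "XJ6", 1994)]}

newsday_ads = serial(
    action("link(auto)", SITE),
    action("form f1", SITE),
    choice(extract(PAGE_DATA),
           serial(action("form f2", SITE), extract(PAGE_DATA))),
)

results = list(newsday_ads(("newsday", ())))
```

Here the first branch of the choice fails because the page reached after form f1 has no data, so the only successful execution fills form f2 and then extracts, which follows the annotated steps of Figure 4.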
Finally, once the translation is done, the resulting navigation expressions can be directly executed by a Transaction F-logic interpreter when user queries posed against the external schema level of the webbase eventually turn into queries against the VPS layer.

5 Logical Layer

The VPS layer provides a relational view of data that can be retrieved from a Web site, thereby hiding navigation details. In contrast to this, we use a logical layer to provide a uniform interface to data arriving from multiple sources. By separating these layers, we achieve site independence. This means both independence from the differences in vocabulary and representation used by different sites as well as complete transparency with respect to where the data is coming from. Table 2 shows a possible mapping of logical relations into VPS relations. While the VPS layer has eight relations that shield the user from navigation details, the five logical relations in the example show a view of the Web data that is completely transparent with respect to the location of the data source.

In this paper, we are not concerned with the issues pertaining to the mapping of the logical layer onto VPS. This mapping can be done using conventional techniques (e.g., relational algebra, or Datalog rules) or we could use more advanced techniques, which might offer certain advantages in the Web environment (e.g., [8, 19]). However, one issue related to this mapping must be addressed. The problem is that, unlike traditional databases, VPS relations can only be accessed by supplying values for certain sets of mandatory attributes. Since logical relations are mapped onto the physical ones, it is clear that they also can be accessed only by providing values for certain attributes. The process of determining these sets of attributes is called binding propagation (because, in abstract terms, sets of mandatory attributes in HTML forms correspond to variable bindings in programming languages).
The problem of binding propagation has been well studied in the literature (see e.g., [7, 29]). In the following, we propose a much simpler description, which also differs from other works in two respects: (1) it handles not only conjunctive queries, but also all relational algebraic queries; and (2) instead of deriving bindings for a given query on the fly, it statically determines all allowed bindings for each logical relation.

[Footnote: The full version of this paper is available at http://www-db.research.bell-labs.com/users/juliana.]

Let $E$ denote a relational algebraic expression over VPS relations, and suppose we need to determine the bindings (or sets of mandatory attributes) for the resulting relation of this expression. The binding propagation algorithm can be described by the following rules, each corresponding to one of the allowed relational operators:

Let $E = R$, where $R$ is a VPS relation. If $B$ is a binding for $R$, then $B$ is also a binding for $E$.
Let $E = E_1 \cup E_2$ or $E = E_1 - E_2$, where $E_1$ and $E_2$ are relational expressions over VPS relations. If $B_1$ is a binding for $E_1$ and $B_2$ is a binding for $E_2$, then $B_1 \cup B_2$ is a binding for $E$.
Let $E = \sigma_C(E_1)$ or $E = \pi_A(E_1)$. If $B$ is a binding for $E_1$, then $B$ is also a binding for $E$.
Let $E = E_1 \bowtie E_2$. If $B_1$, $B_2$ are bindings for $E_1$, $E_2$, respectively, then $B_1 \cup (B_2 - (E_1 \cap E_2))$ and $B_2 \cup (B_1 - (E_1 \cap E_2))$ are both bindings for $E$. Here $E_1 \cap E_2$ denotes the set of common attributes of the relation schemas for $E_1$ and $E_2$.

From these rules, it is also easy to derive an algorithm for join ordering under the given set of bindings, i.e., an ordering $E_1, \ldots, E_n$ that guarantees that for each $i$ ($1 \le i < n$), all mandatory attributes of $E_{i+1}$ belong to the union of the attributes of $E_1, \ldots, E_i$ and the attributes bound in the query. Clearly, the existence of such an ordering is necessary and sufficient for a join to be computable under the given set of mandatory attributes. However, in the presence of multiple sets of mandatory attributes per VPS relation, such an algorithm would be exponential. In fact, [29] shows that the problem is NP-complete in this case.

To illustrate the above binding propagation algorithm, consider the logical level relation classifieds from Example 2.1.
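Before tracing the example by hand, the propagation rules for VPS relations, selection, projection, union, and join can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class names, the `minimal` helper (which discards bindings that strictly contain another binding), and in particular the relation schemas in the usage lines are hypothetical. Only the relation names newsday, newsdayCarFeatures, nytimes, and classifieds come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

Binding = FrozenSet[str]


@dataclass(frozen=True)
class Rel:                      # a VPS relation
    name: str
    schema: FrozenSet[str]
    bindings: Tuple[Binding, ...]


@dataclass(frozen=True)
class Select:
    child: object


@dataclass(frozen=True)
class Project:
    child: object
    attrs: FrozenSet[str]


@dataclass(frozen=True)
class UnionOp:
    left: object
    right: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def schema(e) -> FrozenSet[str]:
    """Attributes of the relation computed by expression e."""
    if isinstance(e, Rel):
        return e.schema
    if isinstance(e, Project):
        return e.attrs
    if isinstance(e, Select):
        return schema(e.child)
    if isinstance(e, UnionOp):
        return schema(e.left)
    return schema(e.left) | schema(e.right)   # Join


def bindings(e) -> Set[Binding]:
    """Statically derive the allowed bindings of e, one rule per operator."""
    if isinstance(e, Rel):
        return set(e.bindings)
    if isinstance(e, (Select, Project)):
        return bindings(e.child)
    pairs = [(b1, b2) for b1 in bindings(e.left) for b2 in bindings(e.right)]
    if isinstance(e, UnionOp):
        return {b1 | b2 for b1, b2 in pairs}
    common = schema(e.left) & schema(e.right)  # Join rule
    return {b for b1, b2 in pairs
            for b in (b1 | (b2 - common), b2 | (b1 - common))}


def minimal(bs: Set[Binding]) -> Set[Binding]:
    """Keep only bindings that do not strictly contain another binding."""
    return {b for b in bs if not any(o < b for o in bs)}


# Hypothetical schemas for the relations named in the text.
f = frozenset
newsday = Rel("newsday",
              f({"make", "model", "year", "price", "contact", "url"}),
              (f({"make"}),))
features = Rel("newsdayCarFeatures", f({"url", "features"}), (f({"url"}),))
nytimes = Rel("nytimes",
              f({"make", "model", "year", "price", "contact", "features"}),
              (f({"make"}),))
out = f({"make", "model", "year", "price", "contact", "features"})
classifieds = UnionOp(Project(Join(newsday, features), out),
                      Project(nytimes, out))
print(minimal(bindings(classifieds)))
```

Under these assumed schemas the sketch yields a single minimal binding, the set containing only `make`, for both the join and the union.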
Since $\{\mathit{Make}\}$ is the only mandatory attribute of the relation newsday and $\mathit{Url}$ is the only mandatory attribute of newsdayCarFeatures, by the join rule above, $\{\mathit{Make}\}$ turns out also to be the only mandatory binding for newsday $\bowtie$ newsdayCarFeatures. Similarly, $\{\mathit{Make}\}$ is the only mandatory attribute of nytimes. Therefore, by the union and projection rules, $\{\mathit{Make}\}$ is the only mandatory binding for classifieds.

[Footnote to the union rule: Here we assume that the user wants all available answers to the query. If the user is willing to accept only some available answers because she does not want or care to fill out all the required attributes in a form, then we could define a relaxed union. In a relaxed union, the binding of either operand (separately) would be an acceptable binding for the union.]

6 External Schema

Casual users query the webbase through the external schema. Traditionally, end users have been given access to limited interfaces that allow only a fixed set of canned queries. These canned interfaces served well in the case of fairly structured business environments, but, as remarked earlier, they are too limiting for a casual Web user. On the other hand, more flexible query languages, such as SQL or QBE, are too complex.

In search of a suitable query interface for webbases, we resurrected the idea of the Universal Relation (UR) [24]. The basic idea is simple and appealing. The user is presented with a list of all attributes that might be of interest for a particular application domain. To pose a query, the user simply points to a set of output attributes and imposes conditions on some other attributes. This is it: no joins, sheer simplicity. Of course, to realize such an agenda, the system (and the user) must know what such a query exactly means, and the understanding of that meaning by both the system and the user must coincide. Simplistically, the semantics of a universal relation query is explained as a natural join of the underlying relations at the logical layer, which cover the output and the selection attributes specified in the query.
Moreover, the join must be lossless. Losslessness is required because this is a formal analog of the common sense idea of connections between concepts that make sense. Underlying this idea are two basic assumptions:

1. The unique relationship assumption: The relationship between any given subset of attributes in the universal relation schema is unambiguous and unique; and
2. The unique role assumption: The name of an attribute unambiguously determines the role of that attribute.

The first problem arises even in very simple schemas that contain just four attributes. For instance, a customer and a bank might be connected because the customer has an account in the bank, a loan, or both. Which one did the user have in mind when she selected Bank and Customer as output attributes? A number of solutions were proposed to address the first problem, which range from restricting the topology of the underlying logical schema (e.g., acyclicity [21]) to additional layers of semantics (e.g., Maximal Objects, Window Functions [23, 22]). Unfortunately, on the Web, we cannot assume the very basic lossless join semantics for UR, since we cannot even assume any dependencies (join, functional, or multivalued) on which the very idea of losslessness is based. Nor can we use most of the approaches to enforcing or relaxing the unique relationship assumption, because these approaches rely heavily on the use of constraints.

The second problem, the unique role assumption, was assumed to be solvable by simple renaming of attributes. However, this solution was never thought to be practical and may have been responsible for the general lack of enthusiasm for the UR approach.

In our attempt to adapt the UR as a Web interface, we kept the basic idea of a simple query interface, but rejected the lossless join semantics and the two uniqueness assumptions. We call this approach structured universal relation. The basic idea is to replace losslessness and constraints with compatibility rules. A compatibility rule has either the form $R_1, \ldots, R_n \rightarrow R$ or the form $R_1, \ldots, R_n \rightarrow \neg R$. In the first case, the rule says that if you already joined $R_1, \ldots, R_n$, then joining with $R$ also makes sense. This is our poor man's lossless join requirement. The second rule is really a constraint. It says that if we have already joined $R_1, \ldots, R_n$, then joining with $R$ would create an incorrect relationship (in the UR model, such connections are known as "navigation traps").

With these constraints, we can formulate the semantics of a query as follows: Let $Q$ be a query that mentions the set of attributes $A = \{A_1, \ldots, A_m\}$. Then the semantics of this query is said to be the join $R_1 \bowtie \cdots \bowtie R_n$, where $R_1, \ldots, R_n$ is a minimal (with respect to inclusion) subset of logical relations that satisfy the compatibility rules, and contains all attributes in $A$. This is essentially our analogue of the maximal objects approach [23]. If there are several maximal objects covering the query attributes, then we take the union of results obtained from each object. Depending on the exact structure of the compatibility rules, algorithms of varying efficiency can be constructed. For instance, if the rules are of the form $R \rightarrow S$, then we have the restricted join-ordering problem mentioned in Section 5.

To address the problem of the unique name assumption, we propose to organize the attributes in the UR into a hierarchy of concepts. Each concept is a relation schema whose attributes are concepts of a lower layer. As shown in Figure 5, the top layer in this hierarchy is the universal relation itself, and the concepts are the attributes of that relation.
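The query semantics defined above (minimal sets of logical relations that cover the query attributes and respect the compatibility rules) can be sketched by brute-force enumeration. This is a minimal sketch, not the authors' algorithm, and it rests on several assumptions: a set of relations is taken to be compatible if no forbidding rule applies inside it and if its relations can be joined one at a time, each licensed by a permitting rule; the rule encodings, the relation schemas, and the function names `compatible` and `query_objects` are all hypothetical. The single forbidding rule encodes the constraint that a car cannot be leased from its owner.

```python
from itertools import combinations, permutations


def compatible(rels, allow, forbid):
    """Assumed reading of the rules: no forbidding rule fires inside the
    set, and some join order exists in which every relation after the
    first is licensed by a permitting rule over those already joined."""
    chosen = frozenset(rels)
    if any(r in chosen and premises <= chosen for premises, r in forbid):
        return False
    for order in permutations(rels):
        joined = {order[0]}
        for r in order[1:]:
            if not any(t == r and p <= joined for p, t in allow):
                break
            joined.add(r)
        else:
            return True
    return False


def query_objects(attrs, schemas, allow, forbid):
    """Minimal (w.r.t. inclusion) compatible sets covering attrs."""
    found = []
    names = sorted(schemas)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            chosen = frozenset(combo)
            if any(f < chosen for f in found):
                continue                      # a smaller object exists
            covered = set().union(*(schemas[n] for n in combo))
            if attrs <= covered and compatible(combo, allow, forbid):
                found.append(chosen)
    return found


# Hypothetical schemas for four subconcepts of the used-car domain.
schemas = {
    "Dealers": {"Car", "Price", "Contact"},
    "Classifieds": {"Car", "Price", "Contact"},
    "Lease": {"Car", "Rate"},
    "Loan": {"Car", "Rate"},
}
# Assumed permitting rules: any car source may be joined with any
# financing option, except where a forbidding rule says otherwise.
allow = [(frozenset({u}), i)
         for u in ("Dealers", "Classifieds") for i in ("Lease", "Loan")]
forbid = [(frozenset({"Classifieds"}), "Lease")]

objects = query_objects({"Price", "Rate"}, schemas, allow, forbid)
for obj in objects:
    print(sorted(obj))
```

The enumeration is exponential in the number of relations, so it only illustrates the definition. The answer to the query would be the union of the joins computed for each returned set.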
UsedCarUR(Car, Price, BBPrice, Rate, Contact, Cost)
  UsedCar(Car, Price, Contact): Dealers, Classifieds
  Interest(Car, Rate): Lease, Loan
  Insurance(Car, Cost): Full Coverage, Liability
  BlueBook(Car, BBPrice): Retail Value, TradeInValue

Figure 5: Concept Hierarchy for the Used Cars UR

Example 6.1 (Concept Hierarchy) The concept hierarchy describes the following: (1) A used car is either advertised at a dealer site or it is in the classified section of a newspaper site; (2) The blue book price of a car can either be its trade-in price or its selling price; (3) The interest rate for a used car depends on whether it will be financed or leased; and (4) the insurance rate depends on whether it provides full or liability coverage.

The idea behind concept hierarchies is that the user starts by selecting top-level concepts and then proceeds to subconcepts. This makes it possible to build queries incrementally, by restricting the search to various subconcepts and to specific ranges for attributes at the leaf level. The unique name assumption is not an issue here for the user or the system since both can see the entire concept hierarchy to which the attributes belong, and the relationships among concepts and attributes are defined by the compatibility rules. We believe that webbases will be designed for application domains (such as cars, jobs, houses) by the experts in those domains, and designing concept hierarchies and compatibility constraints is a feasible task for them. We illustrate these ideas with an example, leaving out the details due to space limitations.

[Footnote: Compatible means that for every relation in the chosen set there is a rule of the first form whose right-hand side is that relation and whose left-hand side consists of relations in the set; and there is no rule of the second form whose left-hand side and right-hand side all belong to the set.]

Example 6.2 (Structured UR in Action) The following compatibility constraints specify the meaningful connections for the UsedCarUR of Example 6.1.

Compatibility Constraints        Semantics
[illegible in the source]        We cannot lease a car from its owner
[illegible in the source]        Leased cars have to be fully insured
[illegible in the source]        Trade-in values are not applicable

Consider the following query: make a list of used Jaguars advertised in New York City area sites such that each car's monthly payments are less than 1,000 dollars, and its selling price is less than its Blue Book price. This query can be expressed as:

[query expression illegible in the source]

Using compatibility constraints, our algorithm generates the following maximal objects and the corresponding relational expressions:

[maximal objects and relational expressions illegible in the source]

Now, assuming the existence of a mapping function from external schema relations to the logical level, maximal objects made up from the UR relations can be translated into conjunctive queries over logical level relations. Once translated, these queries can be optimized and evaluated by standard query evaluation techniques.

7 Implementation and Experiences

We have implemented the most essential components of two of the modules in our webbase architecture: the navigation map builder and the query evaluator. In what follows we describe the ideas underlying our implementation.

Navigation Map Builder. We use the methodology of mapping by example to extract the navigation maps from Web sites. The main idea behind mapping by example is to discover the structure (or schema) of a site while the
webbase designer moves from page to page, filling forms and following links. There are two key components to this methodology: (1) discovery of access paths to the data of interest; and (2) extraction of action objects (see Figure 2). In order to build a practical tool, there are two important requirements: the mapping process should be as transparent as possible to the webbase designer (its operation should closely mimic the browsing experience); and the mapping tool must be portable (e.g., it should not require modifications to the browser).

The navigation map builder achieves these goals by using JavaScript events to capture browsing actions. Actions are dynamically intercepted by JavaScript handlers (inserted into the retrieved pages by the map builder), and are added as edges of the navigation map. When a new page is loaded into the browser, it is parsed, and a new node corresponding to the page is inserted into the navigation map. In order to guide the designer, an applet displays a graphical representation of the navigation map as it is being constructed, highlighting in the map the node corresponding to the page displayed in the browser.

The map builder parses an HTML page and generates a set of F-logic objects (as detailed in Section 4). It extends PiLLoW, a publicly available Prolog-based system, to extract all necessary information for following links and submitting forms found inside the page. Since not all information is stored in the HTML object structure (e.g., labels denoting the domain values of some attributes and attributes defined through a set of links) we take advantage of HTML tags and anchors and other structuring primitives (e.g., tables, enumeration) to extract such information.
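The bookkeeping described above (pages become nodes, intercepted actions become edges) can be pictured with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class `NavigationMap`, its method names, and the page identifiers are hypothetical, and the real tool works with JavaScript handlers and F-logic objects rather than Python sets.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

Edge = Tuple[str, str, str, str]   # (source page, action kind, label, target page)


@dataclass
class NavigationMap:
    nodes: Set[str] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)
    current: Optional[str] = None                 # page shown in the browser
    pending: Optional[Tuple[str, str]] = None     # last intercepted action

    def on_action(self, kind: str, label: str) -> None:
        """Called when a handler intercepts a link click or form submission."""
        self.pending = (kind, label)

    def on_page_load(self, page: str) -> None:
        """Called when a page is loaded: add its node and the edge that
        led to it. Sets make repeated visits idempotent, so a map can be
        built incrementally over several browsing sessions."""
        self.nodes.add(page)
        if self.current is not None and self.pending is not None:
            kind, label = self.pending
            self.edges.add((self.current, kind, label, page))
        self.current, self.pending = page, None


# Hypothetical browsing session, repeated twice.
nav_map = NavigationMap()
for _ in range(2):
    nav_map.current = None
    nav_map.on_page_load("home")
    nav_map.on_action("link", "Auto")
    nav_map.on_page_load("car_form")
    nav_map.on_action("form", "search")
    nav_map.on_page_load("car_results")

print(len(nav_map.nodes), len(nav_map.edges))
```

Repeating the same session adds nothing new, so the map after the second pass is identical to the map after the first.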
For forms, the extractor is also able to infer which attributes are mandatory from their widget (i.e., if an attribute is represented by a radio button we can safely assume it is mandatory), as well as other information such as the domain of attributes (e.g., from the values of a selection list), maximum length (e.g., for a text field), and default value, to name a few. For data pages, as described in Figure 3, we assume that the designer provides an extraction script.

Of course, there are instances where input from the designer is needed. For instance, the designer has to indicate whether a text field is mandatory. Also, it is not uncommon in forms for attributes to have rather cryptic symbolic names; in these cases (to facilitate subsequent querying) the user might want to provide a more informative name. There are also instances where attributes are implicitly defined through a set of links (e.g., a list of links with car models). Since attributes of this kind are not part of a form, the designer has to specify a name as well as the set of links that relate to this attribute. It is worth pointing out that in many instances our parser is able to find these links by considering their HTML environment (e.g., a table), or the user can provide additional hints. We are currently building a graphical user interface to simplify the input of such information by the designer.

[Footnote: Since building maps is an incremental process, our tool checks whether actions and Web page objects are new before adding them to a map.]
[Footnote: http://www.clip.dia.fi.upm.es/software/pillow/pillow.html]

We used our initial prototype of the map builder to map various sites. To give an idea of the degree of automation achieved, for the Newsday site depicted in Figure 2, all objects that describe the navigation map (85 objects with over 600 attributes in total) were automatically extracted. Less than 5% of the information in the map was added manually, which consisted of 10 to 12 facts to standardize attribute and domain value names.
For other sites such as New York Times and Daily News, the ratio was similar. The process of mapping each of these sites took on average 30 minutes. It is worth pointing out that the main problem we face while mapping sites is the presence of faulty HTML, in which case the parser needs to be able to recover from the ill-formed documents.

Some points are worthy of note with respect to the maintenance of such maps. Modifications to Web sites can be automatically detected by periodically comparing the navigation map against its corresponding site, or when the corresponding navigation process fails. Whereas certain structural changes such as the addition of a new form attribute require manual intervention, others can be applied automatically (e.g., the addition of a cell in a selection list). Since we first built navigation maps for car-related sites, we have noticed quite a few changes to these sites. For example, in Kelly's Blue Book (www.kbb.com) new links with information about 1999 cars have been added. In order to update the navigation map, we only had to navigate through the modified pages, a process that took a few minutes.

Query Evaluator. As described in Section 4, once a map is built, navigation expressions are automatically generated. This process requires a simple traversal of the navigation map, and thus can be done in time linear in the size of the map. Individually, each expression can be seen as a shortcut to retrieve data from a Web site. Instead of filling forms and following links, one can simply specify a set of attributes and execute the appropriate navigation expression (e.g., for the query SELECT make,model,year,price,contact WHERE make=ford AND model=escort, execute newsday(ford,escort,year,price,contact), described in Figure 4). It is worth pointing out that, as a byproduct, the process of retrieving such data is made faster, since during the execution of a navigation process no extraneous objects such as figures and Java animations are retrieved.
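The step from a simple attribute query to a call of a navigation expression can be sketched as follows. This is a hedged illustration in Python, not the authors' query evaluator: the function `navigation_goal` and its parameters are hypothetical, and unbound attributes are rendered as capitalized names following the Prolog convention for variables, which is an assumption about how the goal would be passed to the interpreter.

```python
from typing import Dict, Iterable, List


def navigation_goal(relation: str,
                    attributes: List[str],
                    mandatory: Iterable[str],
                    conditions: Dict[str, str]) -> str:
    """Build the goal that invokes a navigation expression.

    attributes: the relation's attributes, in positional order
    mandatory:  attributes that must be bound (e.g., required form fields)
    conditions: equality conditions taken from the WHERE clause
    """
    missing = [a for a in mandatory if a not in conditions]
    if missing:
        # A VPS relation can only be accessed if its mandatory
        # attributes are supplied.
        raise ValueError(f"unbound mandatory attributes: {missing}")
    args = [conditions[a] if a in conditions else a.capitalize()
            for a in attributes]
    return f"{relation}({','.join(args)})"


# SELECT make,model,year,price,contact WHERE make=ford AND model=escort
goal = navigation_goal(
    "newsday",
    ["make", "model", "year", "price", "contact"],
    mandatory={"make"},
    conditions={"make": "ford", "model": "escort"},
)
print(goal)
```

A query that leaves a mandatory attribute unbound is rejected before any page is fetched, which is the situation the binding propagation of Section 5 is meant to rule out statically.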
Navigation expressions are processed by the Transaction F-logic interpreter, which translates them into logic programs that are executed by a deductive engine, the XSB system. On top of XSB, we use the HTTP library provided by PiLLoW to follow links, submit forms and retrieve documents from the Web.

[Footnote: http://www.sunysb.edu/~sbprolog]

In order to combine information from different sites (or maps), the attribute names and their domains must be standardized. In our current implementation, one must manually specify these mappings. If a mapping is not provided for a certain attribute name, we employ fuzzy matching techniques, which evidently are not fool-proof and may lead to errors. We intend to incorporate techniques from mediator systems such as [10, 30] to address this problem.

We have built navigation maps for a number of sites. To give an idea of the complexity of the sites and query execution times, below we show the number of pages navigated and (some of the best) evaluation times for the query SELECT make,model,year,price WHERE make=ford AND model=escort over 10 car-related sites. These timings indicate that to ensure acceptable response times when querying a large number of sites, we may need to use techniques such as parallelization and caching. It is worth pointing out that a significant portion of the time in querying is spent not only in fetching, but also in parsing the Web pages. We believe these times can be greatly improved if a faster parser is used.

Site | # of pages | cpu time | elapsed time
[the entries for the 10 sites are illegible in the source]

8 Related Work

The problem of retrieving data from and querying Web sources has received considerable attention in the database literature (see [8] for a survey). Managing information on the Web encompasses several tasks that include locating interesting data, modeling Web sites, and extracting and integrating related information from multiple sites.

Web query languages such as W3QL [16], WebSQL [3], WebLog [17], and Florid [9] address the problem of finding and retrieving data from the Web. They improve on search engines by combining textual retrieval with structure and topology-based queries.
These languages view the Web as a collection of unstructured documents organized as a graph, and users can declaratively express how to navigate portions of the Web to find documents with certain features. Conceptually, these languages are equivalent to various subsets of our navigation calculus. More importantly, however, these are fairly sophisticated query interfaces designed to be used by a fairly sophisticated user. In contrast, even though our navigation calculus requires an even greater degree of programming expertise, it is not designed to be used by a programmer. Instead, navigation expressions are generated automatically from the map.

[Footnote: The times were collected on a Sun Ultra workstation, with dual 330 MHz processors, 1 GB of memory, and the Solaris 5.6 operating system.]

Web information integration systems [20, 5, 2, 10] are more closely related to our work in that they try to present the Web through a unified database interface. The Araneus project [5] provides a rich model (ADM) to describe both the topology and the contents of Web sites. Their concept of the ADM scheme is analogous in many respects to our navigation maps. Navigation processes to populate database views are expressed in a newly developed declarative algebra, called Ulixes. Ulixes is intended to be used by a database designer to create Web views for the end user. In contrast to this, we use a well-known, existing formalism (Transaction F-logic [12]), which functionally is a superset of Ulixes. However, as mentioned earlier, it is not intended to be used by a designer or an end user. Instead, navigation expressions that use this language are generated automatically from the navigation map. The interpreter then simply executes these expressions when user queries need to be evaluated. This is possible in our architecture due to the clear separation between the VPS and the logical layers of the database and also due to the use of our process-oriented object model.
It is worth pointing out that maintenance of navigation expressions in our approach is much simpler, since the navigation maps from which the processes are generated can be updated semi-automatically (through mapping by example).

Ariadne [15] is a system for extracting and integrating data from semi-structured Web sources. Ariadne has two foci: data extraction from unstructured Web pages and what in our architecture amounts to mapping from the logical layer to the virtual physical layer. Both of these issues are orthogonal to our work. For instance, Ariadne's data extraction facilities as well as the body of techniques for extracting information from semi-structured data [31] could be used in our system.

From the perspective of our architecture, the focus of the work in the Information Manifold (IM) [20, 19] project can be viewed as mapping the logical layer to VPS. IM approaches the problem by first specifying the reverse (physical-to-logical) mapping, which they call source description. The required logical-to-physical mapping is then generated automatically. The benefit of this indirect approach is claimed to be the ease of maintenance of the logical-to-physical mapping in view of adding or deleting the Web sources. In this way, IM is complementary to our work, since our focus is on building the VPS and the conceptual layers of the webbase.

There is a large body of work on information mediators, such as TSIMMIS [10], Hermes [1] and Garlic [30], which help smooth the semantic and syntactic differences between heterogeneous information sources. Techniques developed for information integration systems such as these can be used in our architecture for semantic integration of VPS relations that come from different sources. On the other hand, these systems could use our VPS automation techniques to gain access to dynamic Web content.
Finally, we should note the growing commercial interest in integration of information from diverse Web sources (e.g., Junglee, Center Stage [11, 28]). Techniques described in this paper can facilitate rapid development of such services.

9 Conclusions and Future Directions

In this paper we described a layered architecture for designing webbases. The separation of layers, which is analogous to traditional databases, simplifies the creation, maintenance, and use of webbases for retrieving information available on the Web. We have implemented the main components of a prototype implementation of our architecture and reported on some preliminary experimental results. We have shown that navigation maps can be created semi-automatically as the webbase designer browses sites, and that navigation expressions can be automatically derived from these maps. These expressions are executed when evaluating a query, and thus optimizing such expressions is an important problem that needs to be studied. Our experiments suggest that parallelization of query evaluation is crucial for obtaining acceptable response times. Finally, while the idea of structured UR as a query interface seems to be promising in the context of webbases, more experimental work needs to be done to evaluate the practicality of the idea.

Acknowledgements: We would like to thank various people that contributed to this work: Vinod Anupam for suggestions on how to implement navigation by example; Daniel Lieuwen and C.R. Ramakrishnan, for carefully reading this manuscript; and Narain Gehani, David Warren and Guizhen Yang for valuable discussions.

References

[1] S. Adali, K. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query caching and optimization in distributed mediator systems. In Proc. of SIGMOD, pages 137-148, 1996.
[2] J.L. Ambite, N. Ashish, G. Barish, C.A. Knoblock, S. Minton, P.J. Modi, I. Muslea, A. Philpot, and S. Tejada. Ariadne: A system for constructing mediators for internet sources. In Proc. of SIGMOD, 1998.
[3] G. Arocena, A. Mendelzon, and G. Mihaila. Applications of a Web query language. In Proceedings of the 6th International WWW Conference, April 1997.
[4] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the web: Going back and forth. SIGMOD Record, 26(4):16-23, 1997.
[5] P. Atzeni, G. Mecca, and P. Merialdo. To weave the web. In Proc. of VLDB, pages 206-215, 1997.
[6] A.J. Bonner and M. Kifer. An overview of transaction logic. Theoretical Computer Science, 133:205-265, October 1994.
[7] O.M. Duschka and A.Y. Levy. Recursive plans for information gathering. In Proc. of IJCAI, 1997.
[8] D. Florescu, A.Y. Levy, and A.O. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59-74, 1998.
[9] J. Frohn, R. Himmeroeder, P.-Th. Kandzia, G. Lausen, and C. Schlepphorst. Florid - a prototype for F-logic. In 12th German Workshop on Logic Programming, 1997. http://www.informatik.uni-freiburg.de/~dbis/florid/.
[10] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J.D. Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8(2):117-132, 1997.
[11] Junglee Corporation. http://www.junglee.com.
[12] M. Kifer. Deductive and object-oriented data languages: A quest for integration. In Proc. of DOOD, pages 187-212, 1995.
[13] M. Kifer, W. Kim, and Y. Sagiv. Querying object-oriented databases. In Proc. of SIGMOD, pages 393-402, 1992.
[14] M. Kifer, G. Lausen, and J. Wu. Logical foundations of object-oriented and frame-based languages. Journal of ACM, 42:741-843, July 1995.
[15] C.A. Knoblock, S. Minton, J.L. Ambite, N. Ashish, P.J. Modi, I. Muslea, A.G. Philpot, and S. Tejada. Modeling web sources for information integration. In Proc. of AAAI, 1998.
[16] D. Konopnicki and O. Shmueli. W3QS: A query system for the World-Wide Web. In Proc. of VLDB, pages 54-65, 1995.
[17] L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian. A declarative language for querying and restructuring the WEB. In Workshop on Research Issues in Data Engineering, pages 12-21, 1996.
[18] S. Lawrence and C.L. Giles. Searching the world wide web. Science, 280(4):98-100, 1998.
[19] A.Y. Levy, A. Rajaraman, and J.J. Ordille. Query-answering algorithms for information agents. In Proc. of AAAI, pages 40-47, 1996.
[20] A.Y. Levy, A. Rajaraman, and J.J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of VLDB, pages 251-262, 1996.
[21] D. Maier. The Theory of Relational Databases. Computer Science Press, 1983.
[22] D. Maier, D. Rozenshtein, and D.S. Warren. Windows on the world. In Proc. of SIGMOD, pages 68-78, 1983.
[23] D. Maier and J.D. Ullman. Maximal objects and the semantics of universal relation databases. ACM TODS, 8(1):1-14, 1983.
[24] D. Maier, J.D. Ullman, and M.Y. Vardi. On the foundations of the universal relation model. ACM TODS, 9(2):283-308, 1984.
[25] G. Mecca, P. Atzeni, A. Masci, P. Merialdo, and G. Sindoni. The Araneus web-base management system. In Proc. of SIGMOD, pages 544-546, 1998.
[26] G. Mecca, P. Atzeni, A. Masci, P. Merialdo, and G. Sindoni. From databases to web-bases: The Araneus experience. Technical Report n. 34-1998, May 1998.
[27] A.O. Mendelzon, G.A. Mihaila, and T. Milo. Querying the World Wide Web. International Journal on Digital Libraries, 1(1):54-67, 1997.
[28] OnDisplay Corporation. http://www.ondisplay.com.
[29] A. Rajaraman, Y. Sagiv, and J.D. Ullman. Answering queries using templates with binding patterns. In Proc. of PODS, pages 105-112, 1995.
[30] M.T. Roth, M. Arya, L.M. Haas, M.J. Carey, W.F. Cody, R. Fagin, P.M. Schwarz, J. Thomas II, and E.L. Wimmers. The Garlic project. In Proc. of SIGMOD, page 557, 1996.
[31] D. Suciu, editor. Proc. of the SIGMOD Workshop on Management of Semistructured Data, 1997.