A Fragment-Based Approach for Efficiently Creating Dynamic Web Content


JIM CHALLENGER, PAUL DANTZIG, ARUN IYENGAR, and KAREN WITTING
IBM Research

This article presents a publishing system for efficiently creating dynamic Web content. Complex Web pages are constructed from simpler fragments. Fragments may recursively embed other fragments. Relationships between Web pages and fragments are represented by object dependence graphs. We present algorithms for efficiently detecting and updating Web pages affected after one or more fragments change. We also present algorithms for publishing sets of Web pages consistently; different algorithms are used depending upon the consistency requirements. Our publishing system provides an easy method for Web site designers to specify and modify inclusion relationships among Web pages and fragments. Users can update content on multiple Web pages by modifying a template. The system then automatically updates all Web pages affected by the change. Our system accommodates both content that must be proofread before publication, which typically comes from humans, and content that has to be published immediately, which typically comes from automated feeds. We discuss some of our experiences with real deployments of our system as well as its performance. We also quantitatively present characteristics of fragments used at a major deployment of our publishing system, including fragment sizes, update frequencies, and inclusion relationships.

Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General

General Terms: Design, Performance

Additional Key Words and Phrases: Caching, dynamic content, fragments, publishing, Web, Web performance

1. INTRODUCTION

Many Web sites need to provide dynamic content. Examples include sports sites, stock market sites, and virtual stores or auction sites where information on available products is constantly changing.

Based on: A Publishing System for Efficiently Creating Dynamic Web Content, by Jim Challenger, Arun Iyengar, Karen Witting, Cameron Ferstat, and Paul Reed, which appeared in Proceedings of INFOCOM 2000, © 2000 IEEE. Authors' address: IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598. © 2005 ACM. ACM Transactions on Internet Technology, Vol. 5, No. 2, May 2005.

There are several problems with providing dynamic data to clients efficiently and consistently. A key problem with dynamic data is that it can be expensive to create; a typical dynamic page may require several orders of magnitude more CPU time to serve than a typical static page of comparable size. The overhead for dynamic data is a major problem for Web sites that receive substantial request volumes. Significant hardware may be needed for such sites.

A key requirement for many Web sites providing dynamic data is to completely and consistently update pages that have changed. In other words, if a change to underlying data affects multiple pages, all such pages should be correctly updated. In addition, a bundle of several changed pages may have to be made visible to clients at the same time. For example, publishing pages in bundles instead of individually may prevent situations where a client views a first page, clicks on a hypertext link to view a second page, and sees information on the second page that is older and not consistent with the information on the first page. Depending upon the way in which dynamic data are being served, achieving complete and consistent updates can be difficult or inefficient.

Many Web sites cache dynamic data in memory or a file system in order to reduce the overhead of recalculating Web pages every time they are requested [Iyengar and Challenger 1997]. In these systems, it is often difficult to identify which cached pages are affected by a change to underlying data that modifies several dynamic Web pages. In making sure that all obsolete data are invalidated, deleting some current data from the cache may be unavoidable. Consequently, cache miss rates after an update may be high, adversely affecting performance. In addition, multiple cache invalidations from a single update must be made consistently.

This article presents a system for efficiently and consistently publishing dynamic Web content. In order to reduce the overhead of generating dynamic pages from scratch, our system composes dynamic pages from simpler entities known as fragments. Fragments typically represent parts of Web pages that change together; when a change to underlying data occurs that affects several Web pages, the fragments affected by the change can easily be identified. It is possible for a fragment to recursively embed another fragment.

Our system provides a user-friendly method for managing complex Web pages composed of fragments. Users specify how Web pages are composed from fragments by creating templates in a markup language. Templates are parsed to determine inclusion relationships among fragments and Web pages. These inclusion relationships are represented by a graph known as an object dependence graph (ODG). Graph traversal algorithms are applied to ODGs in order to determine how changes should be propagated throughout the Web site after one or more fragments change.

Our system allows multiple independent authors to provide content as well as multiple independent proofreaders to approve some pages for publication and reject others. Publication may proceed in multiple stages in which a set of pages must be approved in one stage before it is passed to the next. Our system can also include a link checker, which verifies that a Web page has no broken hypertext links at the time the page is published. The system is also scalable enough to handle high request rates.

In addition to describing the architecture of our fragment-based Web publishing system in detail, this article also presents the first empirical study that we are aware of covering a large-scale real deployment of a fragment-based Web publishing system. We present object size distributions, update frequency distributions, and characteristics of the fragment inclusion relationships. These statistics can be used to architect efficient Web publishing systems. For example, we have used fragment size distributions to optimize the manner in which fragments are stored on disk [Iyengar et al. 2001].

The remainder of the article is organized as follows. Section 2 discusses our methodology for constructing Web pages from fragments. Section 3 describes the architecture of our system in detail. Section 4 describes the performance of our system and presents statistics obtained from deploying our system at real Web sites. Section 5 discusses related work. Finally, Section 6 summarizes our main results and conclusions.

2. CONSTRUCTING WEB PAGES FROM FRAGMENTS

2.1 Overview

A key feature of our system is that it composes complex Web pages from simpler fragments (Figure 9). A page is a complete entity that may be served to a client. We say that a fragment or page is atomic if it doesn't include any other fragments and complex if it includes other fragments. An object is either a page or a fragment.

Our approach is efficient because the overhead for composing an object from simpler fragments is usually minor. By contrast, the overhead for constructing the object from scratch as an atomic fragment is generally much higher. Using the fragment approach, it is possible to achieve significant performance improvements without caching dynamic pages and dealing with the difficulties of keeping caches consistent. For optimal performance, our system has the ability to cache dynamic pages. Caching capabilities are integrated with fragment management.

The fragment-based approach for generating Web pages makes it easier to design Web sites in addition to improving performance. It is easy to design a set of Web pages with a common look and feel. It is also easy to embed common information into several Web pages. Sets of Web pages containing similar information can be managed together. For example, it is easy to update common information represented by a single fragment but embedded within multiple pages; in order to update the common information everywhere, only the fragment needs to be changed. By contrast, if the Web pages are stored statically in a file system, identifying and updating all pages affected by a change can be difficult. Once all changed pages have been identified, care must be taken to update all of them in a way that preserves consistency.

Dynamic Web pages that embed fragments are implicitly updated any time an embedded fragment changes, so consistency is automatically achieved. Consistency becomes an issue with the fragment-based approach when the pages are being published to a cache or file system.

Our system provides several different methods for consistently publishing Web pages in these situations; each method provides a different level of consistency.

Fragments also provide a mechanism by which remote caches can store some parts of dynamic and personalized pages. The remote cache stores the static parts of a page. When a page is requested, the cache requests the dynamic or personalized fragments of the page from the server. Typically, these constitute only a small fraction of the page. The cache then composes and serves the composite page.

HTML contains an OBJECT tag, which allows an arbitrary program object to be embedded in an HTML page. While the OBJECT tag has the potential to achieve fragment composition at the client, a major drawback is that major browsers don't support the tag properly. Therefore, we couldn't rely on the OBJECT tag for our Web sites.

One of the key things our publishing system enables is separation of the creative process from the mechanical process of building a Web site. Previously, the content, look, and feel of large sites we were involved with had to be carefully planned well in advance of the creation of the first page. Changes to the original plans were quite difficult to execute, even in the best of circumstances. Last-minute changes tended to be impossible, resulting in a choice between delayed or flawed site publication. With our publishing system, the entire look and feel of a site can be changed and republished within minutes. Aside from the cost savings, this has allowed tremendous creativity on the part of designers. Entire site designs can be created, experimented with, changed, discarded, and replaced several times a day during the construction of the site. This can take place in parallel with and independently of the creation of site content.

A specific example of this was demonstrated just before a new site look for the 2000 Sydney Olympic Games Web site was made public. One day before the site was to go live to the public, it was decided that the search facility was not working sufficiently well and had to be removed. This change affected thousands of pages and would previously have delayed publication of the site by as much as several days. Using our system, the site authors simply removed the search button from the appropriate fragment and republished the fragment. Ten minutes later, the change was complete, every page had been rebuilt, and the site went live on schedule.

2.2 Object Dependence Graphs

When pages are constructed from fragments, it is important to construct a fragment f1 before any object containing f1 is constructed. In order to construct objects in an efficient order, our system represents relationships between fragments and Web pages by graphs known as object dependence graphs (ODGs) (Figures 1 and 2). Object dependence graphs may have several different edge types. An inclusion edge indicates that an object embeds a fragment. A link edge indicates that an object contains a hypertext link to another object.

Fig. 1. A set of Web pages containing fragments.

Fig. 2. The object dependence graph (ODG) corresponding to Figure 1.

In the ODG in Figure 2, all but one of the edges are inclusion edges. For example, the edge from f4 to P1 indicates that P1 contains f4; thus, when f4 changes, f4 should be updated before P1 is updated. The graph resulting from only inclusion edges is a directed acyclic graph. The edge from P3 to P2 is a link edge, which indicates that P2 contains a hypertext link to P3. A key reason for maintaining link edges is to prevent dangling or inconsistent hypertext links. In this example, the link edge from P3 to P2 indicates that publishing P2 before P3 will result in a broken hypertext link. Similarly, when both P2 and P3 change, publishing a current version of P2 before publishing a current version of P3 could present inconsistent information to clients who view an updated version of P2, click on the hypertext link to an outdated version of P3, and then see information that is obsolete relative to the referring page. Link edges can form cycles within an ODG. This would occur, for example, if two pages both contain hypertext links to each other.

There are two methods for creating and modifying ODGs. Using one approach, users specify how Web pages are composed from fragments by creating templates in a markup language. Templates are parsed to determine inclusion relationships among fragments and Web pages. Using the second approach, a program may directly manipulate edges and vertices of an ODG by using an API. We have deployed systems using both approaches. The first approach is easier for Web page designers to use but incurs higher overheads.

Our system allows an arbitrary number of edge types to exist in ODGs. So far, we have only found practical use for inclusion and link edges.
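To make the ODG structure concrete, the following is a minimal sketch of how such a graph might be represented in Java. The class and method names are illustrative only; they are not the actual API of our system.

    import java.util.*;

    // Minimal sketch of an object dependence graph (ODG). An inclusion
    // edge (f, p) means that object p embeds fragment f; a link edge
    // (q, p) means that p contains a hypertext link to q.
    public class ObjectDependenceGraph {
        public enum EdgeType { INCLUSION, LINK }

        // adjacency lists keyed by object name
        private final Map<String, Map<String, EdgeType>> out = new HashMap<>();

        public void addVertex(String obj) {
            out.putIfAbsent(obj, new HashMap<>());
        }

        public void addEdge(String from, String to, EdgeType type) {
            addVertex(from);
            addVertex(to);
            out.get(from).put(to, type);
        }

        // objects reached from obj over edges of the given type, e.g.,
        // everything that directly embeds a given fragment
        public Set<String> successors(String obj, EdgeType type) {
            Set<String> result = new HashSet<>();
            for (Map.Entry<String, EdgeType> e :
                    out.getOrDefault(obj, Collections.<String, EdgeType>emptyMap()).entrySet()) {
                if (e.getValue() == type) result.add(e.getKey());
            }
            return result;
        }
    }

For the graph in Figure 2, the edge from f4 to P1 would be added as addEdge("f4", "P1", EdgeType.INCLUSION), and the link edge from P3 to P2 as addEdge("P3", "P2", EdgeType.LINK).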

We suspect that there may be other types of important relationships that can be represented by other edge types.

When our system becomes aware of changes to a set S of one or more objects, it performs a depth-first graph traversal using topological sort [Cormen et al. 1990] to determine all vertices reachable from S by following inclusion edges. The topological sort orders vertices such that whenever there is an edge from a vertex v to another vertex u, v appears before u in the topological sort. For example, a valid topological sort of the graph in Figure 2 after P3, f4, and f2 change would be P3, f4, f2, f5, P2, f1, f3, and P1. This topological sort ignores link edges. Objects are updated in an order consistent with the topological sort.

Our system updates objects in parallel when possible. In the previous example, P3, f4, and f2 can be updated in parallel. After f2 is updated, f1 and f5 may be updated in parallel. A number of other objects may be constructed in parallel in a manner consistent with the inclusion edges of the ODG.

After a set of pages, U, has been updated (or generated for the first time), the pages in U are published so that they can be viewed by clients. In some cases, the pages are published to file systems. In other cases, they are published to caches. Pages may be published either locally on the system generating them or to a remote system.

It is often a requirement for a set of multiple pages to be published consistently. Consistency can be guaranteed by publishing all changed (or newly generated) pages in a single atomic action. One potential drawback to this method of publication is that the publication process may be relatively long. For example, pages may have to be proofread before publication. If everything is published together in a single atomic action, there can be considerable delay before any information is made available. Therefore, incremental publication, wherein information is published in stages instead of all together, is often desirable. The disadvantage of incremental publication is that the consistency guarantees are not as strong. Our system provides three different methods for incremental publication, each providing a different level of consistency.

The first incremental publishing method guarantees that a freshly published page will not contain a hypertext link to either an obsolete or unpublished page. This consistency guarantee applies to pages reached by following several hypertext links. More specifically, if P1 and P2 are two pages in U, and a client views an updated version of P1 and follows one or more hypertext links to view P2, then the client is guaranteed to see a version of P2 that is not obsolete with respect to the version of P1 that the client viewed. (A version of P2 is obsolete with respect to a version of P1 if the version of P2 was outdated at the time the version of P1 became current, regardless of whether P1 or P2 have any fragments in common.) For example, consider the Web pages in Figure 3. A client can access P3 by starting at P1, following a hypertext link to P2, and then following a second hypertext link to P3. Suppose that both P1 and P3 change. The first incremental publishing method guarantees that the new version of P1 will not be published before the new version of P3, regardless of whether P2 has changed.
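Before turning to how these methods are implemented, here is a minimal sketch of the inclusion-edge traversal and topological ordering described above: a depth-first search from the changed set, emitting vertices in reverse postorder, yields an order in which every fragment precedes the objects that embed it. The class and method names are ours, not the system's.

    import java.util.*;

    // Sketch of the update-ordering step: from a set of changed objects,
    // find everything reachable over inclusion edges and order it so that
    // each fragment is updated before any object that embeds it.
    public class UpdateOrder {
        // embedders maps each object to the objects that directly embed it
        public static List<String> order(Map<String, List<String>> embedders,
                                         Set<String> changed) {
            List<String> postorder = new ArrayList<>();
            Set<String> visited = new HashSet<>();
            for (String obj : changed) dfs(obj, embedders, visited, postorder);
            Collections.reverse(postorder); // reverse postorder = topological order
            return postorder;
        }

        private static void dfs(String v, Map<String, List<String>> g,
                                Set<String> visited, List<String> out) {
            if (!visited.add(v)) return;
            for (String u : g.getOrDefault(v, Collections.<String>emptyList()))
                dfs(u, g, visited, out);
            out.add(v);
        }
    }

For the change set {P3, f4, f2} of Figure 2, this produces an order like the topological sort given above, with f2 preceding f1 and f5, and every fragment preceding the pages that embed it; independent objects in the order can be constructed in parallel, as described above.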

Fig. 3. A set of Web pages connected by hypertext links.

Fig. 4. A set of Web pages containing common fragments.

This incremental publishing method is implemented by first determining the set R of all pages that can be reached by following hypertext links from a page in U. R includes all pages of U; it may also include previously published pages that haven't changed. R is determined by traversing link edges in reverse order starting from pages in U. Let K be the subgraph of the ODG consisting of all nodes in R and the link edges in the ODG connecting nodes in R. The strongly connected components of K are determined. A strongly connected component of a directed graph is a maximal subset of vertices S such that every vertex in S has a directed path to every other vertex in S. A good algorithm for finding strongly connected components in directed graphs is contained in Cormen et al. [1990]. A graph K_scc is constructed in which each vertex represents a different strongly connected component of K and there is an edge between two nodes (S, T) of K_scc if and only if there exist nodes s in S and t in T such that (s, t) is an edge in K. K_scc is then topologically sorted and examined in an order consistent with the topological sort to locate pages in U. Each time a page in U whose updated version hasn't been published yet is examined, the page is published together with all other pages in U belonging to the same strongly connected component. Each set of pages that is published together in an atomic action is known as a bundle.

The second incremental publishing method guarantees that any two pages in U that both contain a common changed fragment are published in the same bundle. For example, consider the Web pages in Figure 4. Suppose that both f1 and f2 change. Since P1 and P3 both embed f1, their updated versions must be published together. Since P2 and P3 both embed f2, their updated versions must also be published together. Thus, updated versions of all three Web pages must be published together. Note that updated versions of P1 and P2 must be published together, even though the two pages don't embed a common fragment.

In order to implement this approach, the set of all changed fragments contained within each changed object d1 is determined. We call this set the changed fragment set for d1 and denote it by C(d1). All changed objects are constructed in topological sorting order. When a changed object d1 is constructed, C(d1) is calculated as the union of {f2} and C(f2) for each changed fragment f2 such that an inclusion edge (f2, d1) exists in the ODG.
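The changed-fragment-set computation just described translates directly into code. The following sketch assumes objects arrive in the topological order produced earlier, so that C(f) is already known for every changed fragment f embedded in d; the names are illustrative, not the system's.

    import java.util.*;

    public class ChangedFragmentSets {
        // topologicalOrder: changed objects, fragments before embedders;
        // changedEmbedded: maps d to the changed fragments f with an
        // inclusion edge (f, d) in the ODG
        public static Map<String, Set<String>> compute(
                List<String> topologicalOrder,
                Map<String, Set<String>> changedEmbedded) {
            Map<String, Set<String>> C = new HashMap<>();
            for (String d : topologicalOrder) {
                Set<String> set = new HashSet<>();
                for (String f : changedEmbedded.getOrDefault(d,
                        Collections.<String>emptySet())) {
                    set.add(f);                                             // f itself
                    set.addAll(C.getOrDefault(f,
                            Collections.<String>emptySet()));               // plus C(f)
                }
                C.put(d, set);                                              // C(d)
            }
            return C;
        }
    }

For Figure 4 with f1 and f2 changed, this yields C(P1) = {f1}, C(P2) = {f2}, and C(P3) = {f1, f2}.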

Fig. 5. Another set of related Web pages.

After all changed fragment sets have been determined, an undirected bipartite graph D is constructed in which the vertices of D are the pages in U on one side and the changed fragments contained within pages of U on the other side. For each page P in U, an undirected edge (f, P) is created for each f in C(P). D is examined to determine its connected components (two vertices are part of the same connected component if and only if there is a path between the vertices in the graph). All pages in U belonging to the same connected component are published in the same bundle.
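A minimal sketch of this grouping, using union-find in place of an explicit bipartite graph (the connected components are the same either way); the class and names are ours.

    import java.util.*;

    // Groups changed pages into bundles: pages whose changed fragment
    // sets overlap, directly or transitively, end up in one bundle.
    public class FragmentBundles {
        private final Map<String, String> parent = new HashMap<>();

        private String find(String x) {
            parent.putIfAbsent(x, x);
            String p = parent.get(x);
            if (!p.equals(x)) {
                p = find(p);
                parent.put(x, p); // path compression
            }
            return p;
        }

        private void union(String a, String b) {
            parent.put(find(a), find(b));
        }

        // changedFragmentSets: page P -> C(P), its changed fragment set
        public Collection<List<String>> bundles(
                Map<String, Set<String>> changedFragmentSets) {
            for (Map.Entry<String, Set<String>> e : changedFragmentSets.entrySet())
                for (String f : e.getValue())
                    union(e.getKey(), "fragment:" + f); // one edge (f, P) of D

            Map<String, List<String>> byRoot = new HashMap<>();
            for (String page : changedFragmentSets.keySet())
                byRoot.computeIfAbsent(find(page), r -> new ArrayList<>()).add(page);
            return byRoot.values(); // each list is published as one bundle
        }
    }

Applied to the changed fragment sets computed above for Figure 4, all three pages fall into a single bundle, matching the discussion above; the "fragment:" prefix merely keeps fragment vertices from colliding with page names.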

The third incremental publishing method satisfies the consistency guarantees of both the first and second methods. In other words:

(1) A freshly published page will not contain a hypertext link to either an obsolete or unpublished page. More specifically, if P1 and P2 are two pages in U, and a client views an updated version of P1 and follows one or more hypertext links to view P2, then the client is guaranteed to see a version of P2 that is not obsolete with respect to the version of P1 that the client viewed.

(2) Any two changed pages that both contain a common changed fragment are published together.

This method generally results in publishing fewer but larger bundles than the first two approaches. For example, consider the Web pages in Figure 5. Suppose that both P1 and f1 change. Updated versions of P2 and P3 must be published together because they both embed f1. Since P1 contains a hypertext link to P3, the updated version of P1 cannot be published before the bundle containing updated versions of P2 and P3. If, instead, the first incremental publishing method were used to publish the Web pages in Figure 5, the updated version of P1 could not be published before the updated version of P3. However, the updated version of P2 would not have to be published in the same bundle as the updated version of P3. If the second incremental publishing method were used, updated versions of both P2 and P3 would have to be published together in the same bundle. However, publication of the updated version of P1 would be allowed to precede publication of the bundle containing updated versions of P2 and P3.

The third incremental publishing method is implemented by constructing K as in the first method and D as in the second. Additional edges are then added to K between pages in U to ensure that all Web pages belonging to the same connected component of D belong to the same strongly connected component of K. The same procedure as in the first method is then applied to K to publish pages in bundles.

Incremental publishing methods can be designed for other consistency requirements as well. For example, consider Figure 3. Suppose that both P1 and P3 change. It may be desirable to publish updated versions of P1 and P3 in the same bundle. This would avoid the following situation, which could occur using the first method. A client views an old version of P1. After following hypertext links, the client arrives at a new version of P3. The browser's cache is then used to go back to the old version of P1. The client reloads P1 in order to obtain a version consistent with P3 but still sees the old version, because the new version of P1 has not yet been published. It is straightforward to implement an incremental publishing method that would publish P1 and P3 in the same bundle using techniques similar to the ones just described.

2.3 Combined Content Pages

Many Web sites contain information that is fed from multiple sources. Some of the information, such as the latest scores from a sporting event, is generated automatically by a computer. Other information, such as news stories, is generated by humans. Both types of information are subject to change. A page containing both human-generated and computer-generated information is known as a combined content page.

A key problem with serving combined content pages is the different rates at which sources produce content. Computer-generated content tends to be produced at a relatively high rate, often as fast as the most sophisticated timing technology permits. Human-generated content is produced at a much lower rate. Thus, it is difficult for humans to keep pace with automated feeds. By the time an editor has finished with a page, the actual results on the page may have changed. If the editor takes time to update the page, the results may have changed yet again.

A requirement for many of the Web sites we have helped design is that computer-generated content should not be delayed by humans. Computer-generated results, such as the latest results from a sporting event, are often extremely important and should be published as soon as possible. If computer-generated results are combined with human-edited content using conventional Web publishing systems, publication of the computer-generated results can be significantly delayed. What is needed is a scheme to combine data feeds of differing speeds so that information arriving at high rates is not unnecessarily delayed.

In order to provide combined content pages, our system divides fragments into two categories. Immediate fragments contain vital information that should be published quickly with minimal proofreading; on the sports Web sites that use our system, the latest results in a sporting event would be published as immediate fragments. Quality controlled fragments don't have to be published as quickly as immediate fragments but have content that must be examined in order to determine whether the fragments are suitable to be published.
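As a sketch of how these two fragment categories might drive publication, consider the following; the types and method names are ours, invented for illustration, and the real system's behavior is as described in the surrounding text.

    import java.util.*;

    // Immediate fragments trigger publication right away with minimal
    // proofreading; quality controlled fragments wait for approval.
    public class UpdateRouter {
        public enum Category { IMMEDIATE, QUALITY_CONTROLLED }

        private final Deque<String> proofreadQueue = new ArrayDeque<>();

        public void onFragmentChanged(String fragment, Category category) {
            if (category == Category.IMMEDIATE) {
                publishAffectedPages(fragment);   // published quickly
            } else {
                proofreadQueue.add(fragment);     // held for human review
            }
        }

        public void onApproved(String fragment) {
            proofreadQueue.remove(fragment);
            publishAffectedPages(fragment);       // published after proofreading
        }

        private void publishAffectedPages(String fragment) {
            // in the real system: ODG traversal, assembly, and bundle publication
            System.out.println("publishing pages affected by " + fragment);
        }
    }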

Background stories on athletes are typically published as quality controlled fragments at the sports sites that use our system.

Combined content Web pages consist of a mixture of immediate and quality controlled fragments. When one or more immediate fragments change, the Web pages affected by the changes are updated and published with minimal proofreading. If both immediate and quality controlled fragments change, the system first performs the updates resulting from the immediate fragments and publishes the updated Web pages with only minor proofreading to ensure that the immediate fragments are properly placed in the Web pages; the immediate fragments themselves are not proofread. The system subsequently performs the updates resulting from the quality controlled fragments and only publishes these updated Web pages after they have been proofread.

Multiple versions of a combined content page may be published using this approach. The first version would be the page before any updates. The second version might contain updates to all immediate fragments but not to any quality controlled fragments. The third version might contain updates to all fragments. It is possible for an update to an immediate fragment f1 to be published before an update to a quality controlled fragment f2 even though f2 changed before f1. This might occur if the changes to f2 are delayed in publication due to proofreading.

3. SYSTEM ARCHITECTURE

Web pages produced by our system typically consist of multiple fragments. Each fragment may originate from a different source and may be produced at a different rate than other fragments. Fragments may be nested, permitting the construction of complex and sophisticated pages. Completed pages are written to sinks, which may be file systems, caches, or even other HTTP servers. The Trigger Monitor is the software that takes objects from one or more sources, constructs pages, and writes the constructed pages to one or more sinks (Figure 6). Relationships among fragments are maintained in a persistent ODG, which preserves state information in the event of a system crash.

Whenever the Trigger Monitor is notified of a modification or addition of one or more objects, it fetches new copies of the changed objects from the appropriate source. The ODG is updated by parsing the new and changed objects. The graph traversal algorithms described in Section 2.2 are then applied to the latest version of the ODG to determine all Web pages that need to be updated and an efficient order for updating them. Finally, bundles of published pages are written to the sinks.

Since the Trigger Monitor is aware of all fragments and pages, synchronization is possible to prevent corruption of the pages. The ODG is used as the synchronization object to keep the fragment space consistent. Many trigger handlers, each with their own sources and sinks, may be configured to use a common ODG. This design permits, for example, a slow-moving, carefully edited, human-generated set of pages and fragments to be integrated with a high-speed, automated, database-driven content source.

Fig. 6. Schematic of the publish process.

Because the ODG is aware of the entire fragment space and the interrelationship of the objects within that space, synchronization points can be chosen to ensure that multiple, differently-sourced, differently-paced content streams remain consistent.

Multiple Trigger Monitor instances may be chained, the sinks of earlier instances becoming the sources for later ones. This allows publication to take place in multiple stages. We have typically used the following stages in real deployments (Figure 7):

- Development is the first step in the process. Fragments that appear on many Web pages (such as generic headers and footers) as well as the overall site design are produced here. The output of development may be structurally complete but lacking in content.

- Staging takes as its input, or source, the output, or sink, of Development. Editors polish pages and combine content from various sources. Finished pages are the result.

- Quality Assurance takes as its source the sink of Staging. Pages are examined here for correctness and appropriateness. The quality assurance people examine whole pages, which include content from the automated results feed. In determining whether to reject a page, however, the quality assurance people do not scrutinize fragments corresponding to automated results.

- Automated Results are produced when a database trigger is generated as the result of an update. The trigger causes programs to be executed that extract current results and compose relevant updated pages and fragments. Unlike the previous stages, no human intervention occurs in this stage.

Fig. 7. Schematic of the publish process.

- Production is where pages are served from. Its source is the sink of Quality Assurance, and its sinks are the serving directories and caches.

Note how one stage can use the sink of another stage as its source. The automated feed updates sources at the same time as, but independently of, the human-driven stages. This achieves the dual goals of keeping the entire site consistent while immediately publishing content from automated feeds. Stages can be added and deleted easily. Data sources can be added and deleted with little or no disruption to the flow.

3.1 Actual Deployments

We now describe how our publishing system is typically deployed. Approximately three dozen different object types are produced by various data sources. These objects include entities such as images, PDF files, style sheets, movies, and HTML fragments. Objects are categorized into four primary classes based on how they participate in page assembly, and two secondary classes based on whether they are distributed as servable pages. Table I describes the four primary object types.

Table I. The Four Primary Object Types

                                      Embeds Other Fragments   Doesn't Embed Other Fragments
    Embedded in other fragment        Intermediate             Leaf
    Not embedded in other fragment    Top-Level                Binary

Table II. Secondary Object Types

    Input to Assembly   Generated by Assembly
    Intermediate        Partial
    Top-Level           Servable HTML

Binary objects such as images, sound clips, and movies are embedded in pages by virtue of HTML tags and therefore do not affect the page assembly process. The publishing system passes such objects directly from their source location (e.g., authors, web-cams) to the sink, which distributes them to the origin server's serving directory.

The page assembly process produces two secondary classes of objects. A top-level object is transformed into a final, servable HTML page after page assembly. This page is sent to the sink for distribution to the servers. An intermediate object is transformed into a partially assembled object by embedding those objects it refers to. The partials are written to a persistent cache for potential reuse in subsequent page assemblies. Table II summarizes the two secondary object types.

Objects sent to the publishing system generally go through four steps: (1) read the object from its source, (2) update the object dependence graph, (3) assemble the pages affected by the object, and (4) save the input object and assembled objects to a persistent cache and to the sink for distribution.

Figure 8 shows the flow of data through the publishing system for each type of object. The two repositories, source and assembled, are disk-based caches. The source repository caches new objects for potential reuse in subsequent publish operations. The assembled repository caches partial objects for potential reuse. Finally, servable pages are written to sinks for distribution to the servers.

The act of publishing a page consists of the authoring system sending a message to the publishing system containing the names of the objects that have changed and that have been validated as ready for distribution. The first step taken by the publishing system is to fetch each object from the authoring system. If the object is a fragment of any sort, that is, a leaf, intermediate, or top-level object, it is then saved into the source repository. Binary objects (which do not participate in assembly) do not need to be cached. The fragments are then analyzed for dependencies, and the ODG is updated to reflect the new state of the system. After ensuring that the dependencies in the ODG are consistent with the new objects received, the ODG update task issues a request that returns a list of objects needing reassembly.
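The following sketch strings these four steps together. Every interface here is a placeholder we invented for illustration; none corresponds to the system's actual classes.

    import java.util.*;

    // Placeholder collaborators for the four publish steps.
    interface Source { byte[] fetch(String name); }
    interface Odg { List<String> updateAndAnalyze(List<String> changed); }
    interface Assembler { byte[] assemble(String name); boolean isPartial(String name); }
    interface Repository { void put(String name, byte[] data); }
    interface Sink { void write(String name, byte[] data); }

    public class PublishPipeline {
        private final Source source;
        private final Odg odg;
        private final Assembler assembler;
        private final Repository sourceRepo;
        private final Repository assembledRepo;
        private final Sink sink;

        public PublishPipeline(Source source, Odg odg, Assembler assembler,
                               Repository sourceRepo, Repository assembledRepo,
                               Sink sink) {
            this.source = source; this.odg = odg; this.assembler = assembler;
            this.sourceRepo = sourceRepo; this.assembledRepo = assembledRepo;
            this.sink = sink;
        }

        public void publish(List<String> changedNames) {
            // 1. read each changed object and cache its source
            // (the real system skips the source repository for binary objects)
            for (String name : changedNames)
                sourceRepo.put(name, source.fetch(name));
            // 2. update the ODG and get the objects needing reassembly
            List<String> toRebuild = odg.updateAndAnalyze(changedNames);
            for (String name : toRebuild) {
                // 3. assemble each affected object
                byte[] assembled = assembler.assemble(name);
                // 4. partials go to the assembled repository; servable
                //    pages go to the sinks for distribution
                if (assembler.isPartial(name)) assembledRepo.put(name, assembled);
                else sink.write(name, assembled);
            }
        }
    }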

Fig. 8. Dataflow through the publishing system.

Page assembly now occurs. The page assembler pulls needed objects from the cache. If a partial object is being embedded in an object requiring reassembly, then the page assembler pulls the embedded object from the assembled repository (rather than the source repository). This extra cache eliminates the need to reassemble embedded partial objects, which might be costly.

Finally, the output of the assembly process is written to the appropriate place. If the output of assembly is a partial object, it is written to the assembled repository for possible use in a future assembly. If the output of assembly is a servable page, then it is written to the sinks for distribution to the content servers. Triggered binary objects are also written to sinks.

There are three major disk-based caches used in the publishing process:

- The source repository. All page fragments are placed into this repository upon entering the publishing system and retrieved during page assembly. The publish message plays the role of a cache-invalidation message for the source repository.

- The assembled repository. During page assembly, intermediate fragments (Table II) are built into partial fragments. These partial fragments are saved for reuse during page assembly. Objects in this cache are invalidated when ODG analysis determines that at least one of the fragments making up the partial fragment has changed.

- The object dependence graph (ODG). This is, in effect, a cache, because all of the information in it also resides in the database containing the fragments (which is, however, prohibitively expensive to search).

All three of these caches can be rebuilt in the event of a failure by republishing all the fragments in the authoring system's database. It is critical that these caches be persistent: the time to rebuild all three caches after a total failure can be several hours for a major Web site. The caches are implemented as disk-backed hash tables [Iyengar et al. 2001]. The cache interfaces are implemented as Java(2) Map interfaces, compatible with the Java(2) HashMap, making it trivial to exchange a memory-based hash table with a disk-backed hash table.
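As a toy illustration of that interchangeability, the sketch below persists an entire HashMap to a file on every update. The real system uses proper disk-backed hash tables [Iyengar et al. 2001]; this naive whole-file serialization stands in only to show the Map-compatible interface.

    import java.io.*;
    import java.util.*;

    // Toy disk-backed Map, interchangeable with HashMap<String, byte[]>
    // wherever a Map is expected. Not the system's actual implementation.
    public class DiskBackedMap extends AbstractMap<String, byte[]> {
        private final File file;
        private final HashMap<String, byte[]> mem;

        @SuppressWarnings("unchecked")
        public DiskBackedMap(File file) throws IOException, ClassNotFoundException {
            this.file = file;
            if (file.exists()) {
                try (ObjectInputStream in =
                         new ObjectInputStream(new FileInputStream(file))) {
                    mem = (HashMap<String, byte[]>) in.readObject(); // warm start
                }
            } else {
                mem = new HashMap<>();
            }
        }

        @Override public byte[] put(String key, byte[] value) {
            byte[] old = mem.put(key, value);
            flush();                            // persist every update
            return old;
        }

        @Override public Set<Entry<String, byte[]>> entrySet() {
            return mem.entrySet();
        }

        private void flush() {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(mem);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

Because each cache lives in a single file, copying that file is enough to replicate the publishing system's state elsewhere, a property exploited below.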

We exploit the disk-based caches in several ways:

- Rapid start-up after shutdown or failure. The time required to restart the entire publishing system is about three orders of magnitude shorter with a warm cache than without one.

- Space consumed by the publishing system can easily overflow main memory. The disk-based caches provide sufficient storage for situations where main memory would be insufficient.

- Replication of the publishing system. It is often desirable to have several instances of the publishing system installed for purposes such as development, test, recovery, and so on. Because each cache is implemented as a single file, it is easy to replicate the state of the publishing system for use elsewhere.

3.2 Examples

To demonstrate how a site might be built from fragments, we present an example from a Web site for a French Open tennis tournament. A site architect views the player page for Steffi Graf (shown in Figure 9) as consisting of a standard header, sidebar, and footer, with biographical information and recent results thrown in. The site architect composes HTML similar to the following, establishing a general layout for the site:

<html>
<!-- %include(header.frg) -->
<table>
  <tr>
    <td><!-- %include(sidebr.frg) --></td>
    <td><table>
      <tr><!-- %fragment(graf_bio.frg) --></tr>
      <tr><!-- %fragment(graf_score.frg) --></tr>
    </table></td>
  </tr>
</table>
<!-- %include(footer.frg) -->
</html>

where footer.frg consists of

<!-- %fragment(factoid.frg) -->
<!-- %fragment(copyr.frg) -->

The form just shown for specifying fragments is concise and efficient. Since the fragments are specified by HTML comments, browsers and other entities that don't recognize fragments will simply ignore the directives.
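A sketch of how such directives could be parsed out of a template to derive the ODG's inclusion edges; the regular expression and class are ours, not the system's actual parser.

    import java.util.*;
    import java.util.regex.*;

    // Extracts the fragment names referenced by %include/%fragment
    // directives hidden in HTML comments, as in the template above.
    public class FragmentDirectiveParser {
        private static final Pattern DIRECTIVE =
            Pattern.compile("<!--\\s*%(?:include|fragment)\\((\\S+?)\\)\\s*-->");

        // Each returned name yields one inclusion edge (fragment, template)
        // in the object dependence graph.
        public static List<String> embeddedFragments(String template) {
            List<String> names = new ArrayList<>();
            Matcher m = DIRECTIVE.matcher(template);
            while (m.find()) names.add(m.group(1));
            return names;
        }
    }

Applied to footer.frg above, embeddedFragments returns factoid.frg and copyr.frg.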

Fig. 9. Sample screen shot demonstrating the use of fragments.

Our system also allows fragments to be specified using the ESI language [Tsimelzon et al.], a standard that has been developed for specifying fragments. When the first version of our publishing system was developed, ESI was not yet in existence.

Prior to the beginning of play, the contents of graf_score.frg will be empty, since no matches have commenced. This means the part of the page outlined by the dashed box in Figure 9 will, at first, be empty. The first publication of this fragment will result in the ODG seen to the right of Steffi Graf's player page in Figure 9. Again, the objects and edges within the dashed box will not yet be within the ODG, since no match play has yet occurred.

Using fragments in this way permits many architects, editors, and even automated systems to modify the page simultaneously. Our system ensures that all changes are properly included in the final page that is seen by the user. An architect updating the structure of the page does not need to know anything about copyrights, trademarks, the size of the sponsors' logos, the look and feel of the site, or any of the data that will be included on the page. Similarly, an editor wishing to change the look and feel of a site does not need to understand the structure of any particular page.

Certain types of major site changes are greatly facilitated by our publishing system. For example, changing the sidebar to reflect the end of a long event is as simple as updating sidebr.frg.

To change the look and feel of a site, an editor only needs to change header.frg and footer.frg. For both of these kinds of changes, the system will use the ODG from Figure 9 to determine that Steffi Graf's page must be rebuilt (along with many others). Once all pages have been rebuilt, they will be republished. The user will see the changes on every page, although the vast majority of underlying fragments will not have changed.

More static information, like player biographies, can be kept up to date in one place but used on many pages. For example, graf_bio.frg is used on our example page, but may also be used in many other places. To include a new photo or update the information included in the biography, the editors need only concern themselves with updating graf_bio.frg. The system ensures that all pages that include graf_bio.frg will automatically be rebuilt.

Since scoring information will change frequently once a tennis match is in progress, updating that aspect of a page can be handled by an automated process. As a match begins, graf_score.frg is updated to include the match in progress. This means that once the final has begun, graf_score.frg will consist of HTML similar to

<!-- %fragment(final.frg) -->
<!-- %fragment(semi.frg) -->

When the updated graf_score.frg is published, the system will detect that it now includes final.frg and semi.frg and will update the ODG as shown in the dashed box within Figure 9. Now, as the final match progresses, only final.frg needs to be updated and published through our system. As part of the publication process, the system will detect that final.frg is included in graf_score.frg, causing graf_score.frg to be rebuilt using the updated score. Likewise, the system will detect that Steffi Graf's page must be rebuilt as well, and a new page will be built including the updated scoring information. Eventually, when the match completes, the complete page shown in the example is produced.

The score for the final match will be displayed on many pages other than Steffi Graf's player page. For instance, Martina Hingis's player page will also include these results, as will the scoreboard page while the match is in progress. A page listing match-ups between different players will also contain the score. To update all of these pages, the automated system only updates one fragment. This keeps the automated system independent of the site design.

A more complex example of a Web page with fragments is shown in Figure 10, which depicts the Athletics Home Page from the 2000 Olympic Games Web site on October 1, 2000. Both the header and footer are in separate frames. This reduces the size of pages and the amount of information that needs to be loaded when navigating between pages. It also allows clients to access the top and bottom navigation elements at any time, since these elements do not move when scrolling through pages. The page contains a total of 46 fragments, a typical number for the Web site. The header contains 1 top-level fragment and 12 embedded fragments. The footer contains 1 top-level fragment and 3 embedded fragments. Neither the header nor the footer was changed during the games. The Athletics Home Page frame contains 9 top-level fragments and 20 embedded fragments. This page was updated frequently, and fragments were an essential component in reducing the overhead for updates.

Fig. 10. Another example of a Web page composed of fragments.

Fig. 11. The CPU time in milliseconds required to construct and publish bundles of various sizes.

4. SYSTEM PERFORMANCE AND DEPLOYMENT STATISTICS

This section describes the performance of a Java implementation of our system running on an IBM IntelliStation containing a 333 MHz Pentium II processor with 256 Mbytes of memory and the Windows NT (version 4.0) operating system. The distribution of Web page sizes is similar to the one for the 1998 Olympic Games Web site [Iyengar et al. 1999] as well as to more recent Web sites deploying our system; the average Web page size is around 10 Kbytes. Fragment sizes are typically several hundred bytes but usually less than 1 Kbyte. The distribution of fragment sizes is also representative of real Web sites deploying our system.

Figure 11 shows the CPU time in milliseconds required for constructing and publishing bundles of various sizes. Times are averaged over 100 runs. All 100 runs were submitted simultaneously, so the times in the figure reflect the ability of the runs to be executed in parallel. The solid curve depicts times when all objects that need to be constructed are explicitly triggered. The dotted line depicts times when a single fragment that is included in multiple pages is triggered; the pages that need to be built as a result of the change to the fragment are determined from the ODG. Graph traversal algorithms applied to the ODG have relatively low overhead. By contrast, each object that is triggered has to be read from disk and parsed; these operations consume considerable CPU time. As the graph indicates, it is preferable to trigger a few objects that are included in multiple pages than to trigger all objects that need to be constructed.

Our implementation allows multiple complex objects to be constructed in parallel. As a result, we are able to achieve nearly 100% CPU utilization, even when construction of an object is blocked due to I/O, by concurrently constructing other objects.

Fig. 12. The breakdown in CPU time required to construct and publish a typical complex Web page.

The breakdown of where CPU time is consumed is shown in Figure 12. CPU time is divided into the following categories:

- Retrieve, parse: time to read all triggered objects from disk and parse them to determine included fragments.
- ODG update: time for updating the ODG based on the information obtained from parsing objects and for analyzing the ODG to determine all objects that need to be updated and an efficient order for updating them.
- Assembly: time to update all objects.
- Save data: time to save all updated objects on disk.
- Send ack: time to send an acknowledgment message via HTTP that publication is complete.

In the bars marked 1 to 100, one fragment included in 100 others was triggered. The 100 pages that needed to be constructed were determined from the ODG. In the bars marked 100 to 100, the 100 pages that needed to be constructed were all triggered. The times shown in Figure 12 are the average times for a single page. The total average time for constructing and publishing a page in the 1 to 100 case is milliseconds (represented by the aggregate of all bars); the corresponding time for the 100 to 100 case is milliseconds. The retrieve and parse time is significantly higher for the 100 to 100 case because the system is reading and parsing 100 objects compared with 1 in the 1 to 100 case. Since the source for every object that is triggered must be saved, the time it takes to save the data is somewhat longer when 100 objects are triggered than when only one object is triggered.

Figure 13 shows how the construction and publication time averaged over 100 runs varies with the number of embedded fragments within a Web page. While the time grows with the number of fragments, the overhead for assembling a Web page with several embedded fragments is still relatively low.

Fig. 13. The average CPU time in milliseconds required to construct and publish a complex Web page as a function of the number of embedded fragments. In each case, one fragment in the page was triggered.

Fig. 14. The average CPU time in milliseconds required to construct and publish a complex Web page as a function of the number of fragments triggered.

A Web page constructed from multiple database accesses may require over a second of CPU time to construct from scratch without using fragments. The benefits of reducing database accesses by assembling Web pages from pre-existing fragments can thus be significant.

Figure 14 shows how the construction and publication time averaged over 100 runs varies with the number of fragments that are triggered for a Web page containing 20 fragments. Since each fragment that is triggered has to be read from disk and parsed, overhead increases with the number of fragments triggered.

4.1 Deployment Statistics

We now present statistics we collected from a deployment of our publishing system at a major Web site (the 2000 Olympic Games Web site). The statistics in this section can be used to design efficient publishing systems. This deployment did not make use of link edges or incremental publication as described in Section 2.2.

Standard third-party tools were used to check hypertext links for correctness. Pages were extensively tested to make sure that they worked as designed. The publication system used two stages. The first stage consisted of development, staging, and quality assurance. The second stage consisted of production. Real-time results were fed directly to the production server, and quality assurance for such pages was handled after the pages were published. As the site grew in size over the course of the event, such changes were only made at night.

Fig. 15. The distribution of object sizes in the source repository. Each bar represents the number of objects contained in the size range whose upper limit is shown on the X-axis.

Fig. 16. The distribution of object sizes in the assembled repository. Each bar represents the number of objects contained in the size range whose upper limit is shown on the X-axis.

Figures 15 and 16 characterize the object size distributions within two of the disk-based caches used in the publishing system.

Figure 17 shows the distribution of the number of incoming edges for ODG nodes. This graph represents the number of top-level fragments embedded in the object corresponding to the node. The total number of fragments embedded in the object may be higher due to the fact that a fragment may recursively embed another fragment. Figure 18 shows the distribution of the number of outgoing edges for ODG nodes. This graph represents the number of objects that directly embed the object corresponding to the node.


More information

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed

More information

Skynax. Mobility Management System. System Manual

Skynax. Mobility Management System. System Manual Skynax Mobility Management System System Manual Intermec by Honeywell 6001 36th Ave. W. Everett, WA 98203 U.S.A. www.intermec.com The information contained herein is provided solely for the purpose of

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Registry Tuner. Software Manual

Registry Tuner. Software Manual Registry Tuner Software Manual Table of Contents Introduction 1 System Requirements 2 Frequently Asked Questions 3 Using the Lavasoft Registry Tuner 5 Scan and Fix Registry Errors 7 Optimize Registry

More information

Change Management for Rational DOORS User s Guide

Change Management for Rational DOORS User s Guide Change Management for Rational DOORS User s Guide Before using this information, read the general information under Appendix: Notices on page 58. This edition applies to Change Management for Rational

More information

Enterprise Mobile Application Development: Native or Hybrid?

Enterprise Mobile Application Development: Native or Hybrid? Enterprise Mobile Application Development: Native or Hybrid? Enterprise Mobile Application Development: Native or Hybrid? SevenTablets 855-285-2322 Contact@SevenTablets.com http://www.seventablets.com

More information

Scaling Web Applications in a Cloud Environment using Resin 4.0

Scaling Web Applications in a Cloud Environment using Resin 4.0 Scaling Web Applications in a Cloud Environment using Resin 4.0 Abstract Resin 4.0 offers unprecedented support for deploying and scaling Java and PHP web applications in a cloud environment. This paper

More information

Improve application performance and scalability with Adobe ColdFusion 9

Improve application performance and scalability with Adobe ColdFusion 9 Adobe ColdFusion 9 Performance Brief Improve application performance and scalability with Adobe ColdFusion 9 Table of contents 1: Executive summary 2: Statistics summary 3: Existing features 7: New features

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

Unicenter Desktop DNA r11

Unicenter Desktop DNA r11 Data Sheet Unicenter Desktop DNA r11 Unicenter Desktop DNA is a scalable migration solution for the management, movement and maintenance of a PC s DNA (including user settings, preferences and data.) A

More information

Ligero Content Delivery Server. Documentum Content Integration with

Ligero Content Delivery Server. Documentum Content Integration with Ligero Content Delivery Server Documentum Content Integration with Ligero Content Delivery Server Prepared By Lee Dallas Principal Consultant Armedia, LLC April, 2008 1 Summary Ligero Content Delivery

More information

q for Gods Whitepaper Series (Edition 7) Common Design Principles for kdb+ Gateways

q for Gods Whitepaper Series (Edition 7) Common Design Principles for kdb+ Gateways Series (Edition 7) Common Design Principles for kdb+ Gateways May 2013 Author: Michael McClintock joined First Derivatives in 2009 and has worked as a consultant on a range of kdb+ applications for hedge

More information

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing Chapter 13 Disk Storage, Basic File Structures, and Hashing Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files

More information

Red Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment

Red Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment Red Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment WHAT IS IT? Red Hat Network (RHN) Satellite server is an easy-to-use, advanced systems management platform

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

An Introduction To The Web File Manager

An Introduction To The Web File Manager An Introduction To The Web File Manager When clients need to use a Web browser to access your FTP site, use the Web File Manager to provide a more reliable, consistent, and inviting interface. Popular

More information

Red Hat Satellite Management and automation of your Red Hat Enterprise Linux environment

Red Hat Satellite Management and automation of your Red Hat Enterprise Linux environment Red Hat Satellite Management and automation of your Red Hat Enterprise Linux environment WHAT IS IT? Red Hat Satellite server is an easy-to-use, advanced systems management platform for your Linux infrastructure.

More information

Novell ZENworks 10 Configuration Management SP3

Novell ZENworks 10 Configuration Management SP3 AUTHORIZED DOCUMENTATION Software Distribution Reference Novell ZENworks 10 Configuration Management SP3 10.3 November 17, 2011 www.novell.com Legal Notices Novell, Inc., makes no representations or warranties

More information

5.1 Features 1.877.204.6679. sales@fourwindsinteractive.com Denver CO 80202

5.1 Features 1.877.204.6679. sales@fourwindsinteractive.com Denver CO 80202 1.877.204.6679 www.fourwindsinteractive.com 3012 Huron Street sales@fourwindsinteractive.com Denver CO 80202 5.1 Features Copyright 2014 Four Winds Interactive LLC. All rights reserved. All documentation

More information

Understanding the Value of In-Memory in the IT Landscape

Understanding the Value of In-Memory in the IT Landscape February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to

More information

NS DISCOVER 4.0 ADMINISTRATOR S GUIDE. July, 2015. Version 4.0

NS DISCOVER 4.0 ADMINISTRATOR S GUIDE. July, 2015. Version 4.0 NS DISCOVER 4.0 ADMINISTRATOR S GUIDE July, 2015 Version 4.0 TABLE OF CONTENTS 1 General Information... 4 1.1 Objective... 4 1.2 New 4.0 Features Improvements... 4 1.3 Migrating from 3.x to 4.x... 5 2

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems Finding a needle in Haystack: Facebook

More information

Designing portal site structure and page layout using IBM Rational Application Developer V7 Part of a series on portal and portlet development

Designing portal site structure and page layout using IBM Rational Application Developer V7 Part of a series on portal and portlet development Designing portal site structure and page layout using IBM Rational Application Developer V7 Part of a series on portal and portlet development By Kenji Uchida Software Engineer IBM Corporation Level: Intermediate

More information

Availability Digest. www.availabilitydigest.com. Raima s High-Availability Embedded Database December 2011

Availability Digest. www.availabilitydigest.com. Raima s High-Availability Embedded Database December 2011 the Availability Digest Raima s High-Availability Embedded Database December 2011 Embedded processing systems are everywhere. You probably cannot go a day without interacting with dozens of these powerful

More information

Content Management Implementation Guide 5.3 SP1

Content Management Implementation Guide 5.3 SP1 SDL Tridion R5 Content Management Implementation Guide 5.3 SP1 Read this document to implement and learn about the following Content Manager features: Publications Blueprint Publication structure Users

More information

Design and Performance of a Web Server Accelerator

Design and Performance of a Web Server Accelerator Design and Performance of a Web Server Accelerator Eric Levy-Abegnoli, Arun Iyengar, Junehwa Song, and Daniel Dias IBM Research T. J. Watson Research Center P. O. Box 74 Yorktown Heights, NY 598 Abstract

More information

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip Load testing with WAPT: Quick Start Guide This document describes step by step how to create a simple typical test for a web application, execute it and interpret the results. A brief insight is provided

More information

Patterns of Information Management

Patterns of Information Management PATTERNS OF MANAGEMENT Patterns of Information Management Making the right choices for your organization s information Summary of Patterns Mandy Chessell and Harald Smith Copyright 2011, 2012 by Mandy

More information

Original-page small file oriented EXT3 file storage system

Original-page small file oriented EXT3 file storage system Original-page small file oriented EXT3 file storage system Zhang Weizhe, Hui He, Zhang Qizhen School of Computer Science and Technology, Harbin Institute of Technology, Harbin E-mail: wzzhang@hit.edu.cn

More information

DnsCluster: A networking tool for automatic domain zone updating

DnsCluster: A networking tool for automatic domain zone updating DnsCluster: A networking tool for automatic domain zone updating Charalambos Alatas and Constantinos S. Hilas * Dept. of Informatics and Communications Technological Educational Institute of Serres Serres,

More information

F9 Integration Manager

F9 Integration Manager F9 Integration Manager User Guide for use with QuickBooks This guide outlines the integration steps and processes supported for the purposes of financial reporting with F9 Professional and F9 Integration

More information

WHITE PAPER: DATA PROTECTION. Veritas NetBackup for Microsoft Exchange Server Solution Guide. Bill Roth January 2008

WHITE PAPER: DATA PROTECTION. Veritas NetBackup for Microsoft Exchange Server Solution Guide. Bill Roth January 2008 WHITE PAPER: DATA PROTECTION Veritas NetBackup for Microsoft Exchange Server Solution Guide Bill Roth January 2008 White Paper: Veritas NetBackup for Microsoft Exchange Server Solution Guide Content 1.

More information

Lightweight Data Integration using the WebComposition Data Grid Service

Lightweight Data Integration using the WebComposition Data Grid Service Lightweight Data Integration using the WebComposition Data Grid Service Ralph Sommermeier 1, Andreas Heil 2, Martin Gaedke 1 1 Chemnitz University of Technology, Faculty of Computer Science, Distributed

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

CentreWare Internet Services Setup and User Guide. Version 2.0

CentreWare Internet Services Setup and User Guide. Version 2.0 CentreWare Internet Services Setup and User Guide Version 2.0 Xerox Corporation Copyright 1999 by Xerox Corporation. All rights reserved. XEROX, The Document Company, the digital X logo, CentreWare, and

More information

Business Continuity: Choosing the Right Technology Solution

Business Continuity: Choosing the Right Technology Solution Business Continuity: Choosing the Right Technology Solution Table of Contents Introduction 3 What are the Options? 3 How to Assess Solutions 6 What to Look for in a Solution 8 Final Thoughts 9 About Neverfail

More information

TIBCO ActiveSpaces Use Cases How in-memory computing supercharges your infrastructure

TIBCO ActiveSpaces Use Cases How in-memory computing supercharges your infrastructure TIBCO Use Cases How in-memory computing supercharges your infrastructure is a great solution for lifting the burden of big data, reducing reliance on costly transactional systems, and building highly scalable,

More information

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

More information

How To Backup A Database In Navision

How To Backup A Database In Navision Making Database Backups in Microsoft Business Solutions Navision MAKING DATABASE BACKUPS IN MICROSOFT BUSINESS SOLUTIONS NAVISION DISCLAIMER This material is for informational purposes only. Microsoft

More information

Maximum Availability Architecture. Oracle Best Practices For High Availability. Backup and Recovery Scenarios for Oracle WebLogic Server: 10.

Maximum Availability Architecture. Oracle Best Practices For High Availability. Backup and Recovery Scenarios for Oracle WebLogic Server: 10. Backup and Recovery Scenarios for Oracle WebLogic Server: 10.3 An Oracle White Paper January, 2009 Maximum Availability Architecture Oracle Best Practices For High Availability Backup and Recovery Scenarios

More information

Case Management for Blaise using Lotus Notes. Fred Wensing, Australian Bureau of Statistics Brett Martin, Statistics New Zealand

Case Management for Blaise using Lotus Notes. Fred Wensing, Australian Bureau of Statistics Brett Martin, Statistics New Zealand Case Management for Blaise using Lotus Notes Fred Wensing, Australian Bureau of Statistics Brett Martin, Statistics New Zealand Introduction A significant aspect of field based interviewing is the need

More information

Getting started with API testing

Getting started with API testing Technical white paper Getting started with API testing Test all layers of your composite applications, not just the GUI Table of contents Executive summary... 3 Introduction... 3 Who should read this document?...

More information

Avaya P330 Load Balancing Manager User Guide

Avaya P330 Load Balancing Manager User Guide Avaya P330 Load Balancing Manager User Guide March 2002 Avaya P330 Load Balancing Manager User Guide Copyright 2002 Avaya Inc. ALL RIGHTS RESERVED The products, specifications, and other technical information

More information

Vector HelpDesk - Administrator s Guide

Vector HelpDesk - Administrator s Guide Vector HelpDesk - Administrator s Guide Vector HelpDesk - Administrator s Guide Configuring and Maintaining Vector HelpDesk version 5.6 Vector HelpDesk - Administrator s Guide Copyright Vector Networks

More information

Web-based Reporting and Tools used in the QA process for the SAS System Software

Web-based Reporting and Tools used in the QA process for the SAS System Software Web-based Reporting and Tools used in the QA process for the SAS System Software Jim McNealy, SAS Institute Inc., Cary, NC Dawn Amos, SAS Institute Inc., Cary, NC Abstract SAS Institute Quality Assurance

More information

orrelog Ping Monitor Adapter Software Users Manual

orrelog Ping Monitor Adapter Software Users Manual orrelog Ping Monitor Adapter Software Users Manual http://www.correlog.com mailto:info@correlog.com CorreLog, Ping Monitor Users Manual Copyright 2008-2015, CorreLog, Inc. All rights reserved. No part

More information

Solutions for Disaster Recovery Using Grass Valley Integrated Playout Systems

Solutions for Disaster Recovery Using Grass Valley Integrated Playout Systems APPLICATION NOTE Solutions for Disaster Recovery Using Grass Valley Integrated Playout Systems Alex Lakey and Guido Bozward February 2012 In mission-critical broadcast playout environments, the best practice

More information

A Tool for Evaluation and Optimization of Web Application Performance

A Tool for Evaluation and Optimization of Web Application Performance A Tool for Evaluation and Optimization of Web Application Performance Tomáš Černý 1 cernyto3@fel.cvut.cz Michael J. Donahoo 2 jeff_donahoo@baylor.edu Abstract: One of the main goals of web application

More information

The Benefits of Continuous Data Protection (CDP) for IBM i and AIX Environments

The Benefits of Continuous Data Protection (CDP) for IBM i and AIX Environments The Benefits of Continuous Data Protection (CDP) for IBM i and AIX Environments New flexible technologies enable quick and easy recovery of data to any point in time. Introduction Downtime and data loss

More information

Symantec Mail Security for Domino

Symantec Mail Security for Domino Getting Started Symantec Mail Security for Domino About Symantec Mail Security for Domino Symantec Mail Security for Domino is a complete, customizable, and scalable solution that scans Lotus Notes database

More information

New Generation of Software Development

New Generation of Software Development New Generation of Software Development Terry Hon University of British Columbia 201-2366 Main Mall Vancouver B.C. V6T 1Z4 tyehon@cs.ubc.ca ABSTRACT In this paper, I present a picture of what software development

More information

Google Sites: Creating, editing, and sharing a site

Google Sites: Creating, editing, and sharing a site Google Sites: Creating, editing, and sharing a site Google Sites is an application that makes building a website for your organization as easy as editing a document. With Google Sites, teams can quickly

More information

SiteCelerate white paper

SiteCelerate white paper SiteCelerate white paper Arahe Solutions SITECELERATE OVERVIEW As enterprises increases their investment in Web applications, Portal and websites and as usage of these applications increase, performance

More information

NetApp Big Content Solutions: Agile Infrastructure for Big Data

NetApp Big Content Solutions: Agile Infrastructure for Big Data White Paper NetApp Big Content Solutions: Agile Infrastructure for Big Data Ingo Fuchs, NetApp April 2012 WP-7161 Executive Summary Enterprises are entering a new era of scale, in which the amount of data

More information

EMC DOCUMENTUM MANAGING DISTRIBUTED ACCESS

EMC DOCUMENTUM MANAGING DISTRIBUTED ACCESS EMC DOCUMENTUM MANAGING DISTRIBUTED ACCESS This white paper describes the various distributed architectures supported by EMC Documentum and the relative merits and demerits of each model. It can be used

More information

In-memory Tables Technology overview and solutions

In-memory Tables Technology overview and solutions In-memory Tables Technology overview and solutions My mainframe is my business. My business relies on MIPS. Verna Bartlett Head of Marketing Gary Weinhold Systems Analyst Agenda Introduction to in-memory

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Index Terms Domain name, Firewall, Packet, Phishing, URL.

Index Terms Domain name, Firewall, Packet, Phishing, URL. BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet

More information

This paper defines as "Classical"

This paper defines as Classical Principles of Transactional Approach in the Classical Web-based Systems and the Cloud Computing Systems - Comparative Analysis Vanya Lazarova * Summary: This article presents a comparative analysis of

More information

CHAPTER 17: File Management

CHAPTER 17: File Management CHAPTER 17: File Management The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides

More information

VERITAS Cluster Server v2.0 Technical Overview

VERITAS Cluster Server v2.0 Technical Overview VERITAS Cluster Server v2.0 Technical Overview V E R I T A S W H I T E P A P E R Table of Contents Executive Overview............................................................................1 Why VERITAS

More information

Model Simulation in Rational Software Architect: Business Process Simulation

Model Simulation in Rational Software Architect: Business Process Simulation Model Simulation in Rational Software Architect: Business Process Simulation Mattias Mohlin Senior Software Architect IBM The BPMN (Business Process Model and Notation) is the industry standard notation

More information

1 An application in BPC: a Web-Server

1 An application in BPC: a Web-Server 1 An application in BPC: a Web-Server We briefly describe our web-server case-study, dwelling in particular on some of the more advanced features of the BPC framework, such as timeouts, parametrized events,

More information

Multimedia Applications. Mono-media Document Example: Hypertext. Multimedia Documents

Multimedia Applications. Mono-media Document Example: Hypertext. Multimedia Documents Multimedia Applications Chapter 2: Basics Chapter 3: Multimedia Systems Communication Aspects and Services Chapter 4: Multimedia Systems Storage Aspects Chapter 5: Multimedia Usage and Applications Documents

More information

Addressing Mobile Load Testing Challenges. A Neotys White Paper

Addressing Mobile Load Testing Challenges. A Neotys White Paper Addressing Mobile Load Testing Challenges A Neotys White Paper Contents Introduction... 3 Mobile load testing basics... 3 Recording mobile load testing scenarios... 4 Recording tests for native apps...

More information

Tableau Server Scalability Explained

Tableau Server Scalability Explained Tableau Server Scalability Explained Author: Neelesh Kamkolkar Tableau Software July 2013 p2 Executive Summary In March 2013, we ran scalability tests to understand the scalability of Tableau 8.0. We wanted

More information

Server Manager Performance Monitor. Server Manager Diagnostics Page. . Information. . Audit Success. . Audit Failure

Server Manager Performance Monitor. Server Manager Diagnostics Page. . Information. . Audit Success. . Audit Failure Server Manager Diagnostics Page 653. Information. Audit Success. Audit Failure The view shows the total number of events in the last hour, 24 hours, 7 days, and the total. Each of these nodes can be expanded

More information

2) What is the structure of an organization? Explain how IT support at different organizational levels.

2) What is the structure of an organization? Explain how IT support at different organizational levels. (PGDIT 01) Paper - I : BASICS OF INFORMATION TECHNOLOGY 1) What is an information technology? Why you need to know about IT. 2) What is the structure of an organization? Explain how IT support at different

More information

Chapter 1 - Web Server Management and Cluster Topology

Chapter 1 - Web Server Management and Cluster Topology Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management

More information

PTC Integrity Eclipse and IBM Rational Development Platform Guide

PTC Integrity Eclipse and IBM Rational Development Platform Guide PTC Integrity Eclipse and IBM Rational Development Platform Guide The PTC Integrity integration with Eclipse Platform and the IBM Rational Software Development Platform series allows you to access Integrity

More information

Cloud Server. Parallels. An Introduction to Operating System Virtualization and Parallels Cloud Server. White Paper. www.parallels.

Cloud Server. Parallels. An Introduction to Operating System Virtualization and Parallels Cloud Server. White Paper. www.parallels. Parallels Cloud Server White Paper An Introduction to Operating System Virtualization and Parallels Cloud Server www.parallels.com Table of Contents Introduction... 3 Hardware Virtualization... 3 Operating

More information

SharePoint Server 2010 Capacity Management: Software Boundaries and Limits

SharePoint Server 2010 Capacity Management: Software Boundaries and Limits SharePoint Server 2010 Capacity Management: Software Boundaries and s This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references,

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Step into the Future: HTML5 and its Impact on SSL VPNs

Step into the Future: HTML5 and its Impact on SSL VPNs Step into the Future: HTML5 and its Impact on SSL VPNs Aidan Gogarty HOB, Inc. Session ID: SPO - 302 Session Classification: General Interest What this is all about. All about HTML5 3 useful components

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information