50 November 2003/Vol. 46, No. 11 COMMUNICATIONS OF THE ACM

Transcription

1 50 November 2003/Vol. 46, No. 11 COMMUNICATIONS OF THE ACM

2 BY IAN FOSTER AND ROBERT L. GROSSMAN DATA INTEGRATION IN A BANDWIDTH-RICH WORLD Inexpensive storage and wide-area bandwidth (with prices for both declining at least as fast as Moore s Law) drive demand for middleware to integrate, correlate, compare, and mine local, remote, and distributed data. Exponential advances in sensors, storage systems, and computers are producing data of unprecedented quantity and quality. Multi-terabyte and even petabyte (1,000TB) data sets are emerging as major assets. For example, the climate science community has access to hundreds of terabytes of observational data from NASA s Earth-observing system and simulation data from high-performance climate models; these data sources can yield new insights into global change. The World-Wide Telescope linking hundreds of digital sky surveys is revolutionizing astronomy [11]. And in industry, multi-terabyte (soon to be petabyte) data warehouses of consumer transactional data are increasingly common. IN-SPIRALING MERGER OF TWO BLACK HOLES. SWIRLING RED TENDRILS ARE OUTWARD-TRAVELING GRAVITATIONAL WAVES. (SIMULATION DATA: PETER DIENER AND THOMAS RADKE, BOTH MAX PLANCK INSTITUTE FOR GRAVITATIONPHYSICS, ALBERT EINSTEIN INSTITUTE/POTSDAM GEANY; VISUALIZATION: JOHN SHALF USING THE VISAPULT TOOL DEVELOPED BY WES BETHEL, LAWRENCE BERKELEY NATIONAL LABORATORY) COMMUNICATIONS OF THE ACM November 2003/Vol. 46, No

3 The key to deriving insight and knowledge is often the correlation of data from multiple sources, as these examples show. The traditional paradigm for such syntheses is to gather data at a single location and transform it into a common format prior to exploring it. However, the expense of this approach in terms of network resources has meant that most data is never correlated or compared to other data. In a world of more and more data, storage systems, computers, and networks, it is both necessary and feasible for system architects to think in terms of a new paradigm based on data integration the flexible and managed federation, exploration, and processing of data from many sources. One factor driving this new paradigm is dramatic improvements in network performance. Few Internet1 networks move data at more than a megabit per second (Mbps), taking weeks to move a terabyte. Fortunately, advances in networking technologies are ushering in an era of bandwidth abundance based on Tbps optical backbones providing routine access to end-toend paths of 10Gbps or more a four-ordersof-magnitude improvement (see the article by DeFanti et al. in this section). For example, in 2002 Earth science data striped over a three-node cluster was recently transported at a rate of 2.4Gbps between Amsterdam and Chicago, a terabyte in an hour [9]. Just as critical for effective data integration, and our focus here, is the distributed system middleware beginning to allow distributed communities, or virtual organizations, to access and share data, networks, and other resources in a controlled and secure manner. Recent advances promise to provide required capabilities. For example, Open Grid Services Architecture (OGSA) standards and technologies provide for the secure and reliable virtualization and management of distributed data and computing resources [2, 6]. And data Web infrastructures support discovery, exploration, analysis, integration, and mining of remote and distributed data [10]. Such efforts are pioneering a new generation of distributed data discovery, access, and exploration technologies promising to transform the Internet into a data-integration platform. On it, users will be able to perform sophisticated operations on remote and distributed petascale data sets (see the sidebar Data-Integration Technologies ). Petascale Scenarios The following scenarios illustrate applications impossible today but achievable over optical networks with only the help of data services. Virtual data warehouses. Today, data warehouses are centralized repositories of data used for reporting and querying. High-speed optical networks make it possible for data to instead be stored at its source. When reports are required, bandwidth can be requested, data merged from multiple sources, and reports generated using the most current data. In effect, virtual data warehouses are constructed on the fly. Here, as elsewhere, new data architectures become possible when wide-area networks are able to transport data at speeds comparable to that of a computer s backplane. Security service Discovery R Access Data replication for business continuity. Businesses providing critical infrastructure for disaster recovery and business continuity are increasingly locating secondary, even tertiary, backup facilities far from primary sites. In the past, large data volumes, high transaction speeds, slow networks, and poor distributed data management infrastructure made such distributed architectures difficult or impossible. However, with all-optical networks and distributed data services, it becomes feasible to consider replicating transformations from core systems to remote backup systems. Financial exchanges, telecommunication systems, reservation systems, and dispatch and scheduling operations performed by vendors are examples of systems that can be replicated in this way. Stream-based distributed processing of sensor data. Large centralized detectors and distributed sensor nets in such fields as physics, astronomy, seismology, and national security produce high-volume data streams requiring extensive processing prior to analysis. Today, processing is performed offline, and data sets are pre- R Policy service Figure 1. Major components and activities in a data-integration architecture. Happy users interact with various public or private registries, each providing a particular view of available data, to discover candidate data. They then dispatch requests (dark arrows) to access and/or explore (white circles) remote data. Each such request, along with resulting interstorage-system transfers (dashed and dotted arrows), is subject to resource management controls at various points (labeled ), typically under the control of security and policy services. 52 November 2003/Vol. 46, No. 11 COMMUNICATIONS OF THE ACM

4 IN EFFECT, DATA WAREHOUSES ARE CONSTRUCTED ON THE FLY. pared and distributed only periodically. Optical networks and data-integration services can enable a new paradigm in which even large data sets are continuously updated, so users always have access to the most current data. Data can also be merged from multiple sources, processed in real time, and analyzed for changes, alerts, and other significant patterns. Requirements and Technologies Distributed data sources can be diverse in their formats, schema, quality, access mechanisms, ownership, access policies, and capabilities. Overcoming this multi-tiered Tower of Babel to achieve distributed data integration requires technical solutions and standards in three closely related areas: data discovery and access; data exploration and analysis; and resource management, security, and policy (see Figure 1). Data discovery and access. The first step in integrating data is discovering data that may be relevant, often through middleware that examines metadata. Meta- The following projects are contributing to the development of data-integration middleware: Data Web. This open source, Web-based software supports access, exploration, analysis, integration, and mining of remote and distributed data (see Earth System Grid. This U.S. government-funded project applies data Grid technologies to the integration of Earth system modeling data (see EU DataGrid. This European Union-funded project develops and applies data Grid technologies in highenergy physics and other domains (see Globus Toolkit. This open-source software provides the basic infrastructure for many Grid deployments worldwide (see Open Grid Services Architecture. This integration of Grid and Web services technologies defines standard interfaces and behaviors for distributed system integration and management (see and Data-Integration Technologies OGSA Data Access and Integration. This Global Grid Forum working group defines service-oriented interfaces for manipulating distributed data sources (see Storage Resource Broker. This data access and federation technology provides data-mediation functions for data-intensive science (see Semantic Web. This extension of the Web aims to give information well-defined meaning (see Virtual Data Toolkit. This product of the National Science Foundation s GriPhyN project (see integrates data management and analysis technologies, including the Globus Toolkit, Condor, and Chimera. Web Services. This widely used set of standards specifies how applications define, discover, and access network-accessible services (see /ws). c COMMUNICATIONS OF THE ACM November 2003/Vol. 46, No

5 data can be represented, federated, and accessed in a variety of ways. Relevant technologies include Web services mechanisms. For example, there s the Web Services Description Language specifications; Gridenabled data access and integration services [2]; directory services (such the Lightweight Directory Access Protocol); XML and relational databases; Semantic Web technologies [5]; and text-based Web search mechanisms applied to unstructured text-based metadata. Having identified data sets that might be relevant, the next step for the user is to access the data to see whether it is likely to be relevant and actually worth investigating. Data formats, schema, and access mechanisms span a broad range. Widely adopted access mechanisms include: the Open source project for a Network Data Access Protocol (OPeNDAP) in the environmental community; Storage Resource Broker (SRB) [3] in scientific projects; Data Web protocols for data mining (the Data Space Transfer Protocol, or DSTP); and GridFTP for high-performance and striped data movement. The OGSA-based Data Access and Integration (OGSA-DAI) [2] standards emerging from the Global Grid Forum seek to integrate these and other approaches. Data access can demand high transport performance and require parallel data access and movement. For example, if remote data is being delivered at a rate of 1Gbps, and a particular application s data-integration activity involves reading 10 local bytes per remote byte received and performing 100 operations per local byte read, then the application requires 10Gbps local read bandwidth and 1Teraops/sec. of local computing to keep up with data delivery (a substantial and necessarily parallel resource). Striping data using multiple network connections linking pairs of nodes in distributed clusters is becoming a core technique in high-performance data transport [10]. The GridFTP extensions to the popular FTP protocol represent a standard approach to exploiting parallelism in data transfers, allowing multiple data channels to be coordinated via FTP control channel commands. Also relevant is the work on advanced protocols described in the article by Falk et al. in this section. We anticipate the emergence of data access services supporting the flexible creation and manipulation of views on data sources (whether files or tables) and access to those views using a variety of operations, including database-style operations (such as SQL select ) and other more general operations (such as attribute selection, row selection via range queries, and record selection via sampling). Integrating these mechanisms with high-performance transport protocols remains a major unresolved problem. Data exploration and analysis. Data rendered accessible can be analyzed in detail. Here, data exploration services are needed to address the challenges inherent in finding relevant data that can be combined with local data or with other remote data to achieve new discoveries. These services can provide basic statistical summaries, enable visual exploration of data, and support standard exploratory functions (such as building clusters), computing the regression of one variable on another. Efficient integration of distributed data requires Figure 2. Computer scientists have a good understanding of how to perform relational joins when data is at rest in a single location. An important method for data integration is to join distributed data in motion to look for patterns across data sets. An experiment at the igrid 2002 Conference in Amsterdam integrated (on the fly) climate data from Chicago with vegetation data in Amsterdam at transfer rates greater than 2.4Gbps, a land-speed record at the time; integration involved two distributed three-node clusters and employed the SABUL data-transport protocol. CCM3 data in Chicago (x i, r i, s i, t i ) Vegetation data in Amsterdam (y i, r i, s i, t i ) protocols and services for managing the data records constituting data archives. Unlike files of bits, data archives of records have attributes, attribute metadata, keys, and missing values. Mechanisms for providing attribute- and record-based access to remote and distributed data include: SQL-based access methods for relational data; protocols designed to work with remote data (such as the Data Web Transfer Protocol [10], OPeNDAP, and OGSA-DAI [2]); and protocols designed to work with remote and distributed semistructured data (such as XPath). Data Webs support the exploration and mining of distributed data using templated data-mining operations. The transformation, analysis, and synthesis performed during data integration can be complex and 54 November 2003/Vol. 46, No. 11 COMMUNICATIONS OF THE ACM

6 THE EXPERIMENT DEMONSTRATED CONCLUSIVELY THAT GEOGRAPHICAL DISTANCE NEED NOT BE AN OBSTACLE TO DATA INTEGRATION. computationally intensive. Data-transformation primitives incorporated into data middleware cannot capture arbitrary computations but can express many common data-preparation operations [10]. More general workflow services are also required to support the integration and scheduling of arbitrary user- and community-defined transformations. Users benefit from tools that record, organize, and exploit knowledge about how these activities derive new data from old. Virtual data systems aim to capture this information so as to allow reuse of generated data, explanation of data provenance, and other activities [8]. Resource management, security, and policy. Being famiiliar with today s bandwidth- and data-poor world, users often assume only standard schema and access methods are required to render remote data accessible. But the distributed analysis of large quantities of data is computationally (and bandwidth) intensive, and a high-performance Internet can expose popular data resources to the risk of essentially unlimited loads. Efficient petascale data integration can require the harnessing and coordinated management of multiple computational and network resources at multiple sites. Thus, clients (and brokers acting on their behalf) need to negotiate service level agreements (SLAs) with computers, storage systems, and networks. They also need to deploy applications able to achieve desired end-to-end performance across these resources, as well as monitor performance and adapt to performance problems at either the network or SLA level [7]. For example, an application might request an end-to-end optical network plus associated computing and storage resources, use the resources to integrate remote and local data, then release them. Another effective optimization is to decouple data movement and computation so the data is staged to locations near (in terms of some access cost metric) to where it is required [12]. Data replication and distribution of data across the network [4] are also effective techniques. Along with the data itself, the physical resources employed for data integration are frequently precious and thus subject to access controls. Data-integration middleware must therefore provide comprehensive security, policy, and resource management solutions. These solutions are required at multiple levels, ranging from the individual user ( Can I access this file? ), to the user community ( How many Gb-hours is this community allocated? ), and from the local ( Allocate me 1Gbps bandwidth ), to the end-to-end ( Allocate resources to achieve 10Gbps throughput for this pipeline ), to the global ( Ensure that the most popular data sets are replicated ). Security and policy solutions must address the concerns of both the institutions that own specific resources and the communities wishing to achieve distributed analysis. Implications Two examples from the sciences illustrate some practical implications and applications of these issues: Joins of distributed Earth science data. The National Center for Atmospheric Research s Community Climate Model 3 (CCM3) helps research CO 2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer, and nuclear winter. Scientists regularly want to integrate their data with CCM3 data. For example, they might wish to join their historical data about vegetation levels with CCM3 data to study the effect of global climate change on certain types of vegetation. A typical dataintegration operation is to join a field x i, say, tempera- COMMUNICATIONS OF THE ACM November 2003/Vol. 46, No

7 x 200k gettargetregion gettargetregion SDSS getbufferregion DAG target(1).fit 200k buffer(1).fit brgsearch 10M brg(1).par bcgsearch 25k cores(1).par 25k getcorebuffer coresbuffer(1).fit 5M bcgcoalesce parameters.par clusters(1).par 25k getcluster Sloan Data Galaxy cluster size distribution job[id] clusters(2).fit getgalaxies 300k galcatalog(2).fit 300k getcatalog catalog.fit galcatalog.fit t[s] Figure 3. The steps involved in galaxy cluster detection in Sloan data, showing (left) the pipeline and (right) the image data, a small directed acyclic graph (DAG), execution schedule for that DAG, and example output data. ture from the remote CCM3 data set that includes a key k i consisting of a latitude-longitude-time triple (r i, s i,, t i, ) with a field y i from the other data set representing a vegetation level for the same key (r i, s i,, t i, ). In this way, scientists can estimate functional relationships of the form y = f(k; x) to capture how vegetation levels change over time with changes in climatic variables. In one study, the goal was to integrate data on the fly, without co-locating it, in order to obtain an estimate of whether such a relationship is probable, in which case more careful follow-up studies would be needed. The study was performed in conjunction with the igrid 2002 conference in Amsterdam, The Netherlands, evaluating various algorithms for transporting and joining distributed streams of data indexed by latitude, longitude, and time (see Figure 2) [9]. One stream contained temperature and related CCM3 data, the other vegetation levels. One data set was located on a three-node cluster in Chicago, the other on a three-node cluster in Amsterdam. The DataSpace Data Web software was used to move the data across the Atlantic and perform a streaming join of it in Amsterdam; a parallel version of the Simple Available Bandwidth Utilization Library (SABUL) protocol was used for data transport, and DSTP was used to manage keys, metadata, and data. Data was moved at a rate greater than 900Mbps per node (2.4Gbps, or 1TB/hour, with a three-node cluster) and merged at approximately half that speed [9]. This experiment demonstrated conclusively that geographical distance need not be an obstacle to data integration. Galaxy cluster identification in Sloan data. The Sloan Digital Sky Survey (SDSS) is a digital imaging survey that will, by 2007, have mapped a quarter of the sky in five colors with a sensitivity two orders of magnitude greater than previous large sky surveys. The SDSS data is being made available online as both a large collection (~10TB) of images and a smaller set of catalogs (~2TB) containing measurements on each of 250 million detected objects. SDSS is just one example of a growing set of digital sky survey projects that will soon yield an unprecedented international, distributed multi-petabyte collection of digital astronomical data [11]. Another recent experiment [1] showed how this online data could be integrated with distributed computing and storage resources to perform computationally intensive analysis of unprecedented scale. The challenge was to search the Sloan database for galaxy clusters, the largest gravitationally dominated structures in the universe. Software developed for the Gri- PhyN project the so-called Virtual Data 56 November 2003/Vol. 46, No. 11 COMMUNICATIONS OF THE ACM

8 DATA INTEGRATION PROMISES TO HAVE AT LEAST AS GREAT AN EFFECT AS DATA MINING HAS HAD. Toolkit was used to plan, then manage the required workflow (see Figure 3), ultimately involving computational clusters at four sites across the U.S. This illustrates how even large-scale distributed data analysis tasks might become routine once appropriate infrastructure is in place. Conclusion The data tsunami already upon us offers great opportunities for new insight and knowledge but demands significant advances in middleware for integrating data from diverse distributed sources. That s why we have sought to explore here not only the state of the art but likely future directions for this middleware. Data mining emerged from statistics as a new discipline during the past decade, as large data sets became more and more common and the need for new technologies to mine them became critical. In the coming decade, data integration will emerge from distributed computing and data mining, fueled by the increasing number of distributed data sets and enabled by improving network performance. Data integration promises to have at least as great an effect as data mining has had. c References 1. Annis, J., Zhao, Y., Voeckler, J., Wilde, M., Kent, S., and Foster, I. Applying Chimera virtual data concepts to cluster finding in the Sloan Sky Survey. In Proceedings of SC2002 (Baltimore, MD, Nov ). ACM Press, New York, Atkinson, M., Chervenak, A., Kunszt, P., Narang, I., Paton, N., Pearson, D., Shoshani, A., and Watson, P. Data access, integration, and management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA, Baru, C., Moore, R., Rajasekar, A., and Wan, M. The SDSC storage resource broker. In Proceedings of the 8th Annual IBM Centers for Advanced Studies Conference (Toronto, Canada, 1998). 4. Beck, M., Moore, T., and Plank, J. An end-to-end approach to globally scalable network storage In Proceedings of ACM Sigcomm 02 (Pittsburgh, PA, Aug ). ACM Press, 2002,. 5. Berners-Lee, T., Hendler, J., and Lassila, O. The Semantic Web. Sci. Am. 284, 5 (May 2001), Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., and Tuecke, S. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific data sets. J. Net. Comput. Applic. 23, 3 (July 2000), Czajkowski, K., Foster, I., and Kesselman, C. Resource and Service Management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA, Foster, I., Voeckler, J., Wilde, M., and Zhao, Y. The Virtual Data Grid: A new model and architecture for data-intensive collaboration. In Proceedings of the Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 5 8, 2003). 9. Grossman, R., Gu, Y., Hanley, D., Hong, X., Lillethun, D., Levera, J., Mambretti, J., Mazzucco, M., and Weinberger, J. Experimental studies using photonic data services at igrid Future Gen. Comput. Syst. 19, 6 (2003). 10. Grossman, R. Standards and infrastructures for data mining. Commun. ACM 45, 8 (Aug. 2002), Szalay, A. and Gray, J. The World-Wide Telescope. Science 293 (2001), Thain, D., Basney, J., Son, S.-C., and Livny, M. The Kangaroo approach to data movement on the Grid. In Proceedings of the 10th IEEE International Symposium on High-Performance Distributed Computing (San Francisco, CA, Aug. 7 9). IEEE Computer Society Press, New York, 2001, 7 9. Ian Foster (foster@mcs.anl.gov) is associate division director and senior scientist at Argonne National Laboratory, Argonne, IL, and a professor of computer science at The University of Chicago. Robert L. Grossman (grossman@uic.edu) is director of the Laboratory of Advanced Computing and the National Center for Data Mining at the University of Illinois at Chicago and president of the Two Cultures Group, Chicago. This work is supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, SciDAC Program, U.S. Department of Energy, under Contract W ENG-38, and by the National Science Foundation under contract ITR (GriPhyN) and cooperative agreement ANI (OptIPuter). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee ACM /03/1100 $5.00 COMMUNICATIONS OF THE ACM November 2003/Vol. 46, No