50 November 2003/Vol. 46, No. 11 COMMUNICATIONS OF THE ACM


DATA INTEGRATION IN A BANDWIDTH-RICH WORLD

By Ian Foster and Robert L. Grossman

Inexpensive storage and wide-area bandwidth (with prices for both declining at least as fast as Moore's Law) drive demand for middleware to integrate, correlate, compare, and mine local, remote, and distributed data.

Exponential advances in sensors, storage systems, and computers are producing data of unprecedented quantity and quality. Multi-terabyte and even petabyte (1,000TB) data sets are emerging as major assets. For example, the climate science community has access to hundreds of terabytes of observational data from NASA's Earth-observing system and simulation data from high-performance climate models; these data sources can yield new insights into global change. The World-Wide Telescope linking hundreds of digital sky surveys is revolutionizing astronomy [11]. And in industry, multi-terabyte (soon to be petabyte) data warehouses of consumer transactional data are increasingly common.

In-spiraling merger of two black holes. Swirling red tendrils are outward-traveling gravitational waves. (Simulation data: Peter Diener and Thomas Radke, both Max Planck Institute for Gravitational Physics, Albert Einstein Institute/Potsdam, Germany; visualization: John Shalf using the Visapult tool developed by Wes Bethel, Lawrence Berkeley National Laboratory.)

The key to deriving insight and knowledge is often the correlation of data from multiple sources, as these examples show. The traditional paradigm for such syntheses is to gather data at a single location and transform it into a common format prior to exploring it. However, the expense of this approach in terms of network resources has meant that most data is never correlated or compared to other data. In a world of more and more data, storage systems, computers, and networks, it is both necessary and feasible for system architects to think in terms of a new paradigm based on data integration: the flexible and managed federation, exploration, and processing of data from many sources.

One factor driving this new paradigm is dramatic improvements in network performance. Few Internet1 networks move data at more than a megabit per second (Mbps), taking weeks to move a terabyte. Fortunately, advances in networking technologies are ushering in an era of bandwidth abundance based on Tbps optical backbones providing routine access to end-to-end paths of 10Gbps or more, a four-orders-of-magnitude improvement (see the article by DeFanti et al. in this section). For example, in 2002, Earth science data striped over a three-node cluster was transported at a rate of 2.4Gbps between Amsterdam and Chicago, a terabyte in an hour [9].

Just as critical for effective data integration, and our focus here, is the distributed system middleware beginning to allow distributed communities, or virtual organizations, to access and share data, networks, and other resources in a controlled and secure manner. Recent advances promise to provide required capabilities. For example, Open Grid Services Architecture (OGSA) standards and technologies provide for the secure and reliable virtualization and management of distributed data and computing resources [2, 6]. And data Web infrastructures support discovery, exploration, analysis, integration, and mining of remote and distributed data [10].
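To make the bandwidth comparison above concrete, the following sketch computes the ideal time to move one terabyte at the three rates mentioned in the text. It assumes a decimal terabyte (10^12 bytes) and ignores protocol overhead and congestion; the name `transfer_time_seconds` is illustrative and does not come from the article.

```python
TERABYTE_BYTES = 10**12  # assumption: decimal terabyte, as is usual for network rates


def transfer_time_seconds(num_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead and congestion."""
    return num_bytes * 8 / rate_bits_per_s


# Rates taken from the surrounding text.
rates = {
    "1 Mbps": 1e6,
    "2.4 Gbps": 2.4e9,
    "10 Gbps": 10e9,
}

for label, rate in rates.items():
    seconds = transfer_time_seconds(TERABYTE_BYTES, rate)
    print(f"{label:>9}: {seconds / 3600:10.2f} hours ({seconds / 86400:7.2f} days)")
```

Under these assumptions, 1 Mbps needs roughly 93 days, 2.4 Gbps a little under an hour, and 10 Gbps under 15 minutes, which is consistent with the "weeks" and "a terabyte in an hour" figures quoted above.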
Such efforts are pioneering a new generation of distributed data discovery, access, and exploration technologies promising to transform the Internet into a data-integration platform. On it, users will be able to perform sophisticated operations on remote and distributed petascale data sets (see the sidebar "Data-Integration Technologies").

Petascale Scenarios

The following scenarios illustrate applications impossible today but achievable over optical networks, though only with the help of data services.

Virtual data warehouses. Today, data warehouses are centralized repositories of data used for reporting and querying. High-speed optical networks make it possible for data to instead be stored at its source. When reports are required, bandwidth can be requested, data merged from multiple sources, and reports generated using the most current data. In effect, virtual data warehouses are constructed on the fly. Here, as elsewhere, new data architectures become possible when wide-area networks are able to transport data at speeds comparable to that of a computer's backplane.

Data replication for business continuity. Businesses providing critical infrastructure for disaster recovery and business continuity are increasingly locating secondary, even tertiary, backup facilities far from primary sites. In the past, large data volumes, high transaction speeds, slow networks, and poor distributed data management infrastructure made such distributed architectures difficult or impossible. However, with all-optical networks and distributed data services, it becomes feasible to consider replicating transformations from core systems to remote backup systems. Financial exchanges, telecommunication systems, reservation systems, and dispatch and scheduling operations performed by vendors are examples of systems that can be replicated in this way.

Stream-based distributed processing of sensor data. Large centralized detectors and distributed sensor nets in such fields as physics, astronomy, seismology, and national security produce high-volume data streams requiring extensive processing prior to analysis. Today, processing is performed offline, and data sets are prepared and distributed only periodically. Optical networks and data-integration services can enable a new paradigm in which even large data sets are continuously updated, so users always have access to the most current data. Data can also be merged from multiple sources, processed in real time, and analyzed for changes, alerts, and other significant patterns.

Figure 1. Major components and activities in a data-integration architecture. Happy users interact with various public or private registries, each providing a particular view of available data, to discover candidate data. They then dispatch requests (dark arrows) to access and/or explore (white circles) remote data. Each such request, along with resulting interstorage-system transfers (dashed and dotted arrows), is subject to resource management controls at various points (labeled R), typically under the control of security and policy services.

Sidebar: Data-Integration Technologies

The following projects are contributing to the development of data-integration middleware:

Data Web. This open source, Web-based software supports access, exploration, analysis, integration, and mining of remote and distributed data.

Earth System Grid. This U.S. government-funded project applies data Grid technologies to the integration of Earth system modeling data.

EU DataGrid. This European Union-funded project develops and applies data Grid technologies in high-energy physics and other domains.

Globus Toolkit. This open-source software provides the basic infrastructure for many Grid deployments worldwide.

Open Grid Services Architecture. This integration of Grid and Web services technologies defines standard interfaces and behaviors for distributed system integration and management.

OGSA Data Access and Integration. This Global Grid Forum working group defines service-oriented interfaces for manipulating distributed data sources.

Storage Resource Broker. This data access and federation technology provides data-mediation functions for data-intensive science.

Semantic Web. This extension of the Web aims to give information well-defined meaning.

Virtual Data Toolkit. This product of the National Science Foundation's GriPhyN project integrates data management and analysis technologies, including the Globus Toolkit, Condor, and Chimera.

Web Services. This widely used set of standards specifies how applications define, discover, and access network-accessible services (see /ws).

Requirements and Technologies

Distributed data sources can be diverse in their formats, schema, quality, access mechanisms, ownership, access policies, and capabilities. Overcoming this multi-tiered Tower of Babel to achieve distributed data integration requires technical solutions and standards in three closely related areas: data discovery and access; data exploration and analysis; and resource management, security, and policy (see Figure 1).

Data discovery and access. The first step in integrating data is discovering data that may be relevant, often through middleware that examines metadata. Metadata can be represented, federated, and accessed in a variety of ways. Relevant technologies include Web services mechanisms (for example, the Web Services Description Language specifications); Grid-enabled data access and integration services [2]; directory services (such as the Lightweight Directory Access Protocol); XML and relational databases; Semantic Web technologies [5]; and text-based Web search mechanisms applied to unstructured text-based metadata.

Once candidate data sets have been identified, the user's next step is to access the data to see whether it is actually worth investigating. Data formats, schema, and access mechanisms span a broad range. Widely adopted access mechanisms include: the Open source project for a Network Data Access Protocol (OPeNDAP) in the environmental community; Storage Resource Broker (SRB) [3] in scientific projects; Data Web protocols for data mining (the Data Space Transfer Protocol, or DSTP); and GridFTP for high-performance and striped data movement. The OGSA-based Data Access and Integration (OGSA-DAI) [2] standards emerging from the Global Grid Forum seek to integrate these and other approaches.

Data access can demand high transport performance and require parallel data access and movement. For example, if remote data is being delivered at a rate of 1Gbps, and a particular application's data-integration activity involves reading 10 local bytes per remote byte received and performing 100 operations per local byte read, then the application requires 10Gbps local read bandwidth and 1 Teraops/sec. of local computing to keep up with data delivery (a substantial and necessarily parallel resource). Striping data using multiple network connections linking pairs of nodes in distributed clusters is becoming a core technique in high-performance data transport [10].
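The resource requirements in the example above follow from simple rate arithmetic, sketched here. The function name and parameters are illustrative. Note one assumption about units: counting the 100 operations strictly per local byte gives about 0.125 Teraops/sec., while the round figure of 1 Teraops/sec. is reproduced when the factor of 100 is applied to the 10Gbps read rate itself (that is, per bit). Which reading was intended is not stated in the text, so the sketch reports both.

```python
from typing import Tuple


def local_requirements(remote_gbps: float,
                       local_bytes_per_remote_byte: float,
                       ops_per_local_byte: float) -> Tuple[float, float]:
    """Return (local read bandwidth in Gbps, operations per second).

    Operations are counted per local *byte* read.
    """
    remote_bytes_per_s = remote_gbps * 1e9 / 8
    local_bytes_per_s = remote_bytes_per_s * local_bytes_per_remote_byte
    local_read_gbps = local_bytes_per_s * 8 / 1e9
    ops_per_s = local_bytes_per_s * ops_per_local_byte
    return local_read_gbps, ops_per_s


read_gbps, ops_per_byte_reading = local_requirements(1.0, 10, 100)

# Alternative reading (assumption): 100 operations per local bit read.
ops_per_bit_reading = read_gbps * 1e9 * 100

print(f"local read bandwidth: {read_gbps:.1f} Gbps")
print(f"ops/sec, counted per byte: {ops_per_byte_reading:.3e}")
print(f"ops/sec, counted per bit:  {ops_per_bit_reading:.3e}")
```

Either way, the load is far beyond what a single node of the period could sustain, which is the point of the example: the local side of a high-bandwidth integration pipeline has to be parallel.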
The GridFTP extensions to the popular FTP protocol represent a standard approach to exploiting parallelism in data transfers, allowing multiple data channels to be coordinated via FTP control channel commands. Also relevant is the work on advanced protocols described in the article by Falk et al. in this section.

We anticipate the emergence of data access services supporting the flexible creation and manipulation of views on data sources (whether files or tables) and access to those views using a variety of operations, including database-style operations (such as SQL "select") and other more general operations (such as attribute selection, row selection via range queries, and record selection via sampling). Integrating these mechanisms with high-performance transport protocols remains a major unresolved problem.

Data exploration and analysis. Data rendered accessible can be analyzed in detail. Here, data exploration services are needed to address the challenges inherent in finding relevant data that can be combined with local data or with other remote data to achieve new discoveries. These services can provide basic statistical summaries, enable visual exploration of data, and support standard exploratory functions (such as building clusters and computing the regression of one variable on another).

Efficient integration of distributed data requires protocols and services for managing the data records constituting data archives. Unlike files of bits, data archives of records have attributes, attribute metadata, keys, and missing values. Mechanisms for providing attribute- and record-based access to remote and distributed data include: SQL-based access methods for relational data; protocols designed to work with remote data (such as the Data Web Transfer Protocol [10], OPeNDAP, and OGSA-DAI [2]); and protocols designed to work with remote and distributed semistructured data (such as XPath). Data Webs support the exploration and mining of distributed data using templated data-mining operations.

The transformation, analysis, and synthesis performed during data integration can be complex and computationally intensive. Data-transformation primitives incorporated into data middleware cannot capture arbitrary computations but can express many common data-preparation operations [10]. More general workflow services are also required to support the integration and scheduling of arbitrary user- and community-defined transformations. Users benefit from tools that record, organize, and exploit knowledge about how these activities derive new data from old. Virtual data systems aim to capture this information so as to allow reuse of generated data, explanation of data provenance, and other activities [8].

Figure 2. Computer scientists have a good understanding of how to perform relational joins when data is at rest in a single location. An important method for data integration is to join distributed data in motion to look for patterns across data sets. An experiment at the iGrid 2002 Conference in Amsterdam integrated (on the fly) climate data from Chicago with vegetation data in Amsterdam at transfer rates greater than 2.4Gbps, a land-speed record at the time; integration involved two distributed three-node clusters and employed the SABUL data-transport protocol. Panel labels: CCM3 data in Chicago (x_i, r_i, s_i, t_i); vegetation data in Amsterdam (y_i, r_i, s_i, t_i).

Resource management, security, and policy. Being familiar with today's bandwidth- and data-poor world, users often assume only standard schema and access methods are required to render remote data accessible. But the distributed analysis of large quantities of data is computationally (and bandwidth) intensive, and a high-performance Internet can expose popular data resources to the risk of essentially unlimited loads. Efficient petascale data integration can require the harnessing and coordinated management of multiple computational and network resources at multiple sites. Thus, clients (and brokers acting on their behalf) need to negotiate service level agreements (SLAs) with computers, storage systems, and networks. They also need to deploy applications able to achieve desired end-to-end performance across these resources, as well as monitor performance and adapt to performance problems at either the network or SLA level [7]. For example, an application might request an end-to-end optical network plus associated computing and storage resources, use the resources to integrate remote and local data, then release them.
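The request, use, and release pattern just described maps naturally onto a scoped reservation. The sketch below is purely illustrative: `ResourcePool` and its `reserve` method are hypothetical names, not part of OGSA, the Globus Toolkit, or any other middleware mentioned here, and a real broker would negotiate SLAs with remote services rather than decrement local counters.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class ResourcePool:
    """Hypothetical pool tracking how much of each resource is still free."""
    free: Dict[str, int] = field(default_factory=dict)

    @contextmanager
    def reserve(self, **wanted: int) -> Iterator[Dict[str, int]]:
        # Refuse the whole request if any single resource is short,
        # so a pipeline never starts with a partial allocation.
        for name, amount in wanted.items():
            if self.free.get(name, 0) < amount:
                raise RuntimeError(f"insufficient {name}")
        for name, amount in wanted.items():
            self.free[name] -= amount
        try:
            yield dict(wanted)
        finally:
            # Release everything, even if the integration step fails.
            for name, amount in wanted.items():
                self.free[name] += amount


pool = ResourcePool({"bandwidth_mbps": 10_000, "cpu_nodes": 6, "storage_tb": 5})

with pool.reserve(bandwidth_mbps=2_400, cpu_nodes=3, storage_tb=1) as grant:
    # Placeholder for the actual integration of remote and local data.
    print("holding", grant, "remaining", pool.free)

print("after release", pool.free)
```

The all-or-nothing check reflects the end-to-end nature of the request: bandwidth without matching compute and storage is of little use to a data-integration pipeline.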
Another effective optimization is to decouple data movement and computation so the data is staged to locations near (in terms of some access cost metric) to where it is required [12]. Data replication and distribution of data across the network [4] are also effective techniques.

Along with the data itself, the physical resources employed for data integration are frequently precious and thus subject to access controls. Data-integration middleware must therefore provide comprehensive security, policy, and resource management solutions. These solutions are required at multiple levels, ranging from the individual user ("Can I access this file?"), to the user community ("How many Gb-hours is this community allocated?"), and from the local ("Allocate me 1Gbps bandwidth"), to the end-to-end ("Allocate resources to achieve 10Gbps throughput for this pipeline"), to the global ("Ensure that the most popular data sets are replicated"). Security and policy solutions must address the concerns of both the institutions that own specific resources and the communities wishing to achieve distributed analysis.

Figure 3. The steps involved in galaxy cluster detection in Sloan data, showing (left) the pipeline and (right) the image data, a small directed acyclic graph (DAG), execution schedule for that DAG, and example output data. Pipeline stages labeled in the figure include gettargetregion, getbufferregion, brgsearch, bcgsearch, getcorebuffer, bcgcoalesce, getcluster, getgalaxies, and getcatalog; the plots are labeled "Galaxy cluster size distribution" and job[id] versus t[s].

Implications

Two examples from the sciences illustrate some practical implications and applications of these issues:

Joins of distributed Earth science data. The National Center for Atmospheric Research's Community Climate Model 3 (CCM3) helps research CO2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer, and nuclear winter. Scientists regularly want to integrate their data with CCM3 data. For example, they might wish to join their historical data about vegetation levels with CCM3 data to study the effect of global climate change on certain types of vegetation. A typical data-integration operation is to join a field x_i, say, temperature from the remote CCM3 data set that includes a key k_i consisting of a latitude-longitude-time triple (r_i, s_i, t_i) with a field y_i from the other data set representing a vegetation level for the same key (r_i, s_i, t_i). In this way, scientists can estimate functional relationships of the form y = f(k; x) to capture how vegetation levels change over time with changes in climatic variables.

In one study, the goal was to integrate data on the fly, without co-locating it, in order to obtain an estimate of whether such a relationship is probable, in which case more careful follow-up studies would be needed. The study was performed in conjunction with the iGrid 2002 conference in Amsterdam, The Netherlands, evaluating various algorithms for transporting and joining distributed streams of data indexed by latitude, longitude, and time (see Figure 2) [9]. One stream contained temperature and related CCM3 data, the other vegetation levels. One data set was located on a three-node cluster in Chicago, the other on a three-node cluster in Amsterdam. The DataSpace Data Web software was used to move the data across the Atlantic and perform a streaming join of it in Amsterdam; a parallel version of the Simple Available Bandwidth Utilization Library (SABUL) protocol was used for data transport, and DSTP was used to manage keys, metadata, and data.
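A streaming join of two keyed record streams, as described above, can be sketched with a classic sort-merge join over iterators. This is not the algorithm used in the iGrid experiment (the text says only that various algorithms were evaluated); it is a minimal illustration that assumes both streams arrive sorted by the key (r, s, t) and that keys are unique within each stream. All names and sample values are made up.

```python
from typing import Iterable, Iterator, Optional, Tuple

Key = Tuple[float, float, int]   # (latitude r, longitude s, time t)
Record = Tuple[Key, float]       # (key, field value)


def merge_join(left: Iterable[Record],
               right: Iterable[Record]) -> Iterator[Tuple[Key, float, float]]:
    """Join two key-sorted streams, yielding (key, x, y) for matching keys.

    Only one record per stream is held in memory at a time, so the join
    can run while data is still arriving.
    """
    left_it, right_it = iter(left), iter(right)
    l: Optional[Record] = next(left_it, None)
    r: Optional[Record] = next(right_it, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]
            l = next(left_it, None)
            r = next(right_it, None)
        elif l[0] < r[0]:
            l = next(left_it, None)   # left key has no partner; skip it
        else:
            r = next(right_it, None)  # right key has no partner; skip it


# Made-up sample records: temperature x (CCM3 side) and vegetation level y.
ccm3 = [((41.9, -87.6, 1), 10.5), ((41.9, -87.6, 2), 11.0), ((52.4, 4.9, 1), 8.2)]
vegetation = [((41.9, -87.6, 2), 0.61), ((52.4, 4.9, 1), 0.47), ((52.4, 4.9, 2), 0.50)]

for key, x, y in merge_join(ccm3, vegetation):
    print(key, x, y)
```

The joined (k, x, y) triples are the input one would then use to estimate a relationship of the form y = f(k; x); if the streams were not sorted, a hash-based join that buffers unmatched records would be the usual alternative.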
Data was moved at a rate greater than 900Mbps per node (2.4Gbps, or 1TB/hour, with a three-node cluster) and merged at approximately half that speed [9]. This experiment demonstrated conclusively that geographical distance need not be an obstacle to data integration.

Galaxy cluster identification in Sloan data. The Sloan Digital Sky Survey (SDSS) is a digital imaging survey that will, by 2007, have mapped a quarter of the sky in five colors with a sensitivity two orders of magnitude greater than previous large sky surveys. The SDSS data is being made available online as both a large collection (~10TB) of images and a smaller set of catalogs (~2TB) containing measurements on each of 250 million detected objects. SDSS is just one example of a growing set of digital sky survey projects that will soon yield an unprecedented international, distributed multi-petabyte collection of digital astronomical data [11].

Another recent experiment [1] showed how this online data could be integrated with distributed computing and storage resources to perform computationally intensive analysis of unprecedented scale. The challenge was to search the Sloan database for galaxy clusters, the largest gravitationally dominated structures in the universe. Software developed for the GriPhyN project, the so-called Virtual Data Toolkit, was used to plan, then manage the required workflow (see Figure 3), ultimately involving computational clusters at four sites across the U.S. This illustrates how even large-scale distributed data analysis tasks might become routine once appropriate infrastructure is in place.

Conclusion

The data tsunami already upon us offers great opportunities for new insight and knowledge but demands significant advances in middleware for integrating data from diverse distributed sources. That's why we have sought to explore here not only the state of the art but likely future directions for this middleware.

Data mining emerged from statistics as a new discipline during the past decade, as large data sets became more and more common and the need for new technologies to mine them became critical. In the coming decade, data integration will emerge from distributed computing and data mining, fueled by the increasing number of distributed data sets and enabled by improving network performance. Data integration promises to have at least as great an effect as data mining has had.

References

1. Annis, J., Zhao, Y., Voeckler, J., Wilde, M., Kent, S., and Foster, I. Applying Chimera virtual data concepts to cluster finding in the Sloan Sky Survey. In Proceedings of SC2002 (Baltimore, MD, Nov.). ACM Press, New York.
2. Atkinson, M., Chervenak, A., Kunszt, P., Narang, I., Paton, N., Pearson, D., Shoshani, A., and Watson, P. Data access, integration, and management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA.
3. Baru, C., Moore, R., Rajasekar, A., and Wan, M. The SDSC storage resource broker. In Proceedings of the 8th Annual IBM Centers for Advanced Studies Conference (Toronto, Canada, 1998).
4. Beck, M., Moore, T., and Plank, J. An end-to-end approach to globally scalable network storage. In Proceedings of ACM Sigcomm '02 (Pittsburgh, PA, Aug.). ACM Press, 2002.
5. Berners-Lee, T., Hendler, J., and Lassila, O. The Semantic Web. Sci. Am. 284, 5 (May 2001).
6. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., and Tuecke, S. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific data sets. J. Net. Comput. Applic. 23, 3 (July 2000).
7. Czajkowski, K., Foster, I., and Kesselman, C. Resource and service management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA.
8. Foster, I., Voeckler, J., Wilde, M., and Zhao, Y. The Virtual Data Grid: A new model and architecture for data-intensive collaboration. In Proceedings of the Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 5-8, 2003).
9. Grossman, R., Gu, Y., Hanley, D., Hong, X., Lillethun, D., Levera, J., Mambretti, J., Mazzucco, M., and Weinberger, J. Experimental studies using photonic data services at iGrid 2002. Future Gen. Comput. Syst. 19, 6 (2003).
10. Grossman, R. Standards and infrastructures for data mining. Commun. ACM 45, 8 (Aug. 2002).
11. Szalay, A. and Gray, J. The World-Wide Telescope. Science 293 (2001).
12. Thain, D., Basney, J., Son, S.-C., and Livny, M. The Kangaroo approach to data movement on the Grid. In Proceedings of the 10th IEEE International Symposium on High-Performance Distributed Computing (San Francisco, CA, Aug. 7-9). IEEE Computer Society Press, New York, 2001.

Ian Foster (foster@mcs.anl.gov) is associate division director and senior scientist at Argonne National Laboratory, Argonne, IL, and a professor of computer science at The University of Chicago.

Robert L. Grossman (grossman@uic.edu) is director of the Laboratory of Advanced Computing and the National Center for Data Mining at the University of Illinois at Chicago and president of the Two Cultures Group, Chicago.
This work is supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, SciDAC Program, U.S. Department of Energy, under Contract W ENG-38, and by the National Science Foundation under contract ITR (GriPhyN) and cooperative agreement ANI (OptIPuter).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM /03/1100 $5.00


More information

Classic Grid Architecture

Classic Grid Architecture Peer-to to-peer Grids Classic Grid Architecture Resources Database Database Netsolve Collaboration Composition Content Access Computing Security Middle Tier Brokers Service Providers Middle Tier becomes

More information

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago Globus Striped GridFTP Framework and Server Raj Kettimuthu, ANL and U. Chicago Outline Introduction Features Motivation Architecture Globus XIO Experimental Results 3 August 2005 The Ohio State University

More information

Conquering the Astronomical Data Flood through Machine

Conquering the Astronomical Data Flood through Machine Conquering the Astronomical Data Flood through Machine Learning and Citizen Science Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences http://spacs.gmu.edu/ The Problem:

More information

Resource Management on Computational Grids

Resource Management on Computational Grids Univeristà Ca Foscari, Venezia http://www.dsi.unive.it Resource Management on Computational Grids Paolo Palmerini Dottorato di ricerca di Informatica (anno I, ciclo II) email: palmeri@dsi.unive.it 1/29

More information

Data Management Challenges of Data-Intensive Scientific Workflows

Data Management Challenges of Data-Intensive Scientific Workflows Data Management Challenges of Data-Intensive Scientific Workflows Ewa Deelman, Ann Chervenak USC Information Sciences Institute, Marina Del Rey, CA 90292 deelman@isi.edu, annc@isi.edu Abstract Scientific

More information

Evolution of an Inter University Data Grid Architecture in Pakistan

Evolution of an Inter University Data Grid Architecture in Pakistan Evolution of an Inter University Data Grid Architecture in Pakistan Aslam Parvez Memon* and Shakil Akhtar** *SZABIST, Karachi, Pakistan **College of Information Technology, UAE University, UAE Abstract:

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Monitoring Data Archives for Grid Environments

Monitoring Data Archives for Grid Environments Monitoring Data Archives for Grid Environments Jason Lee, Dan Gunter, Martin Stoufer, Brian Tierney Lawrence Berkeley National Laboratory Abstract Developers and users of high-performance distributed systems

More information

DATA MANAGEMENT, CODE DEPLOYMENT, AND SCIENTIFIC VISUALLIZATION TO ENHANCE SCIENTIFIC DISCOVERY IN FUSION RESEARCH THROUGH ADVANCED COMPUTING

DATA MANAGEMENT, CODE DEPLOYMENT, AND SCIENTIFIC VISUALLIZATION TO ENHANCE SCIENTIFIC DISCOVERY IN FUSION RESEARCH THROUGH ADVANCED COMPUTING DATA MANAGEMENT, CODE DEPLOYMENT, AND SCIENTIFIC VISUALLIZATION TO ENHANCE SCIENTIFIC DISCOVERY IN FUSION RESEARCH THROUGH ADVANCED COMPUTING D.P. Schissel, 1 A. Finkelstein, 2 I.T. Foster, 3 T.W. Fredian,

More information

Argonne National Laboratory, Argonne, IL USA 60439

Argonne National Laboratory, Argonne, IL USA 60439 LEGS: A WSRF Service to Estimate Latency between Arbitrary Hosts on the Internet R Vijayprasanth 1, R Kavithaa 2,3, and Rajkumar Kettimuthu 2,3 1 Department of Information Technology Coimbatore Institute

More information

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21) CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21) Goal Develop and deploy comprehensive, integrated, sustainable, and secure cyberinfrastructure (CI) to accelerate research

More information

KNOWLEDGE GRID An Architecture for Distributed Knowledge Discovery

KNOWLEDGE GRID An Architecture for Distributed Knowledge Discovery KNOWLEDGE GRID An Architecture for Distributed Knowledge Discovery Mario Cannataro 1 and Domenico Talia 2 1 ICAR-CNR 2 DEIS Via P. Bucci, Cubo 41-C University of Calabria 87036 Rende (CS) Via P. Bucci,

More information

An IDL for Web Services

An IDL for Web Services An IDL for Web Services Interface definitions are needed to allow clients to communicate with web services Interface definitions need to be provided as part of a more general web service description Web

More information

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com

More information

Remote Sensitive Image Stations and Grid Services

Remote Sensitive Image Stations and Grid Services International Journal of Grid and Distributed Computing 23 Remote Sensing Images Data Integration Based on the Agent Service Binge Cui, Chuanmin Wang, Qiang Wang College of Information Science and Engineering,

More information

Cross-Matching Very Large Datasets

Cross-Matching Very Large Datasets 1 Cross-Matching Very Large Datasets María A. Nieto-Santisteban, Aniruddha R. Thakar, and Alexander S. Szalay Johns Hopkins University Abstract The primary mission of the National Virtual Observatory (NVO)

More information

IBM Global Technology Services November 2009. Successfully implementing a private storage cloud to help reduce total cost of ownership

IBM Global Technology Services November 2009. Successfully implementing a private storage cloud to help reduce total cost of ownership IBM Global Technology Services November 2009 Successfully implementing a private storage cloud to help reduce total cost of ownership Page 2 Contents 2 Executive summary 3 What is a storage cloud? 3 A

More information

Information Sciences Institute University of Southern California Los Angeles, CA 90292 {annc, carl}@isi.edu

Information Sciences Institute University of Southern California Los Angeles, CA 90292 {annc, carl}@isi.edu _ Data Management and Transfer in High-Performance Computational Grid Environments Bill Allcock 1 Joe Bester 1 John Bresnahan 1 Ann L. Chervenak 2 Ian Foster 1,3 Carl Kesselman 2 Sam Meder 1 Veronika Nefedova

More information

Information Architecture

Information Architecture The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to

More information

Data Aggregation and Cloud Computing

Data Aggregation and Cloud Computing Data Intensive Scalable Computing Harnessing the Power of Cloud Computing Randal E. Bryant February, 2009 Our world is awash in data. Millions of devices generate digital data, an estimated one zettabyte

More information

Compute and Storage Clouds Using Wide Area High Performance Networks

Compute and Storage Clouds Using Wide Area High Performance Networks Compute and Storage Clouds Using Wide Area High Performance Networks Robert L. Grossman Yunhong Gu Michael Sabala Wanzhi Zhang National Center for Data Mining University of Illinois at Chicago January

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

A Survey Study on Monitoring Service for Grid

A Survey Study on Monitoring Service for Grid A Survey Study on Monitoring Service for Grid Erkang You erkyou@indiana.edu ABSTRACT Grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide

More information

Grid Technology and Information Management for Command and Control

Grid Technology and Information Management for Command and Control Grid Technology and Information Management for Command and Control Dr. Scott E. Spetka Dr. George O. Ramseyer* Dr. Richard W. Linderman* ITT Industries Advanced Engineering and Sciences SUNY Institute

More information

Moving Beyond the Web, a Look at the Potential Benefits of Grid Computing for Future Power Networks

Moving Beyond the Web, a Look at the Potential Benefits of Grid Computing for Future Power Networks Moving Beyond the Web, a Look at the Potential Benefits of Grid Computing for Future Power Networks by Malcolm Irving, Gareth Taylor, and Peter Hobson 1999 ARTVILLE, LLC. THE WORD GRID IN GRID-COMPUTING

More information

An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications

An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications Rajkumar Buyya, Jonathan Giddy, and David Abramson School of Computer Science

More information

Grid Scheduling Dictionary of Terms and Keywords

Grid Scheduling Dictionary of Terms and Keywords Grid Scheduling Dictionary Working Group M. Roehrig, Sandia National Laboratories W. Ziegler, Fraunhofer-Institute for Algorithms and Scientific Computing Document: Category: Informational June 2002 Status

More information

Is Big Data a Big Deal? What Big Data Does to Science

Is Big Data a Big Deal? What Big Data Does to Science Is Big Data a Big Deal? What Big Data Does to Science Netherlands escience Center Wilco Hazeleger Wilco Hazeleger Student @ Wageningen University and Reading University Meteorology PhD @ Utrecht University,

More information

globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory

globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory Computation Institute (CI) Apply to challenging problems

More information

The glite File Transfer Service

The glite File Transfer Service The glite File Transfer Service Peter Kunszt Paolo Badino Ricardo Brito da Rocha James Casey Ákos Frohner Gavin McCance CERN, IT Department 1211 Geneva 23, Switzerland Abstract Transferring data reliably

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Planning for workflow construction and maintenance on the Grid

Planning for workflow construction and maintenance on the Grid Planning for workflow construction and maintenance on the Grid Jim Blythe, Ewa Deelman, Yolanda Gil USC Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292 USA {blythe,deelman,gil}@isi.edu

More information

How To Teach Data Science

How To Teach Data Science The Past, Present, and Future of Data Science Education Kirk Borne @KirkDBorne http://kirkborne.net George Mason University School of Physics, Astronomy, & Computational Sciences Outline Research and Application

More information

Next-Generation Networking for Science

Next-Generation Networking for Science Next-Generation Networking for Science ASCAC Presentation March 23, 2011 Program Managers Richard Carlson Thomas Ndousse Presentation

More information

Grid-based Distributed Data Mining Systems, Algorithms and Services

Grid-based Distributed Data Mining Systems, Algorithms and Services Grid-based Distributed Data Mining Systems, Algorithms and Services Domenico Talia Abstract Distribution of data and computation allows for solving larger problems and execute applications that are distributed

More information

IBM Solutions Grid for Business Partners Helping IBM Business Partners to Grid-enable applications for the next phase of e-business on demand

IBM Solutions Grid for Business Partners Helping IBM Business Partners to Grid-enable applications for the next phase of e-business on demand PartnerWorld Developers IBM Solutions Grid for Business Partners Helping IBM Business Partners to Grid-enable applications for the next phase of e-business on demand 2 Introducing the IBM Solutions Grid

More information

August 2009. Transforming your Information Infrastructure with IBM s Storage Cloud Solution

August 2009. Transforming your Information Infrastructure with IBM s Storage Cloud Solution August 2009 Transforming your Information Infrastructure with IBM s Storage Cloud Solution Page 2 Table of Contents Executive summary... 3 Introduction... 4 A Story or three for inspiration... 6 Oops,

More information

Digital libraries of the future and the role of libraries

Digital libraries of the future and the role of libraries Digital libraries of the future and the role of libraries Donatella Castelli ISTI-CNR, Pisa, Italy Abstract Purpose: To introduce the digital libraries of the future, their enabling technologies and their

More information

Artemis: Integrating Scientific Data on the Grid

Artemis: Integrating Scientific Data on the Grid Rattapoom Tuchinda, Snehal Thakkar, Yolanda Gil, and Ewa Deelman, Artemis: Integrating Scientific Data on the Grid, To Appear In the proceedings of the Sixteenth Innovative Applications of Artificial Intelligence,

More information

Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation

Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation Ian Foster,2 Jens Vöckler 2 Michael Wilde Yong Zhao 2 Mathematics and Computer Science Division, Argonne National

More information

MANAGING AND MINING THE LSST DATA SETS

MANAGING AND MINING THE LSST DATA SETS MANAGING AND MINING THE LSST DATA SETS Astronomy is undergoing an exciting revolution -- a revolution in the way we probe the universe and the way we answer fundamental questions. New technology enables

More information

Knowledge based Replica Management in Data Grid Computation

Knowledge based Replica Management in Data Grid Computation Knowledge based Replica Management in Data Grid Computation Riaz ul Amin 1, A. H. S. Bukhari 2 1 Department of Computer Science University of Glasgow Scotland, UK 2 Faculty of Computer and Emerging Sciences

More information

P u b l i c a t i o n N u m b e r : W P 0 0 0 0 0 0 0 4 R e v. A

P u b l i c a t i o n N u m b e r : W P 0 0 0 0 0 0 0 4 R e v. A P u b l i c a t i o n N u m b e r : W P 0 0 0 0 0 0 0 4 R e v. A FileTek, Inc. 9400 Key West Avenue Rockville, MD 20850 Phone: 301.251.0600 International Headquarters: FileTek Ltd 1 Northumberland Avenue

More information

Streaming Big Data Performance Benchmark for Real-time Log Analytics in an Industry Environment

Streaming Big Data Performance Benchmark for Real-time Log Analytics in an Industry Environment Streaming Big Data Performance Benchmark for Real-time Log Analytics in an Industry Environment SQLstream s-server The Streaming Big Data Engine for Machine Data Intelligence 2 SQLstream proves 15x faster

More information

Understanding the Value of In-Memory in the IT Landscape

Understanding the Value of In-Memory in the IT Landscape February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to

More information

Analyses on functional capabilities of BizTalk Server, Oracle BPEL Process Manger and WebSphere Process Server for applications in Grid middleware

Analyses on functional capabilities of BizTalk Server, Oracle BPEL Process Manger and WebSphere Process Server for applications in Grid middleware Analyses on functional capabilities of BizTalk Server, Oracle BPEL Process Manger and WebSphere Process Server for applications in Grid middleware R. Goranova University of Sofia St. Kliment Ohridski,

More information

A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments

A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments Athanasia Asiki, Katerina Doka, Ioannis Konstantinou, Antonis Zissimos and Nectarios Koziris National Technical

More information

Streaming Big Data Performance Benchmark. for

Streaming Big Data Performance Benchmark. for Streaming Big Data Performance Benchmark for 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner Static Big Data is a

More information

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Introduction For companies that want to quickly gain insights into or opportunities from big data - the dramatic volume growth in corporate

More information

Keys to Successfully Architecting your DSI9000 Virtual Tape Library. By Chris Johnson Dynamic Solutions International

Keys to Successfully Architecting your DSI9000 Virtual Tape Library. By Chris Johnson Dynamic Solutions International Keys to Successfully Architecting your DSI9000 Virtual Tape Library By Chris Johnson Dynamic Solutions International July 2009 Section 1 Executive Summary Over the last twenty years the problem of data

More information

Figure 1: Illustration of service management conceptual framework

Figure 1: Illustration of service management conceptual framework Dagstuhl Seminar on Service-Oriented Computing Session Summary Service Management Asit Dan, IBM Participants of the Core Group Luciano Baresi, Politecnico di Milano Asit Dan, IBM (Session Lead) Martin

More information

Assessment of RLG Trusted Digital Repository Requirements

Assessment of RLG Trusted Digital Repository Requirements Assessment of RLG Trusted Digital Repository Requirements Reagan W. Moore San Diego Supercomputer Center 9500 Gilman Drive La Jolla, CA 92093-0505 01 858 534 5073 moore@sdsc.edu ABSTRACT The RLG/NARA trusted

More information

THE CCLRC DATA PORTAL

THE CCLRC DATA PORTAL THE CCLRC DATA PORTAL Glen Drinkwater, Shoaib Sufi CCLRC Daresbury Laboratory, Daresbury, Warrington, Cheshire, WA4 4AD, UK. E-mail: g.j.drinkwater@dl.ac.uk, s.a.sufi@dl.ac.uk Abstract: The project aims

More information

16th International Conference on Control Systems and Computer Science (CSCS16 07)

16th International Conference on Control Systems and Computer Science (CSCS16 07) 16th International Conference on Control Systems and Computer Science (CSCS16 07) TOWARDS AN IO INTENSIVE GRID APPLICATION INSTRUMENTATION IN MEDIOGRID Dacian Tudor 1, Florin Pop 2, Valentin Cristea 2,

More information

T a c k l i ng Big Data w i th High-Performance

T a c k l i ng Big Data w i th High-Performance Worldwide Headquarters: 211 North Union Street, Suite 105, Alexandria, VA 22314, USA P.571.296.8060 F.508.988.7881 www.idc-gi.com T a c k l i ng Big Data w i th High-Performance Computing W H I T E P A

More information

NetApp Big Content Solutions: Agile Infrastructure for Big Data

NetApp Big Content Solutions: Agile Infrastructure for Big Data White Paper NetApp Big Content Solutions: Agile Infrastructure for Big Data Ingo Fuchs, NetApp April 2012 WP-7161 Executive Summary Enterprises are entering a new era of scale, in which the amount of data

More information

Enhanced Enterprise SIP Communication Solutions

Enhanced Enterprise SIP Communication Solutions Enhanced Enterprise SIP Communication Solutions with Avaya Aura and Allstream SIP Trunking An Allstream White Paper 1 Table Of Contents Beyond VoIP 1 SIP Trunking delivers even more benefits 1 Choosing

More information

High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand

High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand Hari Subramoni *, Ping Lai *, Raj Kettimuthu **, Dhabaleswar. K. (DK) Panda * * Computer Science and Engineering Department

More information

DiPerF: automated DIstributed PERformance testing Framework

DiPerF: automated DIstributed PERformance testing Framework DiPerF: automated DIstributed PERformance testing Framework Ioan Raicu, Catalin Dumitrescu, Matei Ripeanu Distributed Systems Laboratory Computer Science Department University of Chicago Ian Foster Mathematics

More information

IBM Netezza High Capacity Appliance

IBM Netezza High Capacity Appliance IBM Netezza High Capacity Appliance Petascale Data Archival, Analysis and Disaster Recovery Solutions IBM Netezza High Capacity Appliance Highlights: Allows querying and analysis of deep archival data

More information

Bringing Big Data into the Enterprise

Bringing Big Data into the Enterprise Bringing Big Data into the Enterprise Overview When evaluating Big Data applications in enterprise computing, one often-asked question is how does Big Data compare to the Enterprise Data Warehouse (EDW)?

More information

IBM Global Technology Services September 2007. NAS systems scale out to meet growing storage demand.

IBM Global Technology Services September 2007. NAS systems scale out to meet growing storage demand. IBM Global Technology Services September 2007 NAS systems scale out to meet Page 2 Contents 2 Introduction 2 Understanding the traditional NAS role 3 Gaining NAS benefits 4 NAS shortcomings in enterprise

More information

Big Data, Big Traffic. And the WAN

Big Data, Big Traffic. And the WAN Big Data, Big Traffic And the WAN Internet Research Group January, 2012 About The Internet Research Group www.irg-intl.com The Internet Research Group (IRG) provides market research and market strategy

More information

An Experience in Accessing Grid Computing Power from Mobile Device with GridLab Mobile Services

An Experience in Accessing Grid Computing Power from Mobile Device with GridLab Mobile Services An Experience in Accessing Grid Computing Power from Mobile Device with GridLab Mobile Services Abstract In this paper review the notion of the use of mobile device in grid computing environment, We describe

More information

Big Data Executive Survey

Big Data Executive Survey Big Data Executive Full Questionnaire Big Date Executive Full Questionnaire Appendix B Questionnaire Welcome The survey has been designed to provide a benchmark for enterprises seeking to understand the

More information

Web Service Robust GridFTP

Web Service Robust GridFTP Web Service Robust GridFTP Sang Lim, Geoffrey Fox, Shrideep Pallickara and Marlon Pierce Community Grid Labs, Indiana University 501 N. Morton St. Suite 224 Bloomington, IN 47404 {sblim, gcf, spallick,

More information

Grid Computing & the Open Grid Services Architecture. Ian Foster Argonne National Laboratory University of Chicago Globus Project

Grid Computing & the Open Grid Services Architecture. Ian Foster Argonne National Laboratory University of Chicago Globus Project Grid Computing & the Open Grid Services Architecture Ian Foster Argonne National Laboratory University of Chicago Globus Project Open Group Grid Conference, Boston, July 21, 2003 2 Is the Grid a) A collaboration

More information

Digital Preservation Lifecycle Management

Digital Preservation Lifecycle Management Digital Preservation Lifecycle Management Building a demonstration prototype for the preservation of large-scale multi-media collections Arcot Rajasekar San Diego Supercomputer Center, University of California,

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

Future Applications and the Network they will need. Brian E Carpenter Distinguished Engineer Internet Standards & Technology IBM & January 2002

Future Applications and the Network they will need. Brian E Carpenter Distinguished Engineer Internet Standards & Technology IBM & January 2002 Future Applications and the Network they will need Brian E Carpenter Distinguished Engineer Internet Standards & Technology IBM & January 2002 Topics The Internet today: as far as Web Services The Internet

More information

Application Frameworks for High Performance and Grid Computing

Application Frameworks for High Performance and Grid Computing Application Frameworks for High Performance and Grid Computing Gabrielle Allen Assistant Director for Computing Applications, Center for Computation & Technology Associate Professor, Department of Computer

More information

Four Ways High-Speed Data Transfer Can Transform Oil and Gas WHITE PAPER

Four Ways High-Speed Data Transfer Can Transform Oil and Gas WHITE PAPER Transform Oil and Gas WHITE PAPER TABLE OF CONTENTS Overview Four Ways to Accelerate the Acquisition of Remote Sensing Data Maximize HPC Utilization Simplify and Optimize Data Distribution Improve Business

More information

A Taxonomy and Survey of Grid Resource Planning and Reservation Systems for Grid Enabled Analysis Environment

A Taxonomy and Survey of Grid Resource Planning and Reservation Systems for Grid Enabled Analysis Environment A Taxonomy and Survey of Grid Resource Planning and Reservation Systems for Grid Enabled Analysis Environment Arshad Ali 3, Ashiq Anjum 3, Atif Mehmood 3, Richard McClatchey 2, Ian Willers 2, Julian Bunn

More information

DataGrids 2.0 irods - A Second Generation Data Cyberinfrastructure. Arcot (RAJA) Rajasekar DICE/SDSC/UCSD

DataGrids 2.0 irods - A Second Generation Data Cyberinfrastructure. Arcot (RAJA) Rajasekar DICE/SDSC/UCSD DataGrids 2.0 irods - A Second Generation Data Cyberinfrastructure Arcot (RAJA) Rajasekar DICE/SDSC/UCSD What is SRB? First Generation Data Grid middleware developed at the San Diego Supercomputer Center

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

The Accounting Information Sharing Model for ShanghaiGrid 1

The Accounting Information Sharing Model for ShanghaiGrid 1 The Accounting Information Sharing Model for ShanghaiGrid 1 Jiadi Yu, Minglu Li, Ying Li, Feng Hong Department of Computer Science and Engineering,Shanghai Jiao Tong University, Shanghai 200030, P.R.China

More information

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence

More information