Scientific Software as Workflows: From Discovery to Distribution

David Woollard 1,2, Nenad Medvidovic 2, Yolanda Gil 2,3, Chris A. Mattmann 1,2

1 NASA Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA
2 University of Southern California, 941 W. 37th Place, Los Angeles, CA
3 USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA

Abstract

Scientific workflows, models of computation that capture the orchestration of scientific codes to conduct in silico research, are gaining recognition as an attractive alternative to script-based orchestration. Despite growing interest, there are a number of fundamental challenges that researchers developing scientific workflow technologies must address, including developing the underlying science of scientific workflows. In this article, we present a broad classification of scientific workflow environments based on three major phases of in silico research, and we highlight active research projects that illustrate this classification. With our tripartite classification, based on the phases of in silico research, scientists will be capable of making more informed decisions regarding the adoption of particular workflow environments.

Keywords: D.2.6 Programming Environments/Construction Tools; D.2.17 Software Construction
1 Introduction

In the last 30 years, science has undergone a radical transformation. In place of test tubes and optics benches, chemists, physicists, and experimental scientists in a host of other disciplines are now using computer simulation as a means of discovering and validating new science. This so-called in silico experimentation has been fueled by a number of computer science advances, including the ability to archive and distribute massive amounts of data and to share hardware resources via the Grid [1].

An activity central to in silico experimentation is orchestration, the assemblage of scientific codes into an executable system with which to experiment. Orchestration is a complex task that involves data management (locating data, reformatting, etc.), managing input parameters for executables, and handling dependencies between processing elements. Scientific Workflow Models, which model high-level scientific tasks as stages of data processing (workflow stages) and the data dependencies between these stages, have been shown to be a very useful abstraction for scientific orchestration. Workflow Environments process these models, mapping stages onto computational (often Gridded) resources and planning the required data movements to satisfy dependencies in the model. This mapping is often referred to as a workflow instance or concrete workflow. Finally, a Workflow Engine steps through the instance, executing the stages.

At NASA's Jet Propulsion Laboratory, for example, scientists and engineers have developed workflow systems to process data from a number of instruments, satellites, and rovers. The recently launched Phoenix mission to Mars and an Earth-observing spectral radiometer mission called the Orbiting Carbon Observatory, set to launch in 2008, both use workflow systems to process raw instrument data into scientific products for the external science community.

Despite these early adopter efforts, scientific workflows have not yet reached a large user base. We have found that there are a number of significant challenges in this domain that have yet to be addressed, despite the large number of scientific workflow systems from which to choose [2]. One such challenge is that there is no standard workflow model, nor is there a fundamental science of scientific workflows. At a recent National Science Foundation workshop chaired by one of the authors, entitled Challenges of Scientific Workflows, this problem was cited as a challenge facing workflow researchers today [3]. One manifestation of this challenge, which we will explore in this article, is that requirements for scientific workflow systems vary significantly between applications, suggesting that a taxonomy of workflow systems based on the scientific research activities that they support is needed. In this article, we show that classifying workflow systems based on the phases of in silico research to which they are applicable forms a classification useful for the scientist interested in adopting workflow technology.
The discernible phases of in silico research, namely discovery (the rapid investigation of a scientific principle in which hypotheses are formed, tested, and iterated on rapidly), production (the application of a newly formed scientific principle to large data sets for further validation), and distribution (the sharing of resulting data for vetting by the larger scientific community), each have distinct scientific workflow requirements. By making scientists aware of these requirements, we intend to better inform their decision regarding the adoption of a particular workflow technology.

In addition to this classification, we will present several current research efforts to further our understanding of the science of scientific workflows in each of the three phases of in silico research. Not only do these research vignettes highlight salient research in this domain, but they also serve to illustrate our tripartite classification for readers.

2 How are Workflows Used?

In order to explore the requirements scientific workflow environments must meet, it is important that we first discuss the role of the workflow in scientific experimentation more explicitly. By framing our workflow requirements discussion with an understanding of scientists' current experimentation practices, we intend to clarify the process of evaluating particular workflow environments for given applications.

2.1 Orchestrating Experiments

Scientists have solved the problem of orchestration via a number of methods, including scientific programming environments such as IDL and Matlab. The predominant orchestration mechanism, however, is the script. Scripting languages such as Perl (and more recently Python) have allowed scientists to perform a number of common orchestration tasks, including: specifying overarching process control flow, running pre-compiled executables (via command line or run-time bindings), and reformatting input and output data. In common software engineering parlance, these scripts can be considered glue code between more well-defined software modules.
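As a rough illustration, the following Python sketch captures what such a glue script often looks like. The executable name (retrieve_spectra), its flags, the run directory, and the parameter values are hypothetical, invented here only to show the pattern rather than drawn from any system discussed in this article.

```python
#!/usr/bin/env python3
"""Hypothetical glue script orchestrating one experiment run (illustrative only)."""
import csv
import subprocess
import sys
from pathlib import Path

RUN_DIR = Path("runs/example")                          # where results are cataloged
PARAMS = {"threshold": "0.75", "iterations": "500"}     # input parameters for the executable

def main() -> int:
    RUN_DIR.mkdir(parents=True, exist_ok=True)

    # 1. Control flow: run a pre-compiled science code with this run's parameters.
    try:
        result = subprocess.run(
            ["./bin/retrieve_spectra",                  # hypothetical executable
             "--threshold", PARAMS["threshold"],
             "--iterations", PARAMS["iterations"],
             "--out", str(RUN_DIR / "spectra.raw")],
            capture_output=True, text=True, check=True)
    except FileNotFoundError:
        sys.stderr.write("retrieve_spectra not found (hypothetical executable)\n")
        return 1
    except subprocess.CalledProcessError as err:
        sys.stderr.write(err.stderr)
        return err.returncode

    # 2. Data reformatting: convert the raw output into CSV for the next processing step.
    raw_lines = (RUN_DIR / "spectra.raw").read_text().splitlines()
    with open(RUN_DIR / "spectra.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["wavelength", "radiance"])
        for line in raw_lines:
            writer.writerow(line.split())

    # 3. Record of results: note the parameters used alongside the outputs.
    (RUN_DIR / "params.txt").write_text(repr(PARAMS) + "\n")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Note how the script itself ends up holding the experiment's setup (the parameters), its procedure (the control flow), and the record of results, which is exactly why such glue code becomes an important artifact in its own right.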
There are a number of problems with script-based orchestration, however, that make workflow modeling an attractive alternative. Because the orchestrating script captures the experiment, that is, its setup (in the form of input parameters), its procedure (the control flow, or execution steps), and its record of results (formatting and cataloging outputs), it is integral to the overall scientific effort and is an important artifact in its own right. Yet scripts are difficult to maintain and easily obfuscate the original intention of the developer. Lack of inherent structure or design guidelines makes script-based orchestration a largely ad hoc process. Unlike traditional glue code, orchestration in scientific workflows cannot be treated as throw-away!

2.2 A Description of Workflow Environments

Workflow environments are used to develop and process workflow models. Each of the process elements of the workflow model is linked to an executable that forms a workflow stage. A workflow engine steps through the workflow model, executing these stages and managing the I/O requirements of each stage as specified by the model. This is akin to the control flow functionality of script-based orchestration.

Figure 1: A basic workflow environment.

As illustrated in Figure 1, a workflow environment consists of not only a workflow engine but also a number of ancillary services. These ancillary services envelop many of the non-control-flow aspects of script-based orchestration, including resource discovery for accessing data, fault handling, and data provenance cataloging. While resource discovery services sometimes handle data reformatting issues, it is also common practice to use a preprocessing workflow stage to handle format mismatches.
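A minimal sketch of this stepping behavior is shown below in Python. The Stage abstraction, the dependency-driven loop, and the use of in-memory callables are simplifying assumptions made for illustration; they are not the interface of any particular workflow engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Stage:
    """One workflow stage: a named executable step with declared data dependencies."""
    name: str
    run: Callable[[Dict[str, object]], Dict[str, object]]  # consumes/produces named data products
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

def execute(stages: List[Stage], initial_data: Dict[str, object]) -> Dict[str, object]:
    """Step through the model, running each stage once all of its inputs exist."""
    data = dict(initial_data)
    pending = list(stages)
    while pending:
        ready = [s for s in pending if all(i in data for i in s.inputs)]
        if not ready:
            raise RuntimeError("unsatisfiable data dependencies: " +
                               ", ".join(s.name for s in pending))
        for stage in ready:
            print(f"executing stage {stage.name}")
            produced = stage.run({k: data[k] for k in stage.inputs})
            data.update({k: produced[k] for k in stage.outputs})
            pending.remove(stage)
    return data

# Usage: a two-stage model (calibrate -> bin) over a small in-memory dataset.
if __name__ == "__main__":
    calibrate = Stage("calibrate",
                      lambda d: {"calibrated": [x * 0.98 for x in d["raw"]]},
                      inputs=["raw"], outputs=["calibrated"])
    bin_stage = Stage("bin",
                      lambda d: {"binned": sum(d["calibrated"]) / len(d["calibrated"])},
                      inputs=["calibrated"], outputs=["binned"])
    print(execute([bin_stage, calibrate], {"raw": [1.0, 2.0, 3.0]}))
```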
3 Characterizing Workflow System Requirements

While good taxonomies of workflow environments exist [2, 4], each takes a bottom-up approach, differentiating workflow approaches by the type of model used (e.g., Petri nets versus directed acyclic graphs) or by the particular strategy used in computational resource allocation. While this type of characterization is beneficial to the expert audience, we contend that a top-down taxonomy based on the scientific goals addressed is more useful to the scientist interested in adopting workflow technologies.

3.1 Phases of in silico Science

Like all scientific endeavors, in silico science has distinctive phases. To take an example from biochemistry, in the early 1920s Macleod and Banting were the first doctors to successfully isolate and extract insulin and are credited with its discovery. Once various efforts to refine their understanding of the structure and properties of insulin were completed, scientists developed a number of techniques for producing insulin in large quantities, eventually settling on genetically engineering synthetic insulin in the late 1970s.

Figure 2: Phases of the in silico process.

In Figure 2, we have described three phases of in silico science that mostly mirror the processes of in vivo and in vitro science. These phases are discovery, production, and distribution.

Discovery in this context describes the phase of development in which algorithms and techniques are tested, the scientific solution-space is explored, and scientists arrive at a process which yields the desired result. This is what Kepner has described as the lone researcher process [5]. We will describe production, akin to the chemical engineering process that was established to synthesize insulin in large quantities, as the engineering and scientific effort to reproduce the process established in the discovery phase on a large scale. Finally, distribution is a phase in which results of individual processes are shared, validated, and new research goals are formulated.

We will now discuss each of these phases in greater detail, describing the discernible characteristics of each type of workflow environment as well as the high-level requirements they support.
3.2 Discovery Workflow Characteristics

Discovery workflows are rapidly re-parameterized, allowing the scientist to explore alternatives quickly in order to iterate over the experiment until hypotheses are validated. Discovery workflow environments support this type of dynamic experimentation. The high-level requirements that a discovery workflow environment should satisfy include aiding scientists in the formulation of abstract workflow models. Discovery workflow environments should also transform these abstract models into workflow instances.

3.3 Production Workflow Characteristics

As in the case of producing vast quantities of insulin, production workflow environments are focused on repeatability. These environments should be capable of staging remote, high-volume data sets, cataloging results, and logging or handling faults. Unlike discovery workflows, production workflows must incorporate substantial data management facilities. Scientists using production workflow environments care less about the means of abstract workflow representation than they do about the ability to automatically reproduce an experiment. The high-level requirements of a production workflow environment include handling the non-orchestration aspects of workflow formation, such as data resource discovery and data provenance recording. Additionally, production workflow environments should aid the scientist in converting existing scientific executables into workflow stages, including providing means of accessing ancillary workflow services.

3.4 Distribution Workflow Characteristics

Unlike both discovery and production workflow environments, distribution workflow environments focus on the retrieval of data. Distribution workflows are used to combine and reformat data sets and deliver these data sets to remote scientists for further analysis. The requirements for distribution workflow environments include support for rapidly specifying abstract workflows (often focusing on graphical techniques for workflow specification) and for remotely executing the resulting abstract workflows.

3.5 Choosing from the Workflow System Spectrum

It is important to note that many existing workflow systems share the traits of multiple classes of scientific workflow as we have presented them.
Additionally, multiple scientific applications the authors have studied fall into more than one of these categories. While both discovery and distribution workflows must aid the scientist in developing abstract workflows, the emphasis of distribution workflows is on the act of specification rather than on the dynamism of the resulting workflow. Likewise, production workflow environments and discovery environments both manage data, although production environments must do so with much greater autonomy in order to process large data sets. Rather than present a clean taxonomy for its own sake, our principal goal in this work is to aid scientists in understanding their own scientific workflow requirements, showing how this approach can help them choose a workflow environment that caters to their scientific goals.

4 Current Workflow Research

In order to illustrate this classification, we will present several current research projects being conducted by the authors. Our intent in this section is two-fold: in showing how each of these projects addresses the high-level requirements of a particular class of workflow application, we will not only (1) highlight salient research topics in the area, but also (2) illustrate our classification with real-world workflow environments.

4.1 Workflow Discovery: Wings

Workflow discovery is an exploration activity for scientists, whether they are systematically trying alternatives or haphazardly looking for a surprising result. The discovery of useful workflows is accomplished by trying different combinations and configurations of components as well as using new datasets or new components [4]. An environment for workflow discovery must:

1. assist users in finding components, workflows, and datasets based on desired characteristics or functions;
2. validate the newly created workflows with respect to requirements and constraints of both components and datasets; and
3. facilitate the evolution and versioning of previously created workflows.

Wings [6, 7] is an example of such a workflow discovery system. Wings represents workflows using semantic metadata properties of both components and datasets, represented in the Web Ontology Language (OWL).
Wings uses (1) workflow representations that are expressed in a manner that is independent of the execution environment and (2) the Pegasus mapping and execution system, which submits and monitors workflow executions [6]. In addition, Wings has constructs that express in a compact manner the parallel execution of components to concurrently process subsets of a given dataset [7]. Codes are encapsulated so that any execution requirements (such as target architectures or software library dependencies), as well as input/output dataset requirements and properties, are explicitly stated. Code parameters that can be used to configure the codes (for scientists, these may correspond to different models in the experiment) are also represented explicitly and recorded within the workflow representation. A model of each code is created to express all such requirements, so they can be used flexibly by the workflow system as workflow components. Component classes are created to capture common properties of alternative codes and their configurations. The methodology for designing workflows, encapsulating components, and formalizing metadata properties is outlined in [8].

Figure 3 shows how Wings represents workflows in data-independent and execution-independent structures. For example, parallel computations over datasets, sketched on the top left, are represented compactly in the Wings workflow on the right. The component representations shown on the bottom left express whether they can process collections of data (e.g., component C-many) or individual datasets (e.g., component C-one) as input. Wings exploits these constraints and represents the parallel computations as a collection of components (e.g., node NC1) that are expanded dynamically into individual executable jobs depending on the size of the dataset bound to the input variables (e.g., to the variable Coll-G). For each of the new data products (e.g., those bound to Coll-Z), Wings generates metadata descriptions based on the metadata of the original datasets and the models of the components.

Using these high-level and semantically rich representations, Wings can reason about component and dataset properties to assist the user in composing and validating workflows. Wings also uses these representations to generate the details that Pegasus needs to map the workflows to the execution environment, and to generate metadata descriptions of new datasets that result from workflow execution and that help track the provenance of new data products [6].

To facilitate the evolution of workflows, Wings expects each workflow to have its own namespace and ontology definitions. All workflows are represented in OWL, and follow the conventions of web markup languages in importing ontologies and namespaces. Each ontology and namespace has a unique identifier and is never modified or deleted; new versions are given new identifiers. Related workflows can import shared ontology definitions and refer to common namespaces. Current research is exploring extensions to these capabilities for more manageable version tracking, particularly in collaborative settings.
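The following Python sketch illustrates only the dynamic-expansion and metadata-propagation ideas described above; Wings itself expresses these constraints semantically in OWL, so the classes, field names, and the component names borrowed from the Figure 3 example (C-one, Coll-G) are used here purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Dataset:
    name: str
    metadata: Dict[str, str]

@dataclass
class Component:
    """Model of a code: its constraint on whether it consumes whole collections
    of data or individual datasets (compare C-many and C-one in Figure 3)."""
    name: str
    takes_collection: bool

def expand_node(component: Component, collection: List[Dataset]) -> List[Dict]:
    """Expand one abstract node into executable jobs, depending on the component's
    constraints and the size of the collection bound to its input variable."""
    if component.takes_collection:
        return [{"component": component.name,
                 "inputs": [d.name for d in collection]}]        # one job over the collection
    return [{"component": component.name, "inputs": [d.name]}    # one job per dataset
            for d in collection]

def output_metadata(component: Component, inputs: List[Dataset]) -> Dict[str, str]:
    """Generate metadata for a new data product from the metadata of the original
    datasets and the model of the component that produced it."""
    meta: Dict[str, str] = {}
    for d in inputs:
        meta.update(d.metadata)
    meta["derived_by"] = component.name
    return meta

if __name__ == "__main__":
    coll_g = [Dataset(f"g{i}", {"region": "pacific"}) for i in range(3)]  # bound to Coll-G
    c_one = Component("C-one", takes_collection=False)
    jobs = expand_node(c_one, coll_g)       # the collection node expands into three jobs
    for job, d in zip(jobs, coll_g):
        print(job, "->", output_metadata(c_one, [d]))
```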
Figure 3: Wings represents workflows in data-independent and execution-independent structures.
Figure 4: An architecture for scientific workflow stages including connectors to workflow services.

4.2 Production Workflow Environment: SWSA

Production workflow environments, as previously stated, incorporate significant data management technology in order to reproduce scientific workflows on very large data sets. Production environments must not only handle the scientific requirements of workflow systems but must also manage the significant engineering challenges of automatically processing large data sets. Integration of existing scientific codes is a more significant challenge in production workflow environments than in other workflow environments due to the demands of automatic processing. Locating remote data sets, handling faults appropriately, cataloging the large volumes of data produced by the system, and managing complex resource mappings of workflow stages onto Grid and MultiGrid environments require scientific workflow stages to access a host of ancillary workflow services.

The Scientific Workflow Software Architecture, or SWSA, is a current research effort at JPL and USC that aims to provide a software architecture for workflow stages that clearly delineates scientific code from the engineering aspects of the workflow stage. SWSA provides a componentization of the scientific algorithm, separating out workflow service access into specialized software connectors. This separation of concerns allows both scientists and software engineers to converse about workflow stage design without requiring that each domain practitioner become expert in both scientific and engineering fields [9].

SWSA uses a decomposition of an existing scientific code into scientific kernels. Kernels, like code kernels in high performance computing, are code snippets that implement a single scientific concept. In a graph-theoretic sense, a kernel is a portion of the call dominance tree for the source code's basic blocks that has a single source and a single sink; that is, it has internal scope and single entry and exit points. In addition to the scientific kernels, the control/data flow of the original source code, in the form of a graph of the calls made to execute each kernel (its control flow) and the data dependencies between each kernel (its data flow), must also be generated. The current SWSA approach requires manual decomposition, though semi-automatic approaches based on software architectural recovery are being explored.

In the second step in the architecting process, SWSA wraps these kernels in a component interface, creating Modules as seen in Figure 4. The control flow/data flow of the original program is implemented in a hierarchically-composed exogenous connector as used in [10]. Finally, calls to ancillary workflow services are made by an invoking connector. An engineer can manage the engineering requirements of the scientific workflow via custom handlers that access services such as data discovery, data provenance registries, and fault registries.
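A small Python sketch of this structure follows. The two kernels, the connector classes, and the service calls are hypothetical and deliberately simplified; this is not the SWSA implementation, only an illustration of how kernels, an exogenous connector for control/data flow, and an invoking connector for ancillary services can be kept separate.

```python
from typing import Callable, List

# Two hypothetical scientific kernels: each implements a single scientific concept
# and knows nothing about workflow services.
def despike(samples: List[float]) -> List[float]:
    return [s for s in samples if abs(s) < 10.0]

def average(samples: List[float]) -> float:
    return sum(samples) / len(samples)

class InvokingConnector:
    """Stands in for access to ancillary workflow services (provenance, fault registries)."""
    def record_provenance(self, module: str, detail: str) -> None:
        print(f"[provenance] {module}: {detail}")
    def report_fault(self, module: str, error: Exception) -> None:
        print(f"[fault] {module}: {error}")

class Module:
    """A kernel wrapped in a component interface; service calls go through the connector."""
    def __init__(self, name: str, kernel: Callable, services: InvokingConnector):
        self.name, self.kernel, self.services = name, kernel, services
    def execute(self, data):
        try:
            result = self.kernel(data)
            self.services.record_provenance(self.name, f"produced {result!r}")
            return result
        except Exception as err:
            self.services.report_fault(self.name, err)
            raise

def exogenous_connector(modules: List[Module], data):
    """Control/data flow of the original program, held outside the kernels themselves."""
    for module in modules:
        data = module.execute(data)
    return data

if __name__ == "__main__":
    services = InvokingConnector()
    stage = [Module("despike", despike, services), Module("average", average, services)]
    print(exogenous_connector(stage, [1.2, 3.4, 99.0, 2.2]))
```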
4.3 Distribution Workflow Environment: OODT

Once a production workflow has generated the necessary scientific information, that information needs to be appropriately disseminated to the scientific community to effectively communicate the results. As has been noted previously [11], as the study sample increases, so does the chance of discovery. In the Internet age, scientists have begun to lean on Internet technologies and large-scale data movement capabilities to take advantage of this principle, sharing their raw science data and analyses with colleagues throughout the world.

Distribution workflows are those that enable scientific results to be spread to the science community, leveraging data movement technologies and (re-)packaging technologies to move data to science users. This process is underpinned by distribution scenarios, which specify important properties of a data distribution (e.g., the total volume to be delivered, the number of delivery intervals, the number of users to send data to, etc.). As shown in Figure 5, distribution workflows typically have four distinct workflow stages (a small code sketch of this pipeline follows the list):

Data Access: The scientific information produced by a production workflow must be accessed (e.g., from a database management system, a file system, or some other repository) before it can be disseminated.

Data Sub-setting: Once accessed, data may need to be (re-)packaged, or subsetted, before being delivered to its external science customers.

Movement Technology Selection: Before data distribution/dissemination, an appropriate data movement technology (e.g., FTP, HTTP, Bittorrent) must be selected.

Data Distribution: Once all prior stages are complete, the data can be transferred to the science community.

Figure 5: Distribution Workflow Example.
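A minimal end-to-end sketch of the four stages is given below in Python. The repository path, the file pattern, the "spice" keyword, and the volume threshold are invented for illustration and do not reflect any particular distribution system.

```python
from pathlib import Path
from typing import List

def data_access(repository: Path) -> List[Path]:
    """Stage 1: locate the products a production workflow left in some repository."""
    return sorted(repository.glob("*.dat"))

def data_subsetting(products: List[Path], keyword: str) -> List[Path]:
    """Stage 2: (re-)package only the products a given science customer asked for."""
    return [p for p in products if keyword in p.name]

def movement_technology_selection(total_bytes: int) -> str:
    """Stage 3: pick a transfer mechanism (a richer, scenario-driven version of this
    decision is sketched after the PDS example below)."""
    return "gridftp" if total_bytes > 1024**3 else "ftp"

def data_distribution(subset: List[Path], technology: str) -> None:
    """Stage 4: hand the packaged subset to the selected transfer mechanism."""
    for p in subset:
        print(f"would transfer {p} via {technology}")

if __name__ == "__main__":
    products = data_access(Path("archive"))          # hypothetical repository location
    subset = data_subsetting(products, "spice")
    technology = movement_technology_selection(sum(p.stat().st_size for p in subset))
    data_distribution(subset, technology)
```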
To illustrate the complexity of a distribution workflow, consider this representative use case: more than 100 gigabytes (GB) of Planetary Data System (PDS) SPICE kernel datasets (which describe planetary mission navigation information) and PDS data sets need to be subsetted and sent across the wide-area network (WAN) from the PDS Engineering Node at NASA's Jet Propulsion Laboratory (JPL) to the European Space Agency (ESA), and from JPL across the local-area network (LAN) to the (local) PDS navigation (NAIF) node. The data must be delivered securely (using port-based firewall pass-through) to over 100 different engineering analysts and project managers, totaling ten distinct types of users. The consensus amongst the users is that they would like the data delivered in no more than 1000 delivery intervals, and due to the underlying network capacities of the dedicated LAN to the NAIF node and the WAN to ESA, the data must be delivered in 1 MB to 1 GB chunks during each interval. The data must reach its destination quickly in both cases (due to the sensitivity of the information, as well as the urgency with which it is needed), so scalability and efficiency are the most desired properties of the transfer.

Clearly, such scenarios involve the dissemination of large amounts of information from data producers (e.g., the PDS Engineering Node) to data consumers (e.g., ESA). There are a number of emerging large-scale data dissemination technologies available that purport to transfer data from producers to consumers in an efficient fashion and to satisfy requirements such as those induced by our example scenario. In our experience, however, some technologies are more amenable to different classes of distribution scenarios than others: for example, Grid [1] technologies (such as GridFTP) are particularly well suited to handle fast, reliable, highly-parallel data transfer over the public Internet; however, this benefit comes at the expense of running heavyweight infrastructure (e.g., security trust authorities, Web servers, metadata catalog services, etc.). On the other hand, peer-to-peer technologies, such as Bittorrent, are inherently more lightweight and as fast as grid technologies, but at the expense of reliability and dependability. Distribution workflows must be able to decide, either with user feedback or autonomously, the appropriate data movement technology and dissemination pattern (e.g., peer-to-peer, client/server) to employ in order to satisfy use cases such as the PDS data movement problem described above.
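As a hedged illustration of this kind of decision, the following sketch maps a scenario's gross properties to a technology and dissemination pattern. The thresholds and rules are invented for illustration only and are not the decision model of any framework discussed here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DistributionScenario:
    """Gross properties of a distribution scenario, loosely mirroring the PDS example."""
    total_volume_gb: float
    delivery_intervals: int
    num_users: int
    firewall_passthrough: bool
    priority: str                    # "scalability" or "reliability"

def choose(scenario: DistributionScenario) -> Tuple[str, str]:
    """Return a (technology, dissemination pattern) pair for the scenario."""
    if scenario.priority == "reliability" or scenario.firewall_passthrough:
        # heavyweight grid infrastructure buys reliable, secure, highly-parallel transfer
        return "GridFTP", "client/server (grid)"
    if scenario.num_users > 50:
        # peer-to-peer scales with many consumers, at some cost in dependability
        return "Bittorrent", "peer-to-peer"
    return "HTTP", "client/server"

if __name__ == "__main__":
    pds_like = DistributionScenario(total_volume_gb=100, delivery_intervals=1000,
                                    num_users=100, firewall_passthrough=True,
                                    priority="scalability")
    print(choose(pds_like))
```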
Our recent work has dealt with understanding how to construct distribution workflows that trade off the above use cases and technology choices, in the context of the Object Oriented Data Technology (OODT) data distribution framework [12] at JPL. OODT provides services for data (re-)packaging, subsetting, and delivery of large amounts of information, across heterogeneous organizational structures, to users across the world. In addition to OODT, our recent work at USC has led to the construction of the Data-Intensive Software COnnectors (DISCO) decision-making framework [13], a software extension built on top of OODT that uses architectural information and data distribution metadata to autonomously decide the appropriate data movement technology to employ for a distribution scenario, or class of scenarios. In our experience, the combination of a data movement infrastructure and a decision-making framework that is able to choose the underlying dissemination pattern and technology to satisfy scenario requirements is essential to successfully design and implement distribution workflows.

5 Summary and Future Work

In this article, we have presented a classification of scientific workflow environments based on the type of scientific workflow they support. This is a top-down classification based on the type of scientific effort the user is interested in accomplishing, in contrast to existing bottom-up approaches.

In our future work, we plan to continue fundamental research into the science of scientific workflow technology. This includes developing more rigorous workflow models that can help to bridge between discovery and production workflow environments. Additionally, we continue to develop approaches that will ease adoption of workflow technology, including improved integration of legacy scientific codes, better methodologies for adopting workflow modeling over script-based orchestration, and decision-making frameworks for delivering resulting data products to the external scientific community.
Acknowledgments

We gratefully acknowledge support from the National Science Foundation under grant CCF. Additionally, a portion of the research referenced in this article was conducted at the Jet Propulsion Laboratory, managed by the California Institute of Technology under a contract with the National Aeronautics and Space Administration.

References

[1] I. Foster et al. The physiology of the grid: An open grid services architecture for distributed systems integration.
[2] J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. Special Issue on Scientific Workflows, ACM SIGMOD Record, 34(3).
[3] Y. Gil et al. Examining the challenges of scientific workflows. IEEE Computer, 40(12).
[4] Y. Gil. Workflow composition. In D. Gannon, E. Deelman, M. Shields, and I. Taylor, editors, Workflows for e-Science. Springer Verlag.
[5] J. Kepner. HPC productivity: An overarching view. In J. Kepner, editor, IJHPCA Special Issue on HPC Productivity, volume 18.
[6] J. Kim et al. Provenance trails in the Wings/Pegasus workflow system. To appear in Concurrency and Computation: Practice and Experience, Special Issue on the First Provenance Challenge.
[7] Y. Gil et al. Wings for Pegasus: Creating large-scale scientific applications using semantic representations of computational workflows. In Proceedings of the 19th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI), July.
[8] Y. Gil, P. A. Gonzalez-Calero, and E. Deelman. On the black art of designing computational workflows. In Proceedings of the Second Workshop on Workflows in Support of Large-Scale Science (WORKS 07), in conjunction with the IEEE International Symposium on High Performance Distributed Computing, June.
[9] D. Woollard. Supporting in silico experimentation via software architecture. Submitted to the 30th International Conference on Software Engineering Doctoral Symposium (ICSE08), May.
[10] K.-K. Lau, M. Ornaghi, and Z. Wang. A software component model and its preliminary formalisation. In Proceedings of the 4th International Symposium on Formal Methods for Components and Objects. Springer-Verlag.
[11] D. Crichton et al. A distributed information services architecture to support biomarker discovery in early detection of cancer. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, page 44, December.
[12] C. Mattmann, D. Crichton, N. Medvidovic, and S. Hughes. A software architecture-based framework for highly distributed and data intensive scientific applications. In Proceedings of the 28th International Conference on Software Engineering (ICSE06), Software Engineering Achievements Track, May.
[13] C. Mattmann, D. Woollard, N. Medvidovic, and R. Mahjourian. Software connector classification and selection for data-intensive systems. In Proceedings of the ICSE 2007 Workshop on Incorporating COTS Software into Software Systems: Tools and Techniques (IWICSS), May.
Author Biographies

David Woollard is a Cognizant Design Engineer at NASA's Jet Propulsion Laboratory and a third-year PhD candidate at the University of Southern California. David's research interests are in software architectural support for scientific computing, specifically support for integration of legacy scientific code into workflow environments. He is a primary developer on JPL's OODT project. He is a member of the IEEE. Corresponding Address: NASA Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA. [email protected]

Nenad Medvidović is an Associate Professor in the Computer Science Department at the University of Southern California. His work focuses on software architecture modeling and analysis; middleware facilities for architectural implementation; product-line architectures; architectural styles; and architecture-level support for software development in highly distributed, mobile, resource-constrained, and embedded computing environments. He is a member of the ACM and the IEEE Computer Society. Corresponding Address: University of Southern California, 941 W. 37th Place, Los Angeles, CA. [email protected]

Yolanda Gil is associate division director for research of the Intelligent Systems Division at the University of Southern California's Information Sciences Institute and a research associate professor in the Computer Science Department. Her research interests include intelligent interfaces for knowledge-rich problem solving and scientific workflows in distributed environments. Corresponding Address: USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA. [email protected]

Chris A. Mattmann received the B.S., M.S., and Ph.D. degrees in Computer Science from the University of Southern California in Los Angeles in 2001, 2003, and 2007, respectively. Chris's research interests are primarily in the areas of software architecture and large-scale data-intensive systems.
In addition to his research work, Chris is a Member of Key Staff at NASA's Jet Propulsion Laboratory, working in the area of Instrument and Science Data Systems on various Earth science missions and informatics tasks. He is a member of the IEEE. Corresponding Address: NASA Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA. [email protected]
