SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF BIOINFORMATIC DATA AND APPLICATIONS. A Dissertation. Submitted to the Graduate School


SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF BIOINFORMATIC DATA AND APPLICATIONS

A Dissertation Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Xiaorong Xiang, B.S., M.S.

Gregory R. Madey, Director

Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
April 2007

Abstract

by Xiaorong Xiang

Service-oriented architecture (SOA) is a new paradigm for distributed computing that originated in industry. It is recognized as a promising architecture for application integration inside and across organizations. Since their introduction, semantic web and web services technologies have attracted increasing interest for the implementation of e-science infrastructures. In this dissertation, we survey current research trends and challenges for adopting SOA in general. We present a practical experiment of building a service-oriented system for data integration and analysis using current web services technologies and bioinformatics middleware. The system is enhanced with an ontological model for semantic annotation of services and data. It demonstrates that adopting SOA in the e-science field can accelerate the scientific research process. A new methodology and an enhanced system design are proposed to facilitate the reuse of workflows and verified knowledge.

DEDICATION

To my parents, my husband, and my son

CONTENTS

FIGURES
TABLES
ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
  Main contributions of the dissertation
  Organization of the dissertation

CHAPTER 2: RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED COMPUTING
  Introduction
  Overview of related concepts and technologies
  Web services
  Semantic web
  Grid computing
  Peer-to-peer computing
  Issues in the service-oriented computing
  Service description
  Service discovery
  Service composition
  Service execution
  Service-oriented computing in e-science
  Conclusion

CHAPTER 3: A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS ENVIRONMENT FOR BIOINFORMATICS RESEARCH
  Introduction
  Related work
  Motivation
  3.3.1 Use case
  Operational barriers
  System architecture
  Data storage and access service
  Service and workflow registry
  Indexing and querying metadata
  Service and workflow enactment
  Implementation
  Development and deployment tools
  Services provision
  Workflow engine
  Building workflows
  Web interface
  Discussion
  Issues with the first prototype
  Extension of the system
  Conclusion

CHAPTER 4: EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH THE MOGSERV
  Introduction
  System and methods
  Data model
  Services
  Data collection
  Local query
  Set management
  ClustalW
  Blast
  Phylip and Paup
  Data conversion
  Results of case studies
  Case study: the rediscovery of Erythrobacter litoralis
  Summary

CHAPTER 5: ONTOLOGICAL REPRESENTATION MODEL
  The MoG life sciences project and biomedical application
  Ontological representation model
  RDF, OWL, and DIG reasoner
  Generic service description ontology
  Service domain ontology
  MoG application domain ontology
  5.3 Implementation
  Conclusion

CHAPTER 6: IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW
  Introduction
  A hierarchical workflow structure
  An enhanced workflow system
  Knowledge management
  Knowledge discovery
  Translation process
  Service discovery and matchmaking process
  Knowledge reuse
  Implementation and evaluation
  Workflow reuse
  Related work
  Conclusion and future work

CHAPTER 7: SUMMARY AND FUTURE WORKS
  Summary
  Limitations and future work

APPENDIX A: GLOSSARY
  A.1 Pictures

APPENDIX B: MOGSERV MANUAL
  B.1 Main
  B.2 Retrieve genome and gene data from NCBI database
  B.3 Query local database
  B.4 Set management
  B.5 Data analysis services
  B.6 Job management

APPENDIX C: DEVELOPMENT AND DEPLOYMENT TOOLKITS

APPENDIX D: SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER
  D.1 Complete genome sequence in XML
  D.2 Example of an ATP synthase subunit B sequence
  D.3 Protein name
  D.4 Syntax of search local database
  D.5 Workflow of retrieve sequence
  D.6 ClustalW input
  D.7 Blast
  D.8 PAUP

APPENDIX E: SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER

BIBLIOGRAPHY

FIGURES

1.1 The evolution of the Web: yesterday's web is a repository for text and images; today's web is a platform to publish and access dynamically changing new types of content provided by a variety of services [8].
2.1 Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester.
2.2 Web services standards stack includes multiple layered and interrelated open standards.
Venn diagram representation of the integration of web service, grid computing, semantic web, and peer-to-peer technology into the realization of service-oriented architecture.
A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes.
Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters' needs.
P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers.
Summary of existing service discovery systems with different discovery mechanisms mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic.
A manual phylogenetic data collection and data analysis process.
3.2 MoGServ system architecture includes a services access client, MoGServ middle layer, and other data and services providers.
Asynchronized services and workflow invocation model.
A workflow built using Taverna workbench to get complete genome sequences and specific gene sequences.
A workflow for querying two subset sequences from local database, filtering out sequences coming from same organism, and doing sequence alignment analysis.
Abstraction of user defined workflows.
The growth of sequence databases (NCBI GenBank and EBI Swissprot) and annotations. This figure is from Folker Meyer [57].
Entity relationship diagram of the data model in MoGServ created by SQL::Translator.
An RDF graph model to represent some information for describing the MoG project web site.
Main concepts and partial relationships defined in the MoG application domain ontology.
The software components implementation of annotation and querying meta data.
A four level hierarchical workflow structure representation and transformation of scientific processes.
An example illustrates the user-oriented workflow definition with different levels of knowledge.
An enhanced workflow system with two added components, knowledge management and knowledge discovery.
The mismatching problem may be introduced due to the inaccurate annotation, incomplete semantic annotation, and inaccurate ontological reasoning during the translation process.
The creation process of connectivity graph when a new service is added in the registry; the connectivity is refined and updated during the workflow translation process.
The graph representation of a workflow for describing a scientific process.
A.1 Time line for the origin of life and major invasions giving rise to mitochondria and plastids [27].
A.2 Gene transfer to the nucleus [27].
A.3 Symbioses process [69].
A.4 ATP Synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny.
B.1 The main menu of the MoGServ.
B.2 A web interface provides users a way to define data with interests.
B.3 Input the query term from this interface and choose gene or genome database.
B.4 The results from querying local database.
B.5 Users may copy, paste particular sequences and upload to the local database.
B.6 Set information.
B.7 The set filter service is used to find intersection of organisms among multiple sets.
B.8 tblastn interface in MoGServ.
B.9 ClustalW interface in MoGServ.
B.10 Job management interface shows the status, input link, output link of a job.
B.11 An example input of a clustalw analysis; set id is a hot link, and users can view sequence information in this set.
B.12 An example output of a clustalw analysis; users can download, convert, view the results.
D.1 Phylogenetic tree generated from the PAUP.
D.2 Phylogenetic tree file generated from the PAUP can be viewed by other program.
E.1 This is the WSDL description of QueryLocal service hosted in the MoGServ, which provides an operation to create a set in the local database. This operation accepts two parameters and returns the set id.
E.2 One example of using Taverna workbench to create, test, and run workflow. This workflow accepts users' input, searches the local database, creates a set, aligns the set using ClustalW, and converts the ClustalW result to NEXUS format, which can be fed to PAUP.
E.3 XScufl workflow format represents the workflow created using the Taverna workbench.
E.4 Annotation of job and set information using ontological model defined. The sample rdf file is displayed using RDF Gravity.
E.5 Annotation of a service using ontological model defined. The sample rdf file is displayed using RDF Gravity.

TABLES

2.1 SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES
EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES
LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES
ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION
PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS
PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS
C.1 OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT
D.1 NAME OF ATP SYNTHASE
D.2 SYNTAX OF SEARCHING LOCAL DATABASE
D.3 INDEXING FIELD OF LOCAL DATABASE

ACKNOWLEDGMENTS

I would like to thank Dr. Gregory Madey for his encouragement and guidance on my research. I thank him for always saying "Life is short" and for his kindness, patience, and confidence in me. I appreciate that he gives his students as much freedom as possible in selecting research topics that best serve our interests, and that he seeks collaborative opportunities to help us fulfill our goals. His spirit of never ceasing to learn new material and never being afraid to explore new research areas always encouraged me on the way to finishing this dissertation and will encourage me in my future work. I thank him for his efforts to educate us as independent researchers in numerous ways. Many thanks go to Dr. Jeanne Romero-Severson for providing me with use cases and training in the biological field, and for her prompt feedback on my work. I would like to thank Dr. Amitabh Chaudhary for answering my questions about algorithms and for discussions about my research topics. I would like to thank my committee members, Dr. Patrick J. Flynn, Dr. Aaron Striegel, and Dr. Jeanne Romero-Severson, for their valuable contributions. I would also like to thank my son for trying hard not to bother me too much while I was busy working and for giving me excuses to relax. Many thanks go to my husband, my parents, and my friends for their constant emotional support, no matter how much frustration I had. This research work is partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund.

CHAPTER 1: INTRODUCTION

Since the first generation of the World Wide Web (the Web) appeared in 1990, it has mainly served as a repository for text and images presented in HTML format. Nowadays, the Web is evolving into a platform to publish and access dynamically changing new types of content provided by a variety of services that are realized with web-accessible programs, databases, and physical devices. Tim Berners-Lee et al. [8] present the evolution of the Web (see Figure 1.1); the authors emphasize the importance of understanding the current, evolving, and potential Web in the article Creating a Science of the Web.

Figure 1.1. The evolution of the Web: yesterday's web is a repository for text and images; today's web is a platform to publish and access dynamically changing new types of content provided by a variety of services [8]. This picture is adapted from the article Creating a Science of the Web by Tim Berners-Lee et al.

The Web has been used in e-commerce and Business-to-Business (B2B) applications to deliver information and provide services to customers and business partners. For example, a travel agency provides services for travelers to view and compare airfares and to book tickets and hotels online. As the number of service transactions between businesses increases, so does the demand for interoperability between these applications; the service-oriented architecture (SOA) has been proposed as an underlying architecture to enhance this capability. Although it has many definitions, none of them standard, SOA is commonly accepted as a new architectural style that enables the combination of, and communication among, loosely coupled services. These services are described with a standard interface definition that hides the implementation language and platform of the services in a SOA. A service can be called to perform a task without the service having prior knowledge of the calling application, and without the application having or needing knowledge of how the service actually performs its tasks.

The realization of a service-oriented architecture is not tied to a specific technology or protocol. The web service standards, including SOAP, WSDL, and UDDI, have been widely accepted as the realization of a SOA, with support from a number of tools. Therefore, the service-oriented architecture is often defined as services exposed using this web service protocol stack. A SOA based system can therefore be referred to as a system developed using these technologies. Building a SOA based system can help businesses respond more quickly and cost-effectively to changing market conditions. It promotes reuse of existing legacy applications as services and simplifies the interconnection of distributed business processes inside organizations or across organizational boundaries.

As stated in the article [8], the Web has changed the ways scientists communicate, collaborate, and educate. The evolution of Web use in the e-science field is similar to its evolution in the e-business domain. The effort of building the e-science infrastructure started from developing gateways or portals that provide access to integrated databases and computing resources behind a web-based user interface in multiple scientific fields. Examples of this kind of science include social simulations, physics, environmental sciences, and bioinformatics. This infrastructure has been used to solve problems such as distributed physical or astronomical data analysis, and remote access to information sources and simulations. It facilitates the use of computational resources located at different physical sites, thereby allowing users at different locations to easily share information and communicate with each other. More recently, the service-oriented architecture, in combination with semantic web, peer-to-peer (P2P) computing, and grid computing technologies, has been identified as a promising way to build such infrastructures for supporting e-science by providing access to heterogeneous computational resources and integration of distributed scientific and engineering applications developed by individual scientists and groups [91] [93].

With the promising future of adopting the service-oriented architecture in e-science and e-business, a number of challenges arise in terms of integrating independently developed data systems without requiring global agreements as to terms and concepts, efficient allocation of computational resources, and the security and privacy issues of accessing shared data resources. These challenges attract researchers from diverse research areas such as information retrieval, database systems, artificial intelligence, software engineering, and distributed computing.

1.1 Main contributions of the dissertation

Our research work starts from an investigation of current research trends and challenges in the SOA area. In order to discover the best practices for building SOA based systems, we demonstrate our design and implementation of a SOA based system to support scientific research and increase productivity. It serves as a prototype for our future research work in this field as well as an in silico investigation platform for scientists. A particular scientific domain, the study of the deep phylogeny of the plant chloroplast, is used in this prototype. This application shows that a SOA based system can help scientists achieve a research goal that is difficult, and almost impossible, without this system. We conduct our research from both practical and theoretical aspects. We propose a hierarchical structure for workflows that integrates semantic web technology to improve the reuse of workflows. To address the security and resource allocation issues, we propose integrating the current system with an existing grid computing platform. The main contributions of this dissertation are:

A survey and analysis of current trends and research challenges in the service-oriented architecture: Grid computing, peer-to-peer computing (P2P), and semantic web technologies are related to SOA. A recently proposed grid standard, Open Grid Service Architecture (OGSA), built upon the service-oriented

architecture, demonstrates the convergence of grid computing with SOA. Semantic web technology is used in grid services and SOA to enhance the automation of scientific and engineering computational workflows. Applying P2P technology in SOA makes service discovery and enactment more scalable than centralized approaches. Much research has explored the convergence of these technologies in order to make this new distributed computing paradigm successful. We present our investigation of the research issues and challenges in SOA. Our discussion of open issues and future research trends focuses on several critical aspects of SOA: service discovery, service composition, and service enactment.

A service-oriented data integration and analysis environment for in silico experiments and bioinformatics research: As more public data providers begin to offer their data as web services in order to facilitate better data integration in the bioinformatics community, we designed and implemented a service-oriented architecture that integrates data and services to support a deep phylogenetic study. This software environment focuses on representing both data access and data analysis as web services. We believe that, with this common interface, it will be easy for other researchers who are interested in deep phylogenetic analysis to integrate our data and services into their applications. Based on a first prototype, we discuss several issues in the implementation and indicate the possible integration with semantic web and grid computing technologies to address these limitations. We present a practical experiment of building a service-oriented system upon current web services technologies and bioinformatics middleware. The system allows scientists to extract data from heterogeneous data sources and to generate phylogenetic comparisons automatically. This can be difficult to accomplish using manual search tools since sequence data is rapidly

accumulating and the process can be long and tedious.

An application for exploring the deep phylogeny of the plastids with the SOA based system: To serve as an example and proof of concept that the service-oriented architecture can help scientists increase their productivity and solve more complex problems than is possible with traditional approaches, we apply several use cases to the system. We detail the services provided in this environment and illustrate the results, which demonstrate that the environment can help support scientific analysis and make new discoveries.

A methodology and a novel approach to facilitate the reuse of workflows and the composition of services: Most current practical methodologies for creating workflows rely heavily on users having complete knowledge and understanding of individual services at a low-level description. Using semantic web technology, services can be described with rich semantics. Recent research has focused on supporting users in the discovery and composition of services by using rich service annotations. Users can choose to encapsulate a service in a workflow to achieve particular goals based on the conceptual service definition, in semi-automatic and automatic ways. Most such approaches are semi-automatic: they allow users to discover and select appropriate services to include in a workflow based on the semantic and conceptual service definition. This relieves bioinformatics researchers of the need for detailed knowledge and understanding of each tool, service, and data type. Instead, more complex middleware is used to assist with the composition process and resolve the incompatibility between two given services. Few approaches consider the potential for complete or partial reuse of existing workflows. We present a hierarchical workflow structure

with a four level representation of workflow: abstract workflow, concrete workflow, optimal workflow, and workflow instance. This four level representation provides more flexibility for the reuse of existing workflows. We believe that reuse of complete or partial workflows takes advantage of the verified knowledge learned in practice and can increase the soundness of the composed workflow. We propose an ontological representation model of data and services as well as an approach that uses a graph matching algorithm to find similar workflows with semantic annotation.

1.2 Organization of the dissertation

The rest of this dissertation is organized as follows: Chapter 2 introduces several concepts and technologies related to SOA and discusses related research issues and challenges. Chapter 3 presents the design and implementation of a SOA based system for supporting bioinformatics research. Chapter 4 demonstrates a particular application that uses this system to discover new phylogenetic knowledge. Chapter 5 presents an ontological model to annotate services and data. The semantically enriched data is easier to reuse and share, and it supports experiments that involve search. Chapter 6 proposes a methodology and a novel approach that can facilitate the reuse of workflows and the composition of services. Chapter 7 summarizes the dissertation and identifies potential future work.

CHAPTER 2: RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED COMPUTING

2.1 Introduction

Computing systems have evolved from monolithic through client-server and 3-tier to N-tier architectures. The N-tier architecture layers request and response calls among applications that may reside on multiple sites. Service-oriented computing (SOC), a term frequently used interchangeably with the service-oriented architecture (SOA), involves service layers, functionality, and roles as described by SOA [70]. SOA can be considered a conceptual description of a concrete implementation of a service-oriented computing infrastructure. It is an emerging paradigm for distributed computing intended to enable systematic application-to-application interaction. Services are the basic units of a service-oriented computing platform. They are autonomous, platform-independent software components that can be described, published, discovered, invoked, and composed using standard protocols within and across organizational boundaries. A service is a piece of work done by a service provider in order to provide desired results for a service requester. Service providers and requesters are roles played by software agents on behalf of their owners. The goal of this new distributed computing architecture is to enable interaction among loosely coupled software agents in a flexible and effective way.
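The provider and requester roles and the describe, publish, discover, and invoke steps named above can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions: the `ServiceDescription` and `Registry` classes, the keyword-based lookup, and the `reverse_complement` example service are hypothetical names introduced here, not components of the system described in this dissertation, and no network protocol is involved.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServiceDescription:
    """Hypothetical description that a provider publishes for one service."""
    name: str
    keywords: set[str]
    invoke: Callable[[str], str]


class Registry:
    """Hypothetical in-memory registry standing in for a real service registry."""

    def __init__(self) -> None:
        self._services: dict[str, ServiceDescription] = {}

    def publish(self, description: ServiceDescription) -> None:
        # A provider makes its service description known to requesters.
        self._services[description.name] = description

    def discover(self, keyword: str) -> list[ServiceDescription]:
        # A requester looks services up by keyword, not by implementation.
        return [d for d in self._services.values() if keyword in d.keywords]


def reverse_complement(sequence: str) -> str:
    """Example provider-side function: reverse complement of a DNA string."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Provider role: describe and publish the service.
registry = Registry()
registry.publish(
    ServiceDescription("reverse_complement", {"sequence", "dna"}, reverse_complement)
)

# Requester role: discover a matching service, then invoke it.
matches = registry.discover("dna")
result = matches[0].invoke("AAGC")
print(result)
```

The requester in this sketch never refers to `reverse_complement` directly; it only sees what the registry returns, which is the loose coupling the paragraph describes.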

SOC has been adopted in portal design, e-commerce, e-science, legacy system integration, and grid computing. One example is the integration of engineering design processes, such as automobile and aircraft design, which typically involve several partners at different locations. These partners may be both cooperative and competitive. Successful engineering design requires well-coordinated interactions between individuals or teams in specialized knowledge domains, information exchange, models, and integration to achieve an optimal goal. However, a significant part of the design models and tools may contain proprietary information that cannot be disclosed. Also, these models and tools are normally written in a variety of programming languages and run on different platforms. With service-oriented computing technologies, these models and tools can be treated as black boxes and run at their original locations [5] [43].

Reusability, interoperability, security, and easy maintenance are major potential benefits of SOC.

Reusability: Services provide a higher-level standard abstraction that allows the reuse of existing software.

Interoperability: The standard abstraction of services enables the interoperation of software produced by different programmers and improves productivity.

Security: With the standard abstraction of services, software can be viewed as a black box. The internal implementations or algorithms are not accessible to competitive partners.

Maintenance: With the standard abstraction of services, changes to the underlying implementation will not adversely impact the use of the services.
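The maintenance benefit rests on requesters depending only on the service abstraction. A minimal sketch of that idea follows, assuming a hypothetical `GCContentService` interface with two interchangeable implementations; the names and the GC-content example are illustrative assumptions, not services from this dissertation.

```python
from typing import Protocol


class GCContentService(Protocol):
    """Hypothetical abstract service interface that requesters depend on."""

    def gc_content(self, sequence: str) -> float: ...


class CountingImpl:
    """First implementation: uses str.count on the upper-cased sequence."""

    def gc_content(self, sequence: str) -> float:
        seq = sequence.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)


class LoopImpl:
    """Replacement implementation: iterates over the bases one by one."""

    def gc_content(self, sequence: str) -> float:
        hits = sum(1 for base in sequence.upper() if base in "GC")
        return hits / len(sequence)


def requester(service: GCContentService) -> float:
    # The requester is written against the interface only, so swapping the
    # implementation behind it requires no change here.
    return service.gc_content("ACGT")


print(requester(CountingImpl()), requester(LoopImpl()))
```

Both calls return the same value for the same input, so the provider can replace one implementation with the other without the requester noticing.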

While the potential benefits of SOC are compelling, successful service-oriented implementation requires solving several issues and challenges arising from these promising features. These issues and challenges include service discovery, service composition, and service invocation; monitoring the execution of services; methodologies supporting services development, evaluation, and life-cycle management; and approaches to guarantee quality, security, and reliability of services. These challenges attract researchers from diverse research areas such as information retrieval, database systems, artificial intelligence, software engineering, and distributed computing. In this chapter¹, we introduce several concepts and technologies related to SOC and discuss related research issues and challenges.

¹ Portions of this chapter appear in A semantic web services enabled web portal architecture, International Conference of Web Services (ICWS2004) [108].

2.2 Overview of related concepts and technologies

Several definitions of SOA are available; the W3C defines SOA as a form of distributed systems architecture with the following properties [105]:

- The service is an abstracted, logical view of actual programs, databases, and business processes.
- A service or a function is described using a description language.
- Services tend to use a small number of operations with relatively large and complex messages.
- Services tend to be oriented toward use over a network.
- Messages are sent in a platform-neutral, standardized format through the interface. XML is the most obvious format.
- The service is implemented as a software agent.

The service is formally defined in terms of the messages exchanged between provider agents and requester agents, and not the properties of the agents themselves. By avoiding any knowledge of the internal structure of an agent, one can incorporate any software component or application that can be wrapped in message handling code that allows it to adhere to the formal service definition.

There are two fundamental components in a basic service-oriented architecture, as shown in Figure 2.1. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester. The request and subsequent response connections are defined in a way that is understandable to both the service requester and the service provider.

Web services

Although there is no standard definition of web services, a web service is generally considered one type of realization of SOA. Among various definitions, we refer to the definition from the W3C: "A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards."
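The requester and provider roles described above can be sketched as a plain XML message exchange. The sketch below is deliberately simplified and makes several assumptions: the `request`, `param`, `response`, `result`, and `fault` element names and the `sequenceLength` operation are hypothetical, the messages are not SOAP envelopes, and the provider is called in-process rather than over HTTP.

```python
import xml.etree.ElementTree as ET


def build_request(operation: str, **params: str) -> str:
    """Requester side: serialize an operation call as a small XML message."""
    root = ET.Element("request", {"operation": operation})
    for key, value in params.items():
        ET.SubElement(root, "param", {"name": key}).text = value
    return ET.tostring(root, encoding="unicode")


def provider_handle(request_xml: str) -> str:
    """Provider side: parse the request, do the work, reply in XML."""
    request = ET.fromstring(request_xml)
    operation = request.get("operation")
    params = {p.get("name"): p.text for p in request.findall("param")}
    response = ET.Element("response", {"operation": str(operation)})
    if operation == "sequenceLength":
        ET.SubElement(response, "result").text = str(len(params["sequence"]))
    else:
        ET.SubElement(response, "fault").text = "unknown operation"
    return ET.tostring(response, encoding="unicode")


def requester_call(sequence: str) -> int:
    """Requester side: send the request and read the result from the reply."""
    reply_xml = provider_handle(build_request("sequenceLength", sequence=sequence))
    return int(ET.fromstring(reply_xml).findtext("result"))


print(build_request("sequenceLength", sequence="ACGTAC"))
print(requester_call("ACGTAC"))
```

The requester knows only the message format, not how the provider computes the answer, which mirrors the point that a service is defined by the messages exchanged rather than by the properties of the agents.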

Figure 2.1. Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester. (Diagram labels: the service provider is a software agent that implements the service; the service requester is a software agent that has knowledge of the service in terms of the description, not the implementation; the request is sent and the results are returned in XML format over the Internet.)

Concrete software agents that implement an abstract service interface can be written in different programming languages and can run on different platforms. Since these concrete agents implement the same function defined in the abstract interface, a change to the underlying implementation does not affect the use of the service. A web service architecture is based upon many layered and interrelated open standards and web technologies, as shown in Figure 2.2. The Web Service Description Language (WSDL) defines the abstract interface of services. The Simple Object Access Protocol (SOAP) is a protocol for exchanging messages among requesters and providers. Universal Description, Discovery and Integration (UDDI) provides a standard registry for publishing, discovery, and reuse of web services. WSDL, SOAP, and UDDI are core standards based on fundamental web technologies including XML, TCP/IP, and FTP. There are also emerging standards proposed for defining business, scientific, or engineering processes,

transactions, and security, e.g., BPEL4WS, WS-I. Two main styles of web services are available: SOAP web services and REST (Representational State Transfer) web services. In this dissertation, we use the term web services to mean SOAP style web services.

Figure 2.2. Web services standards stack includes multiple layered and interrelated open standards. (Layers shown: additional WS* standards; business process execution: BPEL4WS, WFML, WSFL, BizTalk; service publishing and discovery: UDDI; transactions, management, and security; services description: WSDL; services communication: SOAP; meta language: XML; network transport protocols: TCP/IP, HTTP, SMTP, FTP, etc.)

Semantic web

The vision of the semantic web is to represent units of web-based information with well-defined and machine-understandable semantics so that intelligent software agents can autonomously process them [7]. This information, including these

abstract descriptions of services, must be defined and linked in such a way that it can be used for automation, sharing, integration, and reuse even when these software agents are designed, developed, and owned by different groups or individuals. SOA, and more specifically web services, is a key component in realizing the vision of the semantic web, since most web sites on today's web do not merely provide static information but allow users to interact and generate dynamic information through services. To make use of a web service, a software agent needs a computer-interpretable description of the service. Adding meaningful descriptions to the interface using semantic web technology can avoid ambiguous interpretations of information and service descriptions and increase the soundness of the results provided by service providers. The combination of these two technologies has resulted in the emergence of a new generation of web services called semantic web services [54]. The proposed standards for knowledge sharing and reuse in the semantic web range from the Resource Description Framework (RDF) to the Web Ontology Language (OWL) [67]. These two standards have become W3C recommendations. The appearance of open source tools that support creation, parsing, and reasoning using these standards makes the addition of semantic web technology to SOC feasible.

2.2.3 Grid computing

Grid computing [32] is a computing platform intended to integrate resources (both data and computational resources) from different organizations, called virtual organizations, in a shared, coordinated, and collaborative way to solve large-scale science and engineering problems. The Globus toolkit [97] is one implementation of the specifications for grid computing. It has become the

standard for grid middleware. The Open Grid Services Architecture (OGSA), developed within the Global Grid Forum (GGF), describes a service-oriented architecture for grid computing environments for business and scientific use. OGSA is based on several web service technologies, notably WSDL and SOAP. It is a distributed interaction and computing architecture based around services, assuring interoperability on heterogeneous systems so that different types of resources can communicate and share information. The major goal of the grid computing platform is to provide an easy-to-use and flexible computing infrastructure for supporting e-science. The goal of e-science is to offer scientists and engineers an effective way to generate, analyze, and share their experiments, data, instruments, computational tools, and results. The lack of seamless automation of the scientific process remains a major gap between this vision and reality. Grid computing shares some of its problems and technical challenges with service-oriented computing in general. Incorporating semantic web technologies into grid computing brings us a new concept, the semantic grid [21], which intends to narrow this gap by achieving seamless integration and automation of scientific and engineering workflows.

2.2.4 Peer-to-peer computing

Peer-to-peer (P2P) computing has received significant attention due to the popularity of P2P file sharing systems such as Napster, Gnutella, Freenet, Morpheus, BitTorrent, and KaZaA. Peers are autonomous agents that exchange information in a completely decentralized manner. P2P architecture does not have

a single point of failure. Since nodes contact each other directly, the information they receive is up to date. The P2P model can provide an alternative for dynamic service discovery that does not rely on centralized registries. The P2P model also provides an alternative for interaction between web services. We discuss the research done in this direction in the following sections.

Semantic web technology enhances the capability of automation in SOA and grid computing. Building grid computing upon SOA increases flexibility. The P2P computing model increases scalability and reliability. Figure 2.3 gives an overview of current research trends that intend to use these technologies together.

2.3 Issues in the service-oriented computing

Figure 2.4 shows the service publication, service discovery, and service invocation stages in the life cycle of a service. This process involves three roles in the SOA: service provider, service requester, and service discovery system. Service providers create services and provide platforms to execute these services. Service requesters query the service discovery system to find appropriate services. To enable service requesters to find services, service providers need to publish their service interfaces in a publicly available location. Specifying the capability and quality of services and finding a matching service based on these descriptions are usually done as two separate activities. The more information that is given to describe services, the more accurate the returned matches are. Services can be categorized into simple services (atomic services) and complex services (composite services). Generating and executing a composite service to solve a complicated problem is an important feature leading to the adoption of SOA.
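The three roles in this life cycle can be illustrated with a minimal in-memory sketch. It is not a UDDI, WSDL, or SOAP implementation: the class names `ServiceDescription` and `ServiceRegistry`, the `publish` and `find` methods, and the `forecast` service are all hypothetical, and a plain Python callable stands in for a network endpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ServiceDescription:
    """What a provider publishes (hypothetical, far simpler than WSDL)."""
    name: str
    keywords: List[str]
    endpoint: Callable[[str], str]  # stands in for a binding address


@dataclass
class ServiceRegistry:
    """Hypothetical service discovery system holding published descriptions."""
    entries: Dict[str, ServiceDescription] = field(default_factory=dict)

    def publish(self, description: ServiceDescription) -> None:
        # Publication: the provider advertises its service description.
        self.entries[description.name] = description

    def find(self, keyword: str) -> List[ServiceDescription]:
        # Discovery: a simple keyword match over the stored descriptions.
        return [d for d in self.entries.values() if keyword in d.keywords]


def forecast(city: str) -> str:
    """Provider-side implementation, hidden behind the description."""
    return f"forecast for {city}: sunny"


# Service provider publishes, service requester discovers and then invokes.
registry = ServiceRegistry()
registry.publish(ServiceDescription("WeatherService", ["forecast", "weather"], forecast))
matches = registry.find("forecast")
result = matches[0].endpoint("Notre Dame")
print(result)
```

The requester only touches the description returned by the registry, which mirrors the point that a requester knows a service through its description and not through its implementation.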

Figure 2.3. Venn diagram representation of the integration of web service, grid computing, semantic web, and peer-to-peer technologies in the realization of service-oriented architecture.

In the following sections, we discuss several active research issues in SOA: service description, service discovery, service composition, and service execution.

2.3.1 Service description

One requirement of the service-oriented architecture is to provide meaningful descriptions for services so that software agents can understand their features and learn how to interact with them. A service description gives a formal representation of the properties of a service. These properties can be classified into functional and non-functional properties. Functional properties contain the details of a service interface and service

behavior, including data types, operations, transport protocol information, and binding address.

[Figure 2.4 diagram: a service provider publishes to a service broker (step 1); a service consumer performs discovery through the broker (steps 2 and 3) and then invokes the service (steps 4 and 5).]

Figure 2.4. A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes.

WSDL is the first W3C specification to be widely used for service descriptions. There may be multiple service providers who offer the same functionality defined in a service interface. Determining and choosing the best service therefore becomes important for service requesters. The information in WSDL descriptions is not sufficient for ranking services. Non-functional properties, including the cost, performance, security, and trustworthiness of a service, are introduced for measuring the Quality of Service (QoS). There are many aspects of QoS, which can be organized into categories with a set of quantifiable parameters [75]. The best service may have different meanings for different requesters. One may prefer security over cost while another may prefer lower cost over performance. Measurements of these non-functional properties can be obtained using statistical analysis, data mining, and text mining technologies. This is normally done by a third party through the collection of subjective evaluations from requesters. This information changes dynamically over time.
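The observation that the best service differs between requesters can be illustrated with a weighted ranking. This is only a sketch under stated assumptions: the function `rank_services`, the service names, and all scores are hypothetical, every QoS value is assumed to be normalized to the range 0 to 1 with higher meaning better (so a high cost score means a cheap service), and a weighted sum is just one possible way to combine the parameters.

```python
from typing import Dict, List


def rank_services(candidates: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> List[str]:
    """Return service names ordered from best to worst weighted QoS score."""
    def score(properties: Dict[str, float]) -> float:
        # Properties without a weight contribute nothing to the score.
        return sum(weights.get(name, 0.0) * value
                   for name, value in properties.items())

    return sorted(candidates, key=lambda name: score(candidates[name]),
                  reverse=True)


# Hypothetical, normalized QoS measurements for two equivalent services.
candidates = {
    "ServiceA": {"security": 0.9, "cost": 0.3, "performance": 0.6},
    "ServiceB": {"security": 0.4, "cost": 0.9, "performance": 0.5},
}

# One requester values security most, the other values low cost most.
security_first = rank_services(
    candidates, {"security": 0.7, "cost": 0.1, "performance": 0.2})
cost_first = rank_services(
    candidates, {"security": 0.1, "cost": 0.7, "performance": 0.2})
print(security_first, cost_first)
```

With these illustrative numbers the two requesters obtain opposite rankings from the same measurements, which is why a single global ordering of services is not sufficient.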

Pure syntactic descriptions of services require requesters to fully understand the capability of a service before using it. Selecting one web service among several with similar WSDL descriptions requires more information than WSDL actually defines. The semantic web, supported by the use of an ontology, is likely to provide better and more scalable solutions to these issues. There are two directions for enhancing the semantics of web service descriptions (see Table 2.1). The first is to enhance the WSDL description. The Semantic Annotations for Web Services Description Language Working Group [81] has the objective of developing a mechanism to enable annotation of web service descriptions. This mechanism will take advantage of the existing WSDL standard (WSDL 2.0) to build simple and generic support for semantics in web services. Some systems [54] [55] define an ontology for web services using emerging languages such as DAML+OIL and OWL. The second is OWL-S, recently submitted to the W3C, which provides an ontology for describing web services using OWL. OWL-S enables description of not only the functional properties of a service but also its non-functional properties. This domain-independent service ontology is augmented by domain-specific ontologies in real applications. Enhancing service descriptions with ontological representations increases the cost and complexity of service annotation in several respects.

Creation of domain ontology. Use of ontologies is considered to be the most promising basis for defining the semantics of objects and allowing meaningful information exchange among machines and humans. A commonly used definition of an ontology is "a specification of a conceptualization" [40]. An ontology is intended to give a concise, uniform, and declarative description of information and knowledge that is interesting and useful to a community

of users, using a common vocabulary and language.

TABLE 2.1
SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES

Description method | Representation | Challenges
syntactic | WSDL | No representation of non-functional properties; not sufficient for representing meaningful descriptions; no representation for process; supports keyword search only
semantic | domain ontology + WSDL | No representation for process; complexity of service annotation
semantic | domain ontology + OWL-S | Complexity of service annotation

Construction of a knowledge base involves investigating a particular domain, determining important concepts in that domain, and creating a formal representation (ontology) of the objects and relations in the domain. A general ontology represents a broad selection of objects and relations at a higher level of abstraction [79]. Miller et al. [59] investigate ontologies for simulation modeling. Christley et al. [15] present an ontology for agent-based modeling and simulation. An ontology is normally defined and revised (if needed) by an authority. Usually the authority needs to collaborate with domain experts before or during the process of creating formal representations. Large-scale ontologies can be constructed by publishing a prototype ontology for the research community. The Gene Ontology (GO) Consortium produces

a controlled vocabulary for classifying gene product attributes, molecular functions, cellular components, and biological processes [35] in the biological sciences field. It consists of terms (as of September 27, 2004) and terms (as of March 11, 2007).

Integration of ontologies. Vast amounts of information may come from many different ontologies. For this reason, and because many heterogeneous data repositories are developed by different research groups and reside at different research institutes and organizations, it is impossible to process this information and data without knowledge of the semantic mapping between them. Much research has been done to explore the mapping and matching of concepts and the integration of different ontologies using sophisticated algorithms and AI techniques such as machine learning [25] [62]. There are two approaches to ontology integration. One approach integrates the ontologies that different groups have developed for data representation into a common global ontology. While this approach makes information correlation in query processing easier, it increases the complexity of integrating the ontologies and maintaining consistency among concepts. The other approach is interoperation across different ontologies via terminological relationships between terms, instead of integration of the ontologies into a global one [56] [66]. Interontological relationships are specified using description logics in an interontological relationships manager to handle vocabulary heterogeneity between ontologies. Although this approach increases scalability, extensibility, and maintainability, it shifts the burden to the interoperation mechanisms.

Annotation of services. The annotation of services using ontologies is generally done manually. It is a complex process since there may be multiple ontologies related to a single service. These ontologies may be developed by different groups. Different groups may represent the same concept using different vocabulary, or represent different concepts using the same vocabulary. Some systems, such as MWSAF (Meteor-S web service annotation framework) [71], provide graphical tools that enable users to annotate existing web service descriptions with ontologies in a semi-automatic way using AI technologies such as machine learning. The IBM ETTK [30] technology provides a set of toolkits, including a graphical editor for annotating services in a way that is compatible with WSDL-S.

2.3.2 Service discovery

Without prior knowledge of a service, service requesters may not know the location or even the existence of the services they desire. The goal of the service discovery process is to find the services that are best suited to the requirements of the requester. A basic service discovery process can be described as follows.

1. Service providers provide descriptions of their services and advertise these services in a service registry. A service registry is a service discovery system that consists of mechanisms for efficiently searching for appropriate services and physical space for storing the characteristics of services. UDDI is a registry standard.

2. Service requesters request desired services using keywords or complicated query languages.

3. A service discovery system accepts requests from requesters. It searches the service descriptions in its database and tries to find services that match the requests. This process is also called matchmaking.

As the number of web services grows, new registries appear as needed. A service may be registered in several registries. A service discovery broker accepts requests from service requesters, translates the requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on the requesters' needs (see Figure 2.5). In this mechanism, the broker may issue the request to multiple registries in parallel. However, the broker remains a communication bottleneck and a single point of failure.

An alternative to the centralized discovery mechanism is the P2P-based discovery mechanism. In this approach, each service provider acts as a peer in the P2P network. Each provider has its own way of storing information about other service providers, called neighbors, and provides the resources to relay information. A network resembling a social network is eventually formed. During the discovery process, a requester queries its neighbors for a desired service, and the query propagates through the network until a suitable service is found or the query terminates [105]. This approach provides higher reliability than a centralized approach. It avoids the single point of failure and the latency in providing up-to-date descriptions for updated services. However, since each service provider is a peer, a huge peer community may result in inefficient search. Instead of treating every provider as a peer, each registry can act as a peer in the network to overcome this problem. Much research has been done on realizing a P2P discovery mechanism. Schmidt and Parashar present a P2P-based keyword web service discovery system on the

Chord overlay network [82]. In this system, a set of keywords is extracted from the web service descriptions. The descriptions are indexed using these keywords, and the index is stored at the peers in the P2P system. Each web service description is mapped to the index space. Node joins, departures, failures, and data lookup are built upon the Chord network's lookup protocol. Speed-R [88] is a JXTA-based P2P network system supporting semantic publication and discovery of web services. In this system, each service registry is controlled by a peer. Dogac et al. [26] describe a way to expose the semantics of web service registries and connect the service registries through a P2P network for the travel industry.

[Figure 2.5 diagram: service providers (A, B, C) publish services into one or multiple service registries (a, b, c). The service discovery broker handles queries from requesters, translates queries into the formats needed by each registry, communicates with each registry, and unifies and distills the results returned from the registries. Service requesters send a request for inquiry services using the broker's required syntax and receive the results from the broker.]

Figure 2.5. Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on the requesters' needs.

A general P2P discovery system (see Figure 2.6) contains

a data layer, a communication layer, and peers that control registries or service providers. The data layer can be formed by registries or service providers. Communication layers are implementations of P2P networks, such as JXTA and Chord. Semantically enriched services and registries make the automation of service discovery and the discovery of service registries possible.

[Figure 2.6 diagram: peers A, B, ..., N holding semantically enriched descriptions of services or registries using an ontology; a communication layer (JXTA, Chord); and a data layer of registries A, B, ..., N.]

Figure 2.6. P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers.

The traditional service discovery method, static or manual discovery, relies on human intervention: a person uses a discovery system to locate and select a service description that meets the desired criteria at design time. The dynamically changing service environment requires that service discovery also be possible for a software agent at run time. Realizing such a dynamic discovery mechanism requires machine-processable semantics to describe services. The implementation and performance of a service discovery system depend on the information available in service descriptions. The more information the

system can gather, the more accurate the results it can give back to the requester. The implementation also depends on the kind of query that can be given by the requester. Two examples are "give a forecast service" and "give a forecast service which has the fastest response time." For the first query, a simple keyword-based discovery system is sufficient. For the second query, the discovery system needs to gather information on quality of service, find several forecast services in registries, and rank them based on response time.

The service discovery problem is related to the information retrieval problem. Two key quality measurements in information retrieval are also applicable when evaluating the performance of service discovery systems [45]. Recall is the number of relevant items retrieved, divided by the total number of relevant items in the collection. Precision is the number of relevant items retrieved, divided by the total number of items retrieved.

The discovery mechanism in the traditional UDDI standard, which supports only static service discovery, has been recognized as insufficient. This mechanism often gives no result at all or gives many irrelevant results, because keywords are a poor method for capturing the semantics of a request. Synonyms (syntactically different words that have the same meaning) and homonyms (identical words that have different meanings in different domains) cannot be distinguished in keyword-based retrieval. Also, relationships between different keywords in a request cannot be captured. This mechanism therefore offers low retrieval precision and recall. WordNet [102] has been used to handle synonyms, together with an information retrieval model, in the service retrieval process [99] so as to improve precision and recall. WordNet is a lexical reference system developed by the Cognitive Science Laboratory at Princeton University. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept, and the synonym sets are linked by different relations. WordNet is distributed as a data set. However, WordNet only supports queries on common words; vocabularies for a particular domain are most likely not included in WordNet.

With rich formal semantic descriptions added to web services, a service discovery system can provide more accurate results with high precision and recall. Such descriptions also reduce human intervention in the discovery process and make dynamic discovery possible. Therefore, semantic web technologies have become a solution for this matchmaking process [47] [26] [28]. In the meantime, quality of service has become an interesting topic for selecting optimal services from a subset of services that have the functionality the requester asked for [10] [53] [48] [19]. Two types of semantic descriptions result in two types of semantic discovery systems: (1) adding semantics to current web services standards (UDDI and WSDL) [85]; and (2) using DAML-S or OWL-S to represent both functional and non-functional properties of web services, which enables software agents or search engines to find appropriate web services automatically via ontologies and reasoning-enriched methods. However, the high cost of formally defining heavy and complicated services makes adoption of this improvement unlikely at the current stage.

Figure 2.7 shows, in three dimensions, the service discovery mechanisms currently used in implementations of service discovery systems. A is a keyword-based system, such as traditional UDDI. B is semantically enriched UDDI systems [85]. C is keyword-based P2P systems [82]. D is semantic-based systems with DAML-S or OWL-S [47] [26] [28]. E is semantic-based systems on a P2P network

[88] [26].

[Figure 2.7 diagram: systems A to E placed along three axes: centralized versus decentralized (P2P), keywords based versus semantic based, and static versus dynamic.]

Figure 2.7. Summary of existing service discovery systems with different discovery mechanisms, mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic discovery.

The research challenges in the service discovery process suggest integrating semantic and P2P technologies to build a discovery system. Such a system should allow automatic service discovery and provide high precision and recall at the same time. However, the cost of implementing it makes adoption difficult at this time.
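The recall and precision measures used in this section to evaluate discovery systems can be computed directly from their definitions. The sketch below is illustrative only: the function name `precision_recall` and the service identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the text prescribes.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Compute (precision, recall) for one discovery query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant items retrieved
    # Precision: relevant items retrieved / total items retrieved.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: relevant items retrieved / total relevant items.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: the system returns four services, of which two are
# among the three services that are actually relevant.
retrieved = ["s1", "s2", "s3", "s4"]
relevant = ["s1", "s3", "s5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

A keyword-based system that returns many irrelevant matches lowers precision, while one that misses synonyms of the query term lowers recall.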

2.3.3 Service composition

One of the most attractive features of service-oriented computing is that atomic services can be combined into a large application to solve complicated problems. The orchestration of a set of services to accomplish a larger and more sophisticated goal is called a workflow. In the business world, a workflow is referred to as a business process. In the scientific domain, a workflow is sometimes referred to as a scientific process. Several different approaches and platforms are being developed to achieve the common goal of web service composition. These approaches range from adoption of industry standards to adoption of semantic web technology, and from manual or static composition to automatic dynamic composition [90]. Since there is no standard service composition specification, each approach and platform defines its own way of composing services, provides its own specifications and languages, and executes the workflow on a specific workflow execution engine. Current solutions for web service composition include the adoption of industrial standards, semantic web technologies [86] [29] [41], web components [111], Petri nets [112], and so on. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58]. Adoption of industrial standards and adoption of semantic web technologies are two active research areas among current service composition mechanisms. Both of these mechanisms support complex process activities, such as sequences and branching. Current industrial standards include WSDL, UDDI, SOAP, and a set of workflow specification languages (BPEL4WS, WSFL, BPML, WSCI, and XLANG)

used to support data flow and control flow representations [98]. Among all of these specifications, BPEL4WS is the most mature and the most widely supported by the industry and research community. Service compositions described in the BPEL4WS format may be deployed on execution engines such as BPWS4J [11] and the Collaxa BPEL server [17]. The other approach is based on semantic web technologies and AI planning techniques [84] [13]. In this model, services are semantically annotated with RDF/RDF Schema, DAML-S, or OWL-S. The objective is to enable automation of web service discovery, invocation, composition, and execution. However, there is limited implementation and product support for generating service descriptions automatically at the current research stage.

Most service composition models require application developers to possess complete knowledge of the available services and the exact process logic. Developers must choose a particular service at each step. Adoption of semantic web technologies makes automation of the composition process possible. There are two types of automation, semi-automatic and automatic. Both require the existence of a domain ontology. A typical system using the semi-automatic method [84] maintains a knowledge base that contains an ontology of services, such as DAML-S or OWL-S. A matchmaker is used to find a service with the required functionality. All the candidate services that meet the requirement are presented to the user, ranked by quality, at each step. The user makes a choice and continues the process. A typical system using the automatic method often incorporates AI planning technology [13]. The composition process starts from an explicitly defined goal. The workflow composition engine lets the service requester provide the input and output information. This information is

fed into an AI planner. The planner returns one plan, multiple plans, or no result to the end user for a further decision. Although the service composition problem is highly related to the AI planning problem, current planning technologies cannot be directly applied [90]. Services change dynamically and may become unavailable at execution time. A composed workflow that does the desired work at one time may not work at another time. Preventing run-time failures at design time is therefore important.

An issue in the automatic composition of web services is defining the compatibility [55] or connectivity [58] of services. It can be a time-consuming process to check whether the services to be composed can actually interact with each other. For example, the output of one service must satisfy the required input of the subsequent service in a workflow. Automatic composition also requires a way to verify the soundness and correctness of the composite services. Much research has explored the use of AI planning techniques for automating the composition process. Whether the current planning techniques can be used or extended for the service composition process and the modeling of services is still an open research problem. The application most often used to motivate research in automatic service composition is a virtual travel agent; the motivation typically lacks a real-world example. At present, this approach may be practical in domains with well-defined ontologies and a small number of available services. We believe the semi-automatic approach is more practical when a large number of services exist in the domain.
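The compatibility check described above, in which the output of one service must cover the required input of the next, can be sketched for a linear workflow as follows. This is a simplified illustration and not the method of any cited system: the `Service` class, the function `incompatible_links`, and the service and data type names are hypothetical, and data types are compared as plain strings, whereas a semantic approach would compare ontology concepts.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Service:
    """Hypothetical service signature: named input and output data types."""
    name: str
    inputs: Set[str]
    outputs: Set[str]


def incompatible_links(workflow: List[Service]) -> List[str]:
    """Report each adjacent pair whose downstream inputs are not all produced upstream."""
    problems = []
    for upstream, downstream in zip(workflow, workflow[1:]):
        missing = downstream.inputs - upstream.outputs
        if missing:
            problems.append(
                f"{upstream.name} -> {downstream.name}: missing {sorted(missing)}")
    return problems


# Illustrative services; names and types are assumptions for this sketch.
fetch = Service("FetchSequence", {"accession"}, {"fasta"})
align = Service("Align", {"fasta"}, {"alignment"})
tree = Service("BuildTree", {"newick"}, {"tree"})

valid_report = incompatible_links([fetch, align])
broken_report = incompatible_links([fetch, align, tree])
print(valid_report, broken_report)
```

Running such a check at design time catches one class of failure before execution, although it says nothing about services that become unavailable later.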

2.3.4 Service execution

Service execution is a process in which an atomic service or a composite service is invoked and results are returned to requesters. Atomic web services can be created with different languages and deployed on various platforms. Two major platforms are J2EE and .NET. Since execution of atomic services does not require results from other services, the technologies that support atomic services are relatively mature (see Table 2.2). Service execution for composite services depends on the composition model and the existing execution engine support. The industrial-standard-based model can be translated into a particular workflow specification, such as BPEL4WS, and executed on a workflow engine. The semantic-web-based model can be represented using the DAML-S specification and executed on a DAML-S Virtual Machine [84] or an OWL-S execution engine. Since there is no standard service composition specification, each composition approach and platform provides its own specifications and languages for composite services and executes the workflow on a specific workflow execution engine. There are also composition toolkits that convert a visual graph composition of services into a language-specific workflow. Several issues exist in the service execution process.

Synchronous vs. asynchronous communication. Web service technology is message-passing oriented, so the architecture should be able to support different message passing methods. Most service-oriented frameworks, such as Axis [3], support only synchronous invocation, which blocks the requesting process until the response from the service provider arrives. The loosely coupled nature of web services requires a more flexible invocation method. The requester should not be blocked because it is waiting for the response from

providers. Various research has been done to support this asynchronous communication method [107] [113].

TABLE 2.2
EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES

Service type | Specification | Execution engine
Atomic service | WSDL | Implemented using Java, C++, Perl, Python on .NET, J2EE, gSOAP, SOAP::Lite
Atomic service | OWL-S | OWL-S execution engine
Composite service | BPEL | BEPL4J
Composite service | OWL-S | OWL-S execution engine
Composite service | DAML-S | DAML-S virtual machine
Composite service | XScufl | Freefluo

Centralized vs. decentralized execution of composite web services. Although most composite service execution engines invoke individual atomic services on distributed service providers, the engine acts as a centralized coordinator for all interactions among these atomic services. Decentralized execution allows independent sub-workflows to interact with each other without any centralized control. It can reduce the amount of network traffic. Mangala Gowri Nanda et al. [60] present an algorithm that partitions a composite service in BPEL into independent sub-processes. Each service provider should host a BPEL engine. Their experimental results show that decentralized execution can increase throughput substantially. Roger Weber et al. [100] present a peer-to-peer based execution system. In this system,

when a node finishes its part of the work, the data is migrated to nodes offering a suitable service for one of the next steps in the process. Boualem Benatallah et al. [6] present an environment where a composite service can be executed in a decentralized way within a dynamic environment.

Monitoring service and workflow execution. One issue in service execution is that a selected service in the workflow may be unavailable or temporarily off-line. The execution engine then invokes an alternative service, if one was defined in the workflow at the service composition stage. Service execution often takes some time to complete. Service requesters may require a monitoring service so that they can query the status of their requested services. Monitoring the service execution status is therefore another important issue. The experience gained in grid computing research may be adopted in SOA for building a reliable infrastructure for service execution.

2.4 Service-oriented computing in e-science

An individual life sciences researcher or research group starts a scientific project by developing hypotheses, designing experiments to test those hypotheses, collecting observational data, and publishing results. The published data allows other researchers to build upon or verify the results. With the assistance of computer software, users can import the raw data, click on buttons, and retrieve the results. The analysis process, however, requires certain knowledge of how to use these toolkits and how to access the data from different locations. Even for users who possess this knowledge, this manual analysis process is a bottleneck when large data sets are involved. As the World Wide Web becomes a platform for scientific study (e-science), research data can be published on the web to be shared with other

researchers. These data can be distributed in various formats (such as RDBMS tables, text files, or XML documents) depending on the preferences and needs of research groups. Manually accessing these data files becomes difficult because the data may come from different institutes and different research groups, and in different formats. There is a need for a methodology that frees users from having to locate the data sources, interact with each data source, and manually combine data in multiple formats from multiple sources. Applying semantic web and web services technologies to support life sciences research is a promising solution to this difficulty. As the adoption of web services in the life sciences field grows, many large public resource sites are publishing web service interfaces in WSDL format to make their data and analysis tools accessible to the research community; see Table 2.3.

TABLE 2.3

LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES

| Service provider | Description | Resources URL |
|---|---|---|
| NCBI (the National Center for Biotechnology Information) | Provides a variety of E-Utility web services to allow data retrieval against the NCBI database using WSDL and SOAP | static/esoap help.html |
| EMBL-EBI (the European Bioinformatics Institute) | Provides a number of web services for data retrieval, data analysis tools, and ontology lookup using WSDL and SOAP | |
| DDBJ (the DNA Database of Japan) | Provides web services for data retrieval and data analysis against the DDBJ database using WSDL and SOAP | |
| KEGG (the Kyoto Encyclopedia of Genes and Genomes) | Provides web services for data retrieval and data analysis against the KEGG database | |
| SeqHound | Provides web services for data retrieval from the sequence and structure database | documentation.html |
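To make the WSDL and SOAP access style listed in the table more concrete, the following is a minimal sketch of how a client might assemble a SOAP 1.1 request envelope using only the Python standard library. The operation name `fetchSequence`, the parameter `accession`, its value, and the service namespace are hypothetical placeholders and do not correspond to any provider in the table; a real client would take these from the provider's WSDL. No network call is made.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_request(operation, params, service_ns="urn:example:sequence-service"):
    """Return a SOAP 1.1 request envelope as a string.

    `operation` and `service_ns` are hypothetical; in practice both are
    dictated by the WSDL document published by the service provider.
    """
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element carries one child element per input parameter.
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{service_ns}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Illustrative request with a made-up accession value.
request_xml = build_soap_request("fetchSequence", {"accession": "EXAMPLE001"})
print(request_xml)
```

In practice the envelope would be sent over HTTP POST to the endpoint named in the WSDL, and most toolkits generate this plumbing automatically from the service description.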

In e-science, a number of legacy data analysis tools are designed to be command-line applications. Soaplab, developed by EBI, is a SOAP-based web service utility used to wrap such command-line applications into web services.

Recently, service-oriented computing middleware capable of supporting life science experiments has been developed. We believe that an ideal service-oriented architecture should allow service and data providers to publish their information into registries with semantically defined properties using domain ontologies; allow not only experts but also end-users to define their workflows at a high level of abstraction using vocabulary provided in the domain ontology; support the execution of workflows and the monitoring of the execution process; allow existing workflows to be reused in whole or in part; and support data provenance management. Several workflow management systems have been developed to meet this goal.

Discovery Net is a service-oriented computing system for knowledge discovery, based on an open architecture that re-uses common protocols and common infrastructures such as the Globus Toolkit. It is a multidisciplinary project serving application scientists from various fields including biology, combinatorial chemistry, renewable energy research, and geology. The system allows service providers to publish data mining and data analysis software components as services. It allows data owners to provide interfaces and access to scientific databases, data stores, sensors, and experimental results as services. It also allows users (scientists) to plan, manage, share, and execute complex knowledge discovery and data analysis procedures. Besides re-using common protocols and infrastructure, Discovery Net defines its own protocol, DPML (Discovery Process Markup Language), for constructing and managing knowledge discovery procedures, as well as recording their history. The defined data analysis task (scientific workflow) can be executed on distributed resources, stored, shared, and re-executed.

Pegasus [34] [23] [2] is a framework that enables the mapping of complex scientific workflows onto the Grid. In the Pegasus system, an abstract workflow is a workflow in which the workflow activities (software components) are independent of the Grid resources used to execute them. The abstract workflow depicts the main steps in the scientific analysis, including the data used and generated, but does not include information about the resources needed for execution. Abstract workflows can be constructed using Chimera VDS (the GriPhyN Virtual Data System) or can be written by users from a workflow template. The concrete workflow represents an executable workflow that includes details of the execution environment. It also includes the data movement necessary to stage data in and out of the computations. Other nodes in the concrete workflow may include data publication activities, where newly derived data products are published into the Grid environment. A major focus of research on the mapping of abstract workflows to concrete workflows in the Grid computing environment is how to find an appropriate, currently registered resource at each step. Extra service components, such as data transfer and data registration in the grid environment, may have to be encapsulated in the workflow. This mapping process may be automated with algorithms and AI planning technologies if the resources are semantically well described. During the mapping process, the workflow may be restructured, reordered, and refined to improve the overall performance and to

adapt to dynamically changing execution environments. The concrete workflow can be given to Condor's DAGMan for execution.

myGrid is a service-oriented computing middleware that supports life sciences researchers in the construction, execution, and sharing of scientific workflows using the Taverna workbench. Researchers can use the graphical workbench to drag and drop service components into the model explorer. Recent myGrid developments focus on supporting users in the discovery and composition of services by using rich service annotations to make workflow design more accessible to non-expert users. With semantic web technology incorporated, services and workflows can be described using domain-specific ontologies. This is a valuable capability in a system potentially searching over thousands of services. Instead of locating available Grid resources, the semantic web enabled service annotation and discovery in myGrid is used to locate the software components or data that are exposed as web services. The executable workflow is written in the XScufl language and executed in the Freefluo workflow engine. Life sciences researchers can monitor the execution status through the Taverna workbench. In the myGrid system, the Feta data model is used to represent the semantic description of available services [50]. Web services are annotated with terms from an OWL-based myGrid domain ontology [103] using Pedro, a GUI-based interface [33]. This approach is more lightweight than the OWL-S and WSMO ontologies, although it is less expressive of details that could better support automation. Although the description methods adopted in myGrid have limited expressivity, they are sufficient for describing most services, and their simplicity makes them more practical

for describing large numbers of services.

The IRIS [74] project is another project that targets discovery, composition, and interoperability of services required within in silico life science experiments. The IRIS project uses a semi-automatic procedure for identifying and placing customizable adapters (mediators) into workflows built by service composition. In IRIS, the capabilities of a mediator are described using the Mediator Profile Language (MPL). MPL is developed as a top-level ontology using the Web Ontology Language (OWL).

BioMOBY is an open source research project which aims to generate an architecture for the discovery and distribution of biological data through web services [101]. Decentralized data and services are registered at a centralized registry called MOBY Central. The BioMOBY project focuses on service description, discovery, transaction, and simple input/output object type definitions. This foundational set of functions allows client programs to expand on the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY). The BioMOBY project recently integrated access to many BioMOBY features into the Taverna workbench interface using a Taverna plug-in. Users are guided through the construction of syntactically and semantically correct workflows from the graphical interface [44].

Open Middleware Infrastructure Institute UK (OMII-UK) is a project that aims to provide software and support to enable the collaboration

of building infrastructure for the UK e-science community and its international collaborators. The OMII environment integrates other open-source software components to provide users with a secure web services hosting and execution environment. Users can deploy web services at different levels in the OMII server architecture: as a normal Axis web service or as a secure web service with WS-Security support. GridSAM provides a web service for submitting and monitoring jobs managed by a variety of Distributed Resource Managers (DRM). Its modular design allows third parties to provide submission and file-transfer plug-ins to GridSAM. The environment also integrates GRIMOIRES, a registry service, to provide descriptions of services and workflows. The GRIMOIRES implementation extends the UDDI specification and provides not only syntactic descriptions but also semantic descriptions. The OGSA-DAI middleware provides data integration and a secure infrastructure for exposing data resources as web services in a grid or any other context. WSRF::Lite is the Perl follow-on to OGSI::Lite; it implements the Web Services Resource Framework (WSRF), which was inspired by and supersedes the Open Grid Services Infrastructure (OGSI). WSRF::Lite provides support for the following web service specifications: WS-Addressing, WS-ResourceProperties, WS-ResourceLifetime, WS-BaseFaults, WS-ServiceGroup, and WS-Security.

2.5 Conclusion

In this chapter, we introduce several concepts related to SOA and discuss the integration of these technologies to solve some open issues in SOA research. Applying semantic web technology is intended to automate the web service discovery and composition process with little or no human guidance. The challenges are: 1) defining a high quality domain ontology; 2) interoperability of the

ontology among different domains; 3) correct annotation of large numbers of web services and data using the ontology; and 4) an agreed-upon definition of service composability, soundness, and scalability. The use of AI planning technologies in the service composition process has largely been studied at the theoretical level and is often demonstrated with a well-defined, small domain, such as a travel agency, instead of large real world applications.

Services provided in the Grid architecture, in particular by the Globus Toolkit, can be exposed with a web services interface and composed into a workflow. Combining SOA with Grid computing technology allows the creation of virtual organizations and groups, provides a service-oriented architecture that is more efficient and flexible with resource allocation and data transfer (such as the gftp tool), and enables an increased level of privacy inside and between virtual organizations. Because Grid computing and service-oriented architecture are converging, many standards and specifications are constantly being expanded, updated, refined, and made obsolete rather rapidly, and it is hard to keep up with them. For example, the Open Grid Services Infrastructure (OGSI) was published by the Global Grid Forum (GGF) as a proposed recommendation in June. It was intended to provide an infrastructure layer for the Open Grid Services Architecture (OGSA). OGSI is now obsolete and has been superseded by the Web Services Resource Framework (WSRF). With the release of GT4, the open source toolkit is migrating back to a pure web services implementation (rather than OGSI) via integration of the WSRF.

Applying peer-to-peer technology can help to avoid a central point of failure and increase scalability during the service discovery and workflow execution process.

Service-oriented computing is a new research area, with many in-progress

frameworks and middleware, workflow specifications, WS-* standards, and ontological representations that have been presented without complete tool support. Many areas of research still need to be addressed in order to build a complete, reliable, and ideal service-oriented architecture.

CHAPTER 3

A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS ENVIRONMENT FOR BIOINFORMATICS RESEARCH

In this chapter¹, we present a practical experiment of building a service-oriented system upon current web services technologies and bioinformatics middleware. The system allows scientists to extract data from heterogeneous data sources and generate phylogenetic comparisons automatically. This can be difficult to accomplish using manual search tools, since sequence data is rapidly accumulating and those manual tools need to be invoked repeatedly as new data becomes available. A web-based environment enables scientists to more effectively define a task, perform the task at a desired time, monitor the execution status, and view the results. The first prototype of this system is evaluated on a phylogenetic research application, Mother of Green (MoG). Our evaluation demonstrates that a service-oriented architecture can accelerate scientific research, increase research productivity, and provide a new approach to doing science. We also discuss issues in the design and implementation of the system and identify our future research directions to enhance the system.

¹ Portions of this chapter appear in the 40th Annual Hawaii International Conference on System Sciences, HICSS40, Hawaii, 2007 [110].

3.1 Introduction

As biological research becomes increasingly data driven, scientists are conducting experiments using the cyberinfrastructure (in silico experiments) to gather information from public online databases and to test their hypotheses. These heterogeneous, independently developed data sources make traditional approaches insufficient for this type of research and experimentation. Complex queries against several of these databases may provide valuable new insights, but interoperability problems make this difficult. The researcher must often manually cut and paste data from one database resource to another and repeatedly use multiple tools to format and analyze the data, a process that may take days or weeks. In many investigations, the process stops once the scientist requires a workflow that is not feasible using manual retrieval and analysis. There is a demand for a methodology that frees users from having to locate the data sources, interact with each data source, and manually combine data in multiple formats from multiple sources.

A promising way to achieve seamless interoperability among these data sources and analysis tools relies on the emerging technology of service-oriented architecture (SOA). SOA has been recognized during the past few years as an approach to achieving interoperability among multiple data sources [91] [92]. Many large bioinformatics database providers, such as NCBI, EMBL, and DDBJ, already make their databases available via a SOA. Emerging toolkits and platforms, such as Soaplab [87], enable many data analysis tools to be wrapped as web services. These existing services permit software engineers to build unified interfaces for scientists to access heterogeneous data sources. The platform independence of SOA makes it a feasible solution for integrating increasingly available data analysis tools.
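The notion of a unified interface over heterogeneous data sources can be illustrated with a small sketch. The class names, record fields, and the two in-memory stand-ins below are assumptions made purely for illustration; they are not the interfaces of NCBI, EMBL, DDBJ, or of the system described in this chapter. Each adapter hides one source format behind the same method, so client code is written once.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SequenceSource(ABC):
    """Common interface seen by the client, whatever provider sits behind it."""

    @abstractmethod
    def find_sequences(self, taxon: str) -> List[Dict[str, str]]:
        ...


class TableSource(SequenceSource):
    """Hypothetical provider that returns (accession, taxon, sequence) rows."""

    def __init__(self, rows):
        self._rows = rows

    def find_sequences(self, taxon):
        return [{"id": acc, "taxon": tax, "sequence": seq}
                for acc, tax, seq in self._rows if tax == taxon]


class FlatFileSource(SequenceSource):
    """Hypothetical provider that returns FASTA-like text with 'id|taxon' headers."""

    def __init__(self, text):
        self._text = text

    def find_sequences(self, taxon):
        records = []
        for block in self._text.strip().split(">")[1:]:
            header, _, seq = block.partition("\n")
            acc, tax = (part.strip() for part in header.split("|"))
            if tax == taxon:
                records.append({"id": acc, "taxon": tax,
                                "sequence": seq.replace("\n", "")})
        return records


def query_all(sources, taxon):
    """Query every source through the shared interface and merge the results."""
    merged = []
    for source in sources:
        merged.extend(source.find_sequences(taxon))
    return merged


sources = [
    TableSource([("T1", "taxonA", "ATGC"), ("T2", "taxonB", "GGCC")]),
    FlatFileSource(">F1|taxonA\nATG\nCGG\n>F2|taxonB\nTTAA\n"),
]
records = query_all(sources, "taxonA")
```

In a service-oriented setting each adapter would wrap a remote web service call rather than in-memory data, but the client-facing contract would stay the same.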

While protocols, toolkits, and middleware are increasingly available to address the majority of the technical issues in building a data integration and data analysis environment, the question of how real world problems can be solved successfully using these technologies needs to be answered through practical implementations in a real world context. In this chapter, we describe the design and implementation of a web-based data integration and analysis environment. The underlying infrastructure is built upon current web service technologies and bioinformatics middleware to enable biologists to better utilize heterogeneous genomic data.

The first prototype of the system is used in a phylogenetic research application, the Mother of Green (MoG). MoG is a collaborative research project on plastid phylogenetic analysis involving information technologists and biologists. Genomic sequence data is accumulating faster than scientists can find and analyze it using manual search tools. The SOA-based platform allows scientists to extract data and analyze phylogenetic comparisons automatically. The web-based environment enables scientists to more effectively define a task, perform the task at a desired time, monitor the execution status, and view the results. The overall aim of this project is to provide an easy-to-use environment for biologists to research the puzzle of plastid phylogeny and to answer an open question on the phylogenetic history of the plastid genome.

In the rest of this chapter, we briefly review web service technologies and related work, followed by an overview of the MoG project and a description of the overall system architecture. We then describe a prototype implementation of the system, related issues, and extensions of the system.
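The task lifecycle mentioned above (define a task, perform it at a desired time, monitor its status, and view the results) can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `Task` fields, the `Status` values, and the `run_due_tasks` helper are hypothetical names and do not describe the actual implementation of the system.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Callable, Dict, List, Optional


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Task:
    task_id: int
    workflow: str              # hypothetical workflow name
    params: Dict[str, str]     # input parameters chosen by the user
    run_at: datetime           # desired execution time
    status: Status = Status.PENDING
    result: Optional[str] = None


def run_due_tasks(tasks: List[Task], now: datetime,
                  execute: Callable[[Task], str]) -> None:
    """Run every pending task whose scheduled time has passed."""
    for task in tasks:
        if task.status is Status.PENDING and task.run_at <= now:
            task.status = Status.RUNNING
            try:
                task.result = execute(task)
                task.status = Status.DONE
            except Exception as exc:  # record the failure for later inspection
                task.result = str(exc)
                task.status = Status.FAILED


tasks = [
    Task(1, "align", {"taxon": "taxonA"}, datetime(2007, 1, 1, 2, 0)),
    Task(2, "align", {"taxon": "taxonB"}, datetime(2007, 1, 2, 2, 0)),
]
run_due_tasks(tasks, datetime(2007, 1, 1, 3, 0),
              lambda t: f"ran {t.workflow} for {t.params['taxon']}")
```

Monitoring then amounts to reading `status` for a given `task_id`, and viewing results to reading `result`; a persistent implementation would keep these records in a database rather than a list.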

3.2 Related work

The service-oriented architecture (SOA) was proposed initially as an emerging paradigm for business process integration inside or across organization boundaries. It is gaining significant attention from the scientific research community for use in building e-science infrastructures. The proposed standard in grid computing, the Open Grid Services Architecture (OGSA) [63], is built upon service-oriented architecture and demonstrates the convergence of the Grid with SOA.

Three basic standards in SOA, Simple Object Access Protocol (SOAP), Web Services Description Language (WSDL), and Universal Description, Discovery and Integration (UDDI), are sufficient for providing simple atomic services. However, single atomic services are not adequate for developing complex applications. One of the most important features of SOA is that services developed by different groups can be combined into a workflow to solve complicated problems. This feature leads to several research issues and challenges, including service discovery, service composition, and service enactment. Semantic web technology [54] [7] and peer-to-peer technology are used in SOA to automate the service discovery process and make service enactment more reliable.

BioMOBY is an open source research project which aims to generate an architecture for the discovery and distribution of biological data through web services [101]. Decentralized data and services are registered at a centralized registry called MOBY Central. The BioMOBY project focuses on service description, discovery, transaction, and simple input/output object type definitions. The architecture provides a set of foundational functions that allows client programs to expand on the specification

to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY). REMORA [14] is a web server implementation based on the BioMOBY service specification. It provides life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows, and a survey system.

Another project, myGrid, provides e-science application developers with a toolkit based upon a high-level middleware layer. It builds on and extends the Grid framework of distributed computing through a SOA. It provides not only a semantic based service discovery system but also the Taverna workbench [65], personalized data repositories, provenance, and update notification. The direct users of myGrid are those who build applications using the myGrid toolkit [94]. Compared to the BioMOBY project, myGrid has more ambitious goals. Bioinformaticians, tool builders, and service providers can collectively or selectively employ these middleware services to produce applications that support research in the biological and life sciences [36].

The IRIS [74] project is another active project that targets the discovery, composition, and interoperability of services required within in silico experiments. The IRIS project handles this problem through a semi-automatic procedure for identifying and placing customizable adapters into workflows built by service composition.

Web Service for Bioinformatic Analysis Workflow (WsBAW) [106] and Bioinformatic Workflow Builder Interface (BioWBI) [9] are two projects provided by IBM alphaWorks to allow life science researchers to build and execute bioinformatics workflows and share their analysis processes. WsBAW is an application that automates bioinformatic workflow by deploying

a web service. It consists of a client application through which users are able to send batch requests to a specific bioinformatic workflow execution engine, such as BioWBI, by using a web service. BioWBI is an easy-to-use, web-based working environment from which a life sciences researcher or a research community can build and execute bioinformatic workflows and share their analysis processes.

3.3 Motivation

The motivating application is the phylogenomics of the plastid. Named the Mother of Green (MoG) project by a multidisciplinary team of computer scientists and biologists, MoG aims to identify the most recent common ancestor of all plastids. While many biologists support the view that all plastids are descended from a single endosymbiont ancestor, the data are not conclusive due to missing information and inefficient use of existing information. Using the nucleotide and amino acid sequences of expressed genes to infer ancient ancestral relationships, MoG investigators hope to identify which of the ancestral plastid genes have traveled into the host nucleus and why some genes are more likely to be transferred than others. The rate of data accumulation, the rapid development of new phylogenetic analysis tools, and the refinement of existing tools simply overwhelm the

researchers. The biologists need a better approach than manual or ad hoc scripting to accumulate and analyze enough relevant data to rigorously test the single ancestor hypothesis.

Use case

A typical phylogenetic analysis process consisting of multiple manual data collection and data analysis steps is described below and shown in Figure 3.1.

[Figure 3.1. A manual phylogenetic data collection and data analysis process. Steps shown: A) query complete genome sequences given a taxon; B) query protein coding genes for each genome sequence; C) eliminate vector sequences; D) sequence alignment; E) phylogenetic analysis.]

A) Biologists send a query to a data provider, NCBI for example, through a web-based interface to retrieve the whole genome sequences of a specified taxon. After recording the query terms and results, the investigator must examine the list of sequences, delete inappropriate entries, and then add new entries based on their knowledge of plastid phylogenomics or from sequences generated in their own lab.

B) For each whole genome sequence, biologists need to find specific protein coding genes, the specific subunits of protein coding genes, or specific active sites within a specific gene or subunit. This is an iterative process for each entry in the list.

C) Each nucleotide sequence must be checked for vector sequences, a common contaminant of nucleotide sequences in unvetted public databases, and any detected vector contaminants must be removed.

D) Biologists then choose a subset of these genes and use a sequence alignment program (e.g., ClustalW) to align the sequences. After viewing the results, biologists may decide to choose another subset for sequence alignment analysis or continue the comparison using phylogenetic tree building tools.

E) Once the initial sequence alignment results prove satisfactory, biologists convert the alignment output to the data format required by the phylogenetic analysis programs, such as PAUP or Phylip.

Operational barriers

The data retrieval and data analysis processes need to be repeated multiple times, as different hypotheses are evaluated and new data pours into the public databases. From an operational perspective, this repetition makes the research process time consuming or even impossible using manual approaches. Other barriers make this particular scientific research process even more difficult.

Data collection. The capabilities offered by a data retrieval system cannot always meet the requirements of scientists. Entrez [61] is a web-based data retrieval system available from NCBI that provides integrated access to multiple databases covering a variety of data domains, including complete genomes, nucleotide and protein sequences, gene sequences, three-dimensional molecular structures, literature, and more. However, scientists are sometimes unable to get the desired information with a simple query. For instance, finding all of the subunits of the plastid ATP synthase requires that the investigator first identify the official protein names of all subunits, of which there are many (atp alpha, atp beta, atp gamma, atp delta, atp epsilon, and so on), for the plastid-specific ATP synthase. The next step is to retrieve these sequences for each new genome and to merge these data with the data previously retrieved.

Analysis tool usage. Each data analysis program may have different requirements for input data formats, even among programs providing similar functionality. Correct use of these programs and correct implementation of the workflow rely heavily on the researcher having detailed knowledge and understanding of each tool. A typical work unit might be: find all of the sequences for the ATP synthase alpha subunit that are most similar to the ATP synthase alpha sequence found in Prochlorococcus, align the sequences using ClustalW, save that output, then reformat the data and submit the sequences to Phylip for phylogenetic analysis. The output from one data analysis program needs to be fed into the next program as its input, with appropriate conversion to the required data format. The rapid development of new data analysis tools and the refinement of existing tools make the manual data conversion process even more difficult.

Experimental record keeping. Accurate recording of an in silico investigation, including materials, methods, and results, is as important as accurate recording of bench top or field experiments. Keeping the provenance data, including the input, output, and intermediate data sets, is also critical. Manual organization of these metadata quickly approaches impossibility for anything but the most trivial of queries.

An easy-to-use environment is essential to support the automation of deep phylogenetic analysis. For many years the data were sparse. Now mountains of data exist, but our limited 20th century tools do not properly equip us to mine for the gems within them. Automation has become necessary.

3.4 System architecture

The whole system, MoGServ, includes an underlying infrastructure, the MoGServ middle layer, and a web-based environment that provides an easy-to-use interface for scientists to access functions provided by the middle layer. The system acts as both service consumer and service provider in the context of SOA. While it consumes and aggregates services provided by other service providers, the system also provides services that can be used and integrated by other applications.

There are two roles in the design and implementation of the system: end-users and software developers. End-users are biologists who focus on what information needs to be gathered and what data analysis needs to be performed. The software developers are responsible for several tasks based on end-users' requirements: collecting and annotating available services; creating services to implement functions in the specific application; building workflows to automate a variety of tasks required by end-users; providing a flexible, high performance, fault-tolerant infrastructure to execute the workflows; providing a mechanism for end-users to keep track of the origin of the data (data provenance); and providing end-users with a web interface to configure a task, monitor the execution status, and

view results. An overview of the MoGServ system architecture is given in Figure 3.2.

[Figure 3.2. MoGServ system architecture, which includes a services access client, the MoGServ middle layer, and other data and services providers. Components shown: Web Interface (services access client); Application Server, Data Access Services, Data Analysis Services, Local Data Storage, Job Manager, Job Launcher, Service/Workflow Registry, Metadata Search, and Workflow/SOAP Engines (MoGServ middle layer); NCBI, DDBJ, EMBL, and other applications and services (data/services providers).]

Data storage and access service

Data collection from multiple distributed data resources is one of the first steps of a bioinformatics research project. In the MoG project, an in silico experiment involves the collection of large data sets, a computation and memory intensive process that involves daily checking for new information and quality control for each new sequence detected. Some data service providers limit the number of connections to their data servers because of performance concerns. The refresh rate of the

data in a data source is much lower than the rate of end-user requests for the data. Therefore, local data storage is required to store biological data collected from remote data providers, to avoid repeated vetting of the same data, and to ensure access to the data for time sensitive projects. The biological data from remote data sources is gathered, aggregated, and integrated into the local database through a set of data access services. An in silico experiment also requires the integration of results from numerous data analysis tools. Recording the intermediate data in the local database allows MoGServ to preserve the data provenance and gives end-users the opportunity to keep track of where a piece of data has come from. The information stored in the local database can be accessed through a set of data access services.

Service and workflow registry

A service and workflow registry provides a repository to store descriptions of services and workflows that may be used in a phylogenetic study. These services and workflows include both locally constructed and preexisting services. The registry also provides functions to allow inquiries about services or workflows. In the first prototype, neither a UDDI-based registry nor semantic-based descriptions are employed. A UDDI-type registry is more business-oriented and may not be a perfect fit for this application, while semantic-based description requires more time to define a commonly used ontology. The current registry is a simple table focused on capturing both functional and non-functional properties of services and workflows to support service selection, service and workflow enactment, and provenance data representation. Semantic-based description and inquiry provides the attractive capability of automating service discovery and will be used in the

71 TABLE 3.1 ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION Attributes id name text description location input/output provider version algorithm invocation method Description a unique sequence number assigned to the service/workflow during the registration process the name of the service or workflow description of the functions provided by the service or workflow the URL of the definition of the workflow or WSDL location of the service description of input/output parameters the name of the service or workflow provider the version of the service or workflow implementation the algorithm used in the service or workflow implementation the method used to execute the service or workflow next version of MoGServ. The description of a service or workflow includes attributes as shown in Table 3.1. When end-users view results from their experiments, they may ask a question which algorithm was used to generated the data and what is the source of the data? Service consumers may prefer a service or workflow based on their preference for a particular algorithm or provider. For example, a sequence alignment service can be implemented using the Sequence Alignment and Modeling System (SAM) or ClustalW Indexing and querying metadata The data is best managed with a relational database; however, for searching purposes, an indexer is more efficient. We identify and extract metadata about 56

72 additional actual data sequences, experiments, services and workflow descriptions in the local database. For example, the metadata of a gene sequence includes gid, accession number, name of the sequence, from which organism, and taxonomy. An experiment can generate results that leads to new or more detailed information requirements and a new series of experiments. End-users may need to know the origin of a piece of data which query was used to get this subset of sequence, when was the data generated, what process was used to generate the results. This may lead to new experiments using different data sets or even different methods. These metadata are extracted and indexed by a metadata indexing service. This service is triggered when new data is added into the database. A metadata searching service provides functions to query an index Service and workflow enactment The system supports both synchronized invocation and asynchronized invocation methods. Synchronized invocation is mostly used for invocation of services or workflows with short running times, e.g. querying sequence data or job information in a local database. Asynchronized invocation is used for executing long running services and workflows. As shown in Figure 3.3, the job manager accepts the input parameter of the service/workflow, service/workflow id, and timer. The definition of the services and workflows is found in the registry. A job definition including the services or workflows URL, input parameter, timer, and other metadata of the job information (such as when and who submitted this job) is stored in the database. A job id used to identify the job is generated. The job launcher periodically checks the database to retrieve a service and workflow which needs to be executed at a time 57

73 point. Multiple workflow engines are deployed on different nodes to prevent single engine failure and achieve higher performance. A similar mechanism is used for deploying long running services to prevent service failure. Each node hosts a service that is responsible for returning the current load information of the node. This information is used by the job launcher to dispatch a job to an optimal node. With the SOA, it is easy to distribute and invoke workflows and services remotely. The execution status of the workflow or service is recorded into the database as an attribute of a job description. This information can be used for implementing failure recovery functions, such as restart. The job information accessed through data access services allows end-users to monitor the execution and view the results. Service and workflow enactment Service/Workflow Registry INPUT Find the service/workflow definition using the task name Parameters Task Name Job Manager Form a Job Description Job Information Job Launcher Timer Output Job ID Instances of of Workflow/Service Engines Ph.D defense 18 Figure 3.3. Asynchronized services and workflow invocation model 58
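The asynchronous invocation model in Figure 3.3 can be illustrated with a minimal sketch. It is not the MoGServ implementation: the class names, the in-memory dictionary standing in for the job table, and the `node_loads` callback standing in for the per-node load services are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    job_id: int
    task_url: str            # location of the service/workflow, looked up in the registry
    params: str              # single input parameter document
    run_at: float            # the "timer": earliest time the job may start
    status: str = "submitted"
    node: Optional[str] = None


class JobManager:
    """Forms a job description and stores it (hypothetical in-memory job table)."""

    def __init__(self, registry: Dict[str, str]):
        self.registry = registry          # task name -> definition URL
        self.jobs: Dict[int, Job] = {}
        self._ids = itertools.count(1)

    def submit(self, task_name: str, params: str, run_at: float) -> int:
        url = self.registry[task_name]    # find the definition using the task name
        job = Job(next(self._ids), url, params, run_at)
        self.jobs[job.job_id] = job
        return job.job_id                 # the job id is returned to the caller


class JobLauncher:
    """Periodically polls stored jobs and dispatches due ones to the least loaded node."""

    def __init__(self, manager: JobManager, node_loads: Callable[[], Dict[str, float]]):
        self.manager = manager
        self.node_loads = node_loads      # assumed stand-in for the per-node load services

    def poll(self, now: float) -> List[Job]:
        dispatched = []
        for job in self.manager.jobs.values():
            if job.status == "submitted" and job.run_at <= now:
                loads = self.node_loads()
                if not loads:
                    break                 # no engine available; try again on the next poll
                job.node = min(loads, key=loads.get)
                job.status = "running"    # status is recorded with the job description
                dispatched.append(job)
        return dispatched
```

In this sketch a caller would submit a task by name, keep the returned job id, and later query the job record for its status, while a scheduler calls `poll` at a fixed interval.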

3.5 Implementation

Development and deployment tools

Among a large number of programming platforms for web services development and deployment, Microsoft's .NET and Sun's J2EE are typically the two main choices for application and middleware developers. Considering future extensions of the system as well as our previous experience with Java, the J2EE-based platform appeared more suitable for MoGServ. In particular, Apache's open source tools, Tomcat (5.0.18) and Axis (1.2RC2), are used. Tomcat and Axis are active projects with support from the open source community. Another open source software tool, Eclipse, is used to develop the web interface for the system.

There are more than a dozen proposed languages to coordinate messaging and transactions among independent web services. The business process execution language for web services (BPEL4WS) is a promising workflow language since it has wide support from IBM, Microsoft, and BEA. Several workflow enactment engines, such as BPWS4J, Collaxa, and ActiveBPEL, are already in place to support the execution of workflows. While a business-oriented workflow language and corresponding execution engine can be used in the scientific domain [20], the Taverna [65] project possesses more attractive features and is a natural fit for the development of our system. The Taverna project is open source and is part of the myGrid project, developed in the e-science community to support data-intensive in silico bioinformatics experiments. The Taverna workbench provides a graphical tool for building, editing, and browsing workflows and generates an XML-based Simple conceptual unified flow language (Scufl) document. The embedded workflow execution engine, Freefluo, facilitates testing during the development process.

Freefluo, a Java workflow enactment engine that supports the Scufl specification, coordinates execution of the parallel and sequential activities in the workflow and supports data iteration and nested workflows. The enactor can invoke arbitrary WSDL-type service operations as well as more specific bioinformatics service operations such as Soaplab and BioMoby.

Apache Lucene [51] is used in our system to build a search engine that supports full-text search on sequence data, intermediate data results, and job information stored in the local database. Because Lucene is a search engine library written entirely in Java rather than a command line toolkit, it offers the flexibility to write a variety of applications with rich search capabilities. These capabilities include ranked searching, phrase queries, wildcard queries, proximity queries, and fielded searching.

PostgreSQL (8.0) is used to store all the intermediate data results, job information, sequence data, and services/workflow descriptions.

Services provision

We create web services using the RPC style because it is easy to implement and is fully supported by most tools. As most bioinformatics applications take a number of input parameters and produce a number of outputs, we use an XML document to represent the input/output of a service for which a large number of parameters are needed. The XML document is provided as a single input parameter to the service or workflow, and the output results are produced as a single XML document. With this method, service consumers are responsible for creating a valid and accurate XML input document, while service providers parse the XML and extract the input parameters.
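As a small illustration of the single-document convention described above, the sketch below packs several parameters into one XML string on the consumer side and unpacks them on the provider side. The element names (`input`, `param`) and the parameter names are hypothetical; the schema actually used by MoGServ is not given in the text.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_input(params: Dict[str, object]) -> str:
    """Consumer side: pack all parameters into one XML document (assumed layout)."""
    root = ET.Element("input")
    for name, value in params.items():
        param = ET.SubElement(root, "param", name=name)
        param.text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_input(document: str) -> Dict[str, str]:
    """Provider side: parse the single XML parameter and extract the inputs."""
    root = ET.fromstring(document)   # raises ParseError on malformed XML
    return {p.get("name"): (p.text or "") for p in root.findall("param")}


# Hypothetical parameters for a sequence query service.
document = build_input({"query": "ATP synthase", "max_hits": 5})
parameters = parse_input(document)
```

All values arrive as strings on the provider side, so any type conversion or validation is left to the service, which matches the division of responsibility described in the paragraph above.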

Multiple services are created and deployed on the Tomcat/Axis server using the Java2WSDL and WSDL2Java toolkits. Individual services can be invoked statically or dynamically through a client side application. They can also be used as building blocks in the workflow creation process. We separate the services provided in the first prototype into the following categories.

Data collection. The original data source is NCBI. NCBI's Entrez Programming Utilities (eutils) provide access to Entrez data outside of the regular web query interface and help in retrieving search results for future use in other environments. With the eutils SOAP interface, we create services to get data, such as complete genome sequences and specific genes of interest.

Query local database. All the intermediate data and job information are stored in the local database to help biologists keep track of data provenance and monitor job execution. In this particular application, biologists are also interested in selecting sequence subsets from the local database and using sequence alignment services to do preliminary comparisons. A set of services is implemented to query the desired information.

Indexing and querying metadata. The creation and update of each index is done by a service operation. The index service is triggered whenever new data is stored in the database. The query service accepts a query string and an index name, searches the index, and returns the output.

Data format services. Each data analysis tool used in a bioinformatics study requires a specific data format as input. A set of data format services is implemented in the system to convert data into an appropriate format. This type of service can be used in a workflow creation process or used explicitly.

Data analysis services. Many existing data analysis tools in bioinformatics research are available as command line applications. Creating a data analysis service means wrapping these toolkits as web services. JLaunch [42] is a lightweight Java library for launching command line applications from Java programs. With the JLaunch library, we can write Java programs to execute any type of command line program.

Workflow engine

The Freefluo workflow engine is deployed on an application server. The workflow engine is invoked by generating a local stub specific to the Freefluo web services API. The local stub is implemented as part of the job launcher in our system. The execution of a workflow on the Freefluo engine follows these steps:

1) obtain a proxy to the remote Freefluo server;
2) create a Scufl model;
3) pass an XScufl workflow to the Scufl model and form the input using the Baclava data model, a representation of Taverna data types;
4) compile the XScufl workflow as a workflow instance;
5) execute the workflow instance and obtain an ID from the server;
6) poll the Freefluo engine until the execution has completed;
7) retrieve a list of outputs from the server;
8) extract the required output from the Baclava data model; and
9) destroy the workflow instance.

Building workflows

A Scufl workflow represents a procedure as a set of processes and the relationships between these processes. Our workflow design uses available services as building blocks whenever possible and creates new ones when necessary. The Taverna workbench provides a graphical tool to build and test workflows as well as a number of integrated bioinformatics services. The Scufl language has some useful features, such as implicit iteration and conditional branching, that are most important for building workflows in this application.

During the construction of workflows, we often encounter the case where the output of one service does not exactly fit the input of the next chosen service. One approach we take is to create a new service, such as the type of data format service described above, and expose it in the same way as other services. An alternative approach provided in the Taverna workbench is to use Beanshell scripts [4] to convert the output to the appropriate input.

We create a number of workflows using the Taverna workbench to support the research. One example is shown in Figure 3.4. It is a workflow used to retrieve a complete genome sequence and particular gene sequences from the NCBI site. The workflow accepts two inputs, the query term and the particular gene group. The service "genome gids by terms" returns a String of gids, and a Beanshell script converts the String to a list of gids. The service "Get Nucleotide Fasta", a third party service, accepts a gid and returns a sequence in FASTA format. The implicit iteration method in the XScufl workflow enables iteration over all the gids in the list. With the service-oriented architecture, the same services can be used in different workflows, minimizing the need to create new services.

Web interface

The web interface provides scientists with a convenient way to configure their tasks, monitor the job execution status, and view results. It is implemented with a number of server side JSPs (Java Server Pages). The returned results are transformed with appropriate XSLT into HTML pages. The service-oriented architecture provides the flexibility to build the front-end web application with a different language, e.g. Perl, and to deploy it on a different web service engine, e.g. Apache/SOAP::Lite.

3.6 Discussion

Although current development and deployment tools have not implemented all the features claimed in the service-oriented architecture specification, they are actively evolving in that direction. In particular, Apache Tomcat/Axis, the Taverna workbench, and the Freefluo engine enabled the implementation of our first prototype. In general, SOA offers considerable benefits for building the system: 1) the loosely coupled nature of SOA facilitates the distribution of computationally intensive processes across multiple nodes; 2) the platform independence of SOA facilitates the integration of data from heterogeneous data resources through distributed web services; 3) the composition of services allows a service to be reused in multiple workflows, minimizing the need to create new services; and 4) SOA also provides the flexibility to build the front-end web application with a different language, e.g. Perl, and to deploy it on a different web service engine, e.g. Apache/SOAP::Lite.

While we believe a simple SOA architecture is appropriate for the design and implementation of our system, there are various aspects of the system that need to be improved. We summarize the issues and the directions for enhancing the system in this section.
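The third benefit listed above, reuse of services across workflows, together with the conversion step used in the Figure 3.4 workflow, can be sketched in plain Python. This is only an analogy: the functions stand in for remote services, the explicit list comprehension stands in for Scufl's implicit iteration, and the comma delimiter, the stubbed return values, and the second workflow are assumptions made for illustration.

```python
from typing import Callable, List


def split_gids(gid_string: str) -> List[str]:
    """Shim step: turn one delimited string into a list (comma delimiter is assumed)."""
    return [g.strip() for g in gid_string.split(",") if g.strip()]


def iterate(service: Callable[[str], str], items: List[str]) -> List[str]:
    """Explicit stand-in for implicit iteration: call the service once per list item."""
    return [service(item) for item in items]


def genome_gids_by_terms(term: str) -> str:
    """Stub for the gid lookup service; a real service would query the data provider."""
    return "101, 102,103"


def get_nucleotide_fasta(gid: str) -> str:
    """Stub for the third party service that returns one sequence per gid."""
    return f">gi|{gid}\nACGT"


def fetch_genomes_workflow(term: str) -> List[str]:
    """Workflow A: look up gids, convert them to a list, fetch a sequence for each."""
    return iterate(get_nucleotide_fasta, split_gids(genome_gids_by_terms(term)))


def count_hits_workflow(term: str) -> int:
    """Workflow B (hypothetical): reuses the same lookup service and shim."""
    return len(split_gids(genome_gids_by_terms(term)))
```

Both workflows are built from the same lookup service and the same shim, so adding the second workflow required no new service.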

3.6.1 Issues with the first prototype

Security. Although security was not our major concern during the first prototype implementation, it is an important component of the next implementation. Services and workflows provided in the system allow users to access the computational and data resources in the system with no restrictions. A certain level of security is required to prevent abuse of the system and to protect sensitive data and analysis results. An authorization component should be built into the system so that users can access only the services they are permitted to use and can personalize their own workspace. A web portal will be built to enable users to create an account and to log in and out with a username and password. The user account information, including the access level, will be stored in a database. The GridSphere portal framework [39], an open-source portlet-based web portal, is one of the candidates.

Service and workflow description and selection. In the first prototype implementation, the same development group acts as both service provider (creating services and workflows) and service consumer (building the web-based application using these services and workflows). Since there is no demand for supporting the selection of appropriate services/workflows, the major capability of the index-based services/workflows registry is to keep track of data provenance and to provide the definitions needed to perform services/workflows. However, the index-based syntactic description of services/workflows offers limited flexibility for third party service consumers to choose appropriate services/workflows provided in the system and to integrate them into their applications without prior knowledge.

Failure tolerance and recovery. The workflow or service execution may fail at some point due to failure of the enactment engine, failure of the service, or failure of the network fabric [64]. Our system handles these failures during the static workflow design stage and the services or workflows invocation stage. Multiple workflow engines and long running services are deployed at different physical locations. This allows a submitted task to be invoked on the most idle site to achieve higher performance. More importantly, this approach can prevent dispatching services/workflows to an engine with a physical failure. Recording the execution status of long running services/workflows in the database allows us to add policies for determining whether a failed service/workflow should be restarted. The Taverna workbench and XScufl allow users to specify an alternate service and to configure basic fault tolerance mechanisms during the workflow design stage, which can prevent the failure of services to a certain degree. Another more promising, yet more complicated, approach to failure recovery is to support the dynamic selection of alternate services at execution time. However, the implementation of this feature requires services to be described in rich semantic formats using a widely accepted ontology.

Data provenance. In the system, the metadata descriptions of sequences, job information, and services/workflows are stored in the database. A set of indexing and querying services allows end-users to trace the origin of the data, which is a desired feature for scientists. Also, the workflow engine and XScufl provide mechanisms to record more detailed information, including the type of processor, status, start and end time, and a description of the service operation. A systems administrator may be interested in using this information to investigate how results, in particular erroneous or unexpected ones, were produced by workflow processes.
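The restart policy mentioned under failure tolerance and recovery could take a form like the following sketch. The text does not specify a policy, so the attempt limit, the field names, and the rule of avoiding the node on which the job last failed are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MAX_ATTEMPTS = 3  # assumed limit; the source does not state one


@dataclass
class JobRecord:
    """Hypothetical view of the execution status recorded with a job description."""
    job_id: int
    status: str                      # e.g. "running", "failed", "completed"
    attempts: int                    # how many times the job has been started
    last_node: Optional[str] = None  # node on which the last attempt ran


def should_restart(job: JobRecord) -> bool:
    """Restart only failed jobs that have not exhausted their attempts."""
    return job.status == "failed" and job.attempts < MAX_ATTEMPTS


def pick_restart_node(job: JobRecord, node_loads: Dict[str, float]) -> Optional[str]:
    """Prefer the least loaded node other than the one where the job last failed."""
    candidates = {n: load for n, load in node_loads.items() if n != job.last_node}
    if not candidates:
        candidates = dict(node_loads)   # fall back if only the failed node is known
    if not candidates:
        return None                     # no node reported any load information
    return min(candidates, key=candidates.get)
```

A launcher could evaluate `should_restart` for each failed record on its regular polling cycle and resubmit the job to the node returned by `pick_restart_node`.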

3.6.2 Extension of the system

Although the first prototype of the system focuses on design and implementation based on relatively mature technologies in service-oriented architecture, we are extending the system to address some of the issues described above with grid computing and semantic web technologies.

Grid technologies specify the mechanisms for distributed resource management, coordinated fail-over, and security. As Grid technologies, and the Globus Toolkit [97] Grid framework in particular, evolve towards the OGSA standard, integrating Grid technologies into the system can help address some of the issues discussed above. The convergence of service-oriented architecture and Grid technology allows us to enhance the system through the integration of existing components.

In a scientific domain, the process used to generate the output of a service or workflow is often as important as the result. As is the case with bench scientists, in silico investigators will decide for themselves which methods and which data will be used for their study, as well as what kind of outputs they expect. In the first prototype implementation, this requirement is satisfied through close collaboration among team members. As this system will be used by a phylogenomics research community that spans multiple disciplines, different investigators will have their own methods for approaching problems of common interest. A mechanism that allows end-users to define the workflow at a higher level of abstraction is required. Instead of choosing specific services to form a workflow, scientists would rather define a workflow by specifying the functions that a service should provide. Different levels of training and experience also require different levels of abstraction. For example, a graduate student in a particular research domain may have limited knowledge of the methods available to perform an experiment, while an experienced investigator may know ahead of time which building blocks are required and which approach is most efficient for the scientific hypothesis to be tested.

We represent the different abstraction levels in Figure 3.6. End-users may need to define the workflow at any one of these four stages, based on their knowledge of the provided services. A concrete workflow, which can be sent to a workflow engine, is represented at the fourth phase. The conversion from the third phase to the fourth phase involves choosing an instance of a service using Quality of Service (QoS) metrics. One service interface may have multiple implementations provided by different service providers. These implementations have different quality properties such as trustworthiness, cost, and execution time. An optimal service should be chosen during this conversion process.

The conversion from the second phase to the third phase requires mapping a particular task to a service or a sequence of multiple services. This mapping can be accomplished manually by software developers in an ad-hoc way, which is the approach we took in the implementation of the first prototype. This approach relies heavily on the developers' knowledge of the services and of the logical ordering in the workflow. Preferably, this process should be partially or wholly automated. To support this semi-automatic or automatic process, a complete representation of knowledge should be in place to allow software agents to take over the work of the human. Using semantic web technology, in particular OWL and OWL-S, to provide the ontological representation of domain knowledge and the semantic description of services is a promising approach. Semantic web technology offers promising features for supporting bioinformatics research [12]. Some bioinformatics middleware projects, such as myGrid and BioMoby, have their own approaches to supporting automated discovery and composition of services using semantic web technology [49]. Much research has been done exploring AI planning techniques for automating the composition process. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58]. Although there are still practical difficulties in developing semantic web services, we believe that the appearance of tools for creating ontologies and annotating services [89], together with the development of widely accepted domain ontologies, allows us to add semantics to our system and support the automation of the mapping process.

3.7 Conclusion

As both data and tool providers begin to present their resources with web services interfaces, and as open source tools and middleware for supporting web services, workflow generation, and enactment become more available, biologists will begin to use those available services, as well as begin to provide service access to their databases and programs for sharing within the bioinformatics community [65]. Our system is a demonstration of progress toward this goal.

In summary, current SOA standards and toolkits are sufficient to build the first prototype of MoGServ. MoGServ is in its early stage of development, with limited services and workflows available. The basic implemented functionality enables the user to collect data and do preliminary data analysis as well as metadata searching. By using the system, scientists were able to gain some scientific insight into the alpha subunit of ATP synthase; the results indicate that it retains the signal of a very ancient line of descent while having enough polymorphism to infer phylogenetic relationships [78]. Building the system upon the SOA gives us the flexibility to integrate services, to build a variety of workflows, and to build a web portal for scientists to access the system via a web interface. New features and services are continuously being added to the system in response to scientists' feedback and requirements. The future direction of our research will focus on enhancing the system using semantic web and grid computing technologies.

Figure 3.4. A workflow built using the Taverna workbench to get complete genome sequences and specific gene sequences

Figure 3.5. A workflow for querying two subset sequences from the local database, filtering out sequences coming from the same organism, and doing sequence alignment analysis

Figure 3.6. Abstraction of user defined workflows