SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF BIOINFORMATIC DATA AND APPLICATIONS. A Dissertation. Submitted to the Graduate School

SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF BIOINFORMATIC DATA AND APPLICATIONS A Dissertation Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Xiaorong Xiang, B.S., M.S. Gregory R. Madey, Director Graduate Program in Computer Science and Engineering Notre Dame, Indiana April 2007

© Copyright by Xiaorong Xiang 2007 All Rights Reserved

SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF BIOINFORMATIC DATA AND APPLICATIONS

Abstract

by Xiaorong Xiang

Service-oriented architecture (SOA) is a new paradigm for distributed computing that originated in industry. It is recognized as a promising architecture for application integration within and across organizations. Since their introduction, semantic web and web services technologies have gained increasing interest for the implementation of e-science infrastructures. In this dissertation, we survey current research trends and challenges in adopting SOA in general. We present a practical experiment of building a service-oriented system for data integration and analysis using current web services technologies and bioinformatics middleware. The system is enhanced with an ontological model for semantic annotation of services and data. It demonstrates that adopting SOA in the e-science field can accelerate the scientific research process. A new methodology and an enhanced system design are proposed to facilitate the reuse of workflows and verified knowledge.

DEDICATION

To my parents, my husband, and my son

CONTENTS

FIGURES ... vii
TABLES ... xi
ACKNOWLEDGMENTS ... xii

CHAPTER 1: INTRODUCTION ... 1
  1.1 Main contributions of the dissertation ... 4
  1.2 Organization of the dissertation ... 7

CHAPTER 2: RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED COMPUTING ... 8
  2.1 Introduction ... 8
  2.2 Overview of related concepts and technologies ... 10
    2.2.1 Web services ... 11
    2.2.2 Semantic web ... 13
    2.2.3 Grid computing ... 14
    2.2.4 Peer-to-peer computing ... 15
  2.3 Issues in the service-oriented computing ... 16
    2.3.1 Service description ... 17
    2.3.2 Service discovery ... 22
    2.3.3 Service composition ... 29
    2.3.4 Service execution ... 32
  2.4 Service-oriented computing in e-science ... 34
  2.5 Conclusion ... 41

CHAPTER 3: A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS ENVIRONMENT FOR BIOINFORMATICS RESEARCH ... 44
  3.1 Introduction ... 45
  3.2 Related work ... 47
  3.3 Motivation ... 49

    3.3.1 Use case ... 50
    3.3.2 Operational barriers ... 51
  3.4 System architecture ... 53
    3.4.1 Data storage and access service ... 54
    3.4.2 Service and workflow registry ... 55
    3.4.3 Indexing and querying metadata ... 56
    3.4.4 Service and workflow enactment ... 57
  3.5 Implementation ... 59
    3.5.1 Development and deployment tools ... 59
    3.5.2 Services provision ... 60
    3.5.3 Workflow engine ... 62
    3.5.4 Building workflows ... 62
    3.5.5 Web interface ... 63
  3.6 Discussion ... 64
    3.6.1 Issues with the first prototype ... 65
    3.6.2 Extension of the system ... 67
  3.7 Conclusion ... 69

CHAPTER 4: EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH THE MOGSERV ... 73
  4.1 Introduction ... 73
  4.2 System and methods ... 76
    4.2.1 Data model ... 77
    4.2.2 Services ... 79
    4.2.3 Data collection ... 79
    4.2.4 Local query ... 82
    4.2.5 Set management ... 82
    4.2.6 ClustalW ... 84
    4.2.7 Blast ... 84
    4.2.8 Phylip and Paup ... 86
    4.2.9 Data conversion ... 87
  4.3 Results of case studies ... 87
    4.3.1 Case study: the rediscovery of Erythrobacter litoralis ... 88
  4.4 Summary ... 89

CHAPTER 5: ONTOLOGICAL REPRESENTATION MODEL ... 91
  5.1 The MoG life sciences project and biomedical application ... 92
  5.2 Ontological representation model ... 93
    5.2.1 RDF, OWL, and DIG reasoner ... 94
    5.2.2 Generic service description ontology ... 97
    5.2.3 Service domain ontology ... 98
    5.2.4 MoG application domain ontology ... 99

  5.3 Implementation ... 102
  5.4 Conclusion ... 104

CHAPTER 6: IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW ... 106
  6.1 Introduction ... 106
  6.2 A hierarchical workflow structure ... 109
  6.3 An enhanced workflow system ... 113
    6.3.1 Knowledge management ... 116
    6.3.2 Knowledge discovery ... 120
  6.4 Translation process ... 120
    6.4.1 Service discovery and matchmaking process ... 120
    6.4.2 Knowledge reuse ... 122
    6.4.3 Implementation and evaluation ... 124
  6.5 Workflow reuse ... 126
  6.6 Related work ... 128
  6.7 Conclusion and future work ... 129

CHAPTER 7: SUMMARY AND FUTURE WORKS ... 131
  7.1 Summary ... 131
  7.2 Limitations and future work ... 132

APPENDIX A: GLOSSARY ... 135
  A.1 Pictures ... 137

APPENDIX B: MOGSERV MANUAL ... 141
  B.1 Main ... 141
  B.2 Retrieve genome and gene data from NCBI database ... 141
  B.3 Query local database ... 141
  B.4 Set management ... 142
  B.5 Data analysis services ... 143
  B.6 Job management ... 145

APPENDIX C: DEVELOPMENT AND DEPLOYMENT TOOLKITS ... 155

APPENDIX D: SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER 4 ... 157
  D.1 Complete genome sequence in XML ... 157
  D.2 Example of an ATP synthase subunit B sequence ... 159
  D.3 Protein name ... 160
  D.4 Syntax of search local database ... 160
  D.5 Workflow of retrieve sequence ... 160

  D.6 ClustalW input ... 163
  D.7 Blast ... 163
  D.8 PAUP ... 164

APPENDIX E: SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER 6 ... 168

BIBLIOGRAPHY ... 175

FIGURES

1.1 The evolution of the Web: yesterday's web was a repository for text and images; today's web is a platform to publish and access dynamically changing new types of content provided by a variety of services [8] ... 2
2.1 Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester ... 12
2.2 The web services standards stack includes multiple layered and interrelated open standards ... 13
2.3 Venn diagram representation of the integration of web services, grid computing, semantic web, and peer-to-peer technologies into the realization of a service-oriented architecture ... 17
2.4 A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes ... 18
2.5 Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates them into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters' needs ... 24
2.6 P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers ... 25
2.7 Summary of existing service discovery systems with different discovery mechanisms mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic nature ... 28
3.1 A manual phylogenetic data collection and data analysis process ... 50

3.2 MoGServ system architecture, including a services access client, the MoGServ middle layer, and other data and service providers ... 54
3.3 Asynchronous service and workflow invocation model ... 58
3.4 A workflow built using the Taverna workbench to get complete genome sequences and specific gene sequences ... 71
3.5 A workflow for querying two subset sequences from the local database, filtering out sequences coming from the same organism, and performing sequence alignment analysis ... 72
3.6 Abstraction of user-defined workflows ... 72
4.1 The growth of sequence databases (NCBI GenBank and EBI Swiss-Prot) and annotations. This figure is from Folker Meyer [57] ... 76
4.2 Entity relationship diagram of the data model in MoGServ created by SQL::Translator ... 90
5.1 An RDF graph model representing some information describing the MoG project web site ... 95
5.2 Main concepts and partial relationships defined in the MoG application domain ontology ... 101
5.3 The software components implementing annotation and querying of metadata ... 105
6.1 A four-level hierarchical workflow structure representation and transformation of scientific processes ... 109
6.2 An example illustrating the user-oriented workflow definition with different levels of knowledge ... 112
6.3 An enhanced workflow system with two added components, knowledge management and knowledge discovery ... 115
6.4 The mismatching problem may be introduced by inaccurate annotation, incomplete semantic annotation, and inaccurate ontological reasoning during the translation process ... 122
6.5 The creation process of the connectivity graph when a new service is added to the registry; the connectivity is refined and updated during the workflow translation process ... 124
6.6 The graph representation of a workflow describing a scientific process ... 128
A.1 Time line for the origin of life and major invasions giving rise to mitochondria and plastids [27] ... 137

A.2 Gene transfer to the nucleus [27] ... 138
A.3 Symbiosis process [69] ... 139
A.4 ATP synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny ... 140
B.1 The main menu of the MoGServ ... 142
B.2 A web interface providing users a way to define data of interest ... 143
B.3 Input the query term from this interface and choose the gene or genome database ... 144
B.4 The results from querying the local database ... 146
B.5 Users may copy and paste particular sequences and upload them to the local database ... 147
B.6 Set information ... 148
B.7 The set filter service is used to find the intersection of organisms among multiple sets ... 149
B.8 tblastn interface in MoGServ ... 150
B.9 ClustalW interface in MoGServ ... 151
B.10 Job management interface showing the status, input link, and output link of a job ... 152
B.11 An example input of a ClustalW analysis; the set id is a hot link, and users can view sequence information in the set ... 153
B.12 An example output of a ClustalW analysis; users can download, convert, and view the results ... 154
D.1 Phylogenetic tree generated from PAUP ... 166
D.2 Phylogenetic tree file generated from PAUP can be viewed by other programs ... 167
E.1 The WSDL description of the QueryLocal service hosted in MoGServ, which provides an operation to create a set in the local database. This operation accepts two parameters and returns the set id ... 170
E.2 One example of using the Taverna workbench to create, test, and run a workflow. This workflow accepts user input, searches the local database, creates a set, aligns the set using ClustalW, and converts the ClustalW result to NEXUS format, which can be fed to PAUP ... 171
E.3 XScufl workflow format representing the workflow created using the Taverna workbench ... 172

E.4 Annotation of job and set information using the defined ontological model. The sample RDF file is displayed using RDF Gravity ... 173
E.5 Annotation of a service using the defined ontological model. The sample RDF file is displayed using RDF Gravity ... 174

TABLES

2.1 SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES ... 20
2.2 EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES ... 33
2.3 LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES ... 36
3.1 ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION ... 56
6.1 PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS ... 125
6.2 PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS ... 126
C.1 OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT ... 156
D.1 NAME OF ATP SYNTHASE ... 161
D.2 SYNTAX OF SEARCHING LOCAL DATABASE ... 161
D.3 INDEXING FIELD OF LOCAL DATABASE ... 162

ACKNOWLEDGMENTS

I would like to thank Dr. Gregory Madey for his encouragement and guidance on my research. Thanks for his always saying "Life is short," and for his kindness, patience, and confidence in me. I appreciate his giving his students as much freedom as possible in selecting research topics in our best interests and seeking collaborative opportunities to help us fulfill our goals. His spirit of never stopping learning new material and never being afraid of exploring new research areas encouraged me on the way to finishing this dissertation and will encourage me in my future work. Thanks for his efforts in trying to educate us as independent researchers in numerous ways.

Many thanks go to Dr. Jeanne Romero-Severson for providing me with use cases and training in the biological field, and for her prompt feedback on my work. I would like to thank Dr. Amitabh Chaudhary for answering my questions about algorithms and for discussions about my research topics. I would like to thank my committee members Dr. Patrick J. Flynn, Dr. Aaron Striegel, and Dr. Jeanne Romero-Severson for their valuable contributions.

I would also like to thank my son for trying hard not to bother me too much while I was busy working and for giving me excuses to relax. Many thanks go to my husband, my parents, and my friends for their emotional support, always, no matter how much frustration I had.

This research work is partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund.

CHAPTER 1

INTRODUCTION

Since the first generation of the World Wide Web (the Web) appeared in 1990, it has mainly served as a repository for text and images presented in HTML format. Today, the Web is evolving into a platform for publishing and accessing dynamically changing new types of content provided by a variety of services that are realized with web-accessible programs, databases, and physical devices. Tim Berners-Lee et al. [8] present the evolution of the Web (see Figure 1.1); in the article "Creating a Science of the Web," the authors emphasize the importance of understanding the current, evolving, and potential Web.

The Web has been used in e-commerce and Business-to-Business (B2B) applications to deliver information and provide services to customers and business partners. For example, a travel agency provides services for travelers to view and compare airfares and to book tickets and hotels online. As service transactions between businesses increase, there is a demand for greater interoperability between these applications; the service-oriented architecture (SOA) has been proposed as an underlying architecture to meet this demand.

Although many, often non-standard, definitions exist, the service-oriented architecture (SOA) is commonly understood as a new architectural style that enables the combination of, and communication among, loosely coupled services. These services are described with a standard interface definition that hides the implementation language and platform of the services in a SOA. A service can be called to perform a task without the service having prior knowledge of the calling application, and without the application having or needing knowledge of how the service actually performs its tasks.

Figure 1.1. The evolution of the Web: yesterday's web was a repository for text and images; today's web is a platform to publish and access dynamically changing new types of content provided by a variety of services. Adapted from "Creating a Science of the Web" by Tim Berners-Lee et al. [8]

The realization of a service-oriented architecture is not tied to specific technologies and protocols. The web service standards, including SOAP, WSDL, and UDDI, have been widely accepted as a realization of SOA, with support from a number of tools. Therefore, the service-oriented architecture is often defined as services exposed using this web service protocol stack, and a SOA based system can accordingly be referred to as a system developed using these technologies. Building a SOA based system can help businesses respond more quickly and cost-effectively to changing market conditions. It promotes the reuse of existing legacy applications as services and simplifies the interconnection of distributed business processes inside organizations or across organizational boundaries.

As stated in the article [8], the Web has changed the ways scientists communicate, collaborate, and educate. The process of adopting the Web in the e-science field parallels its adoption in the e-business domain. The effort of building e-science infrastructure started with gateways or portals that provide access to integrated databases and computing resources behind a web-based user interface in multiple scientific fields, including social simulation, physics, environmental sciences, and bioinformatics. This infrastructure has been used to solve problems such as distributed analysis of physical or astronomical data and remote access to information sources and simulations. It facilitates the use of computational resources located at different physical sites, allowing users at different locations to easily share information and communicate with each other. More recently, the service-oriented architecture, in combination with semantic web, peer-to-peer (P2P) computing, and grid computing technologies, has been identified as a promising way to build such infrastructures for supporting e-science by providing access to heterogeneous computational resources and integration of distributed scientific and engineering applications developed by individual scientists and groups [91] [93].

With the promising future of adopting the service-oriented architecture in e-science and e-business, a number of challenges arise in terms of integrating independently developed data systems without requiring global agreement on terms and concepts, efficient allocation of computational resources, and security and privacy issues in accessing shared data resources. These challenges attract researchers from diverse areas such as information retrieval, database systems, artificial intelligence, software engineering, and distributed computing.

1.1 Main contributions of the dissertation

Our research work starts from an investigation of current research trends and challenges in the SOA area. In order to discover best practices for building SOA based systems, we demonstrate our design and implementation of a SOA based system to support scientific research and increase productivity. It serves as a prototype for our future research work in this field as well as an in silico investigation platform for scientists. A particular scientific domain, the study of the deep phylogeny of the plant chloroplast, is applied in this prototype. This application shows that a SOA based system can help scientists achieve a research goal that would be difficult, almost impossible, without it.

We conduct our research from both practical and theoretical aspects. We propose a hierarchical structure for workflows, integrating semantic web technology to improve workflow reuse. To address the security and resource allocation issues, we propose integrating the current system with an existing grid computing platform. The main contributions of this dissertation are:

A survey and analysis of current trends and research challenges in the service-oriented architecture: Grid computing, peer-to-peer (P2P) computing, and semantic web technologies are related to SOA. A recently proposed grid standard, the Open Grid Service Architecture (OGSA), built upon the service-oriented

architecture, demonstrates the convergence of grid computing with SOA. Semantic web technology is used in grid services and SOA to enhance the automation of scientific and engineering computational workflows. Applying P2P technology in SOA makes service discovery and enactment more scalable than centralized approaches. Much research has been done exploring the convergence of these technologies so as to make this new distributed computing paradigm successful. We present our investigation of the research issues and challenges in SOA. Our discussion of open issues and future research trends focuses on several critical aspects of SOA: service discovery, service composition, and service enactment.

A service-oriented data integration and analysis environment for in silico experiments and bioinformatics research: As more public data providers begin to offer their data in web service form to facilitate better data integration in the bioinformatics community, we designed and implemented a service-oriented architecture that integrates data and services to support a deep phylogenetic study. This software environment focuses on representing both data access and data analysis as web services. We believe that, with this common interface, it will be easy for other researchers interested in deep phylogenetic analysis to integrate our data and services into their applications. Based on a first prototype, we discuss several implementation issues and indicate possible integration with semantic web and grid computing technologies to address these limitations. We present a practical experiment of building a service-oriented system upon current web services technologies and bioinformatics middleware. The system allows scientists to extract data from heterogeneous data sources and to generate phylogenetic comparisons automatically. This can be difficult to accomplish using manual search tools since sequence data is rapidly

accumulating and the process can be long and tedious.

An application for exploring the deep phylogeny of the plastids with the SOA based system: To serve as an example and proof of concept that the service-oriented architecture can help scientists increase their productivity and solve more complex problems than is possible with traditional approaches, we apply several use cases to the system. We detail the services provided in this environment and illustrate the results, which demonstrate that the environment can support scientific analysis and enable new discoveries.

A methodology and a novel approach to facilitate the reuse of workflows and the composition of services: Most current practical methodologies for creating workflows rely heavily on users having complete knowledge and understanding of individual services at a low level of description. Using semantic web technology, services can instead be described with rich semantics, and recent research has focused on supporting users in the discovery and composition of services through such annotations. Users can encapsulate a service in a workflow to achieve particular goals based on the conceptual service definition, in semi-automatic or automatic ways. In the semi-automatic approach, users discover and select appropriate services to include in a workflow based on the semantic and conceptual service definitions. This relieves bioinformatics researchers of the requirement to have detailed knowledge and understanding of each tool, service, and data type; instead, more complex middleware assists with the composition process and resolves incompatibilities between services. Few approaches, however, consider the complete or partial reuse of existing workflows. We present a hierarchical workflow structure

with a four-level representation of workflow: abstract workflow, concrete workflow, optimal workflow, and workflow instance. This four-level representation provides more flexibility for the reuse of existing workflows. We believe that reuse of complete or partial workflows takes advantage of verified knowledge learned in practice and can increase the soundness of the composed workflow. We propose an ontological representation model of data and services as well as an approach that uses a graph matching algorithm to find similar workflows through semantic annotation.

1.2 Organization of the dissertation The rest of this dissertation is organized as follows: Chapter 2 introduces several concepts and technologies related to SOA and discusses related research issues and challenges. Chapter 3 presents the design and implementation of a SOA-based system for supporting bioinformatics research. Chapter 4 demonstrates a particular application that uses this system to discover new phylogenetic knowledge. Chapter 5 presents an ontological model for annotating services and data; this semantic enrichment makes reuse, sharing, and search experiments easier to conduct. Chapter 6 proposes a methodology and a novel approach that can facilitate the reuse of workflows and the composition of services. Chapter 7 summarizes the dissertation and identifies potential future work.

CHAPTER 2 RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED COMPUTING

2.1 Introduction The evolution of computing systems has progressed through monolithic, client-server, and 3-tier to N-tier architectures. The N-tier architecture layers request and response calls among applications that may reside on multiple sites. Service-oriented computing (SOC), a term frequently used interchangeably with service-oriented architecture (SOA), involves the service layers, functionality, and roles described by SOA [70]. SOA can be considered a conceptual description of a concrete implementation of a service-oriented computing infrastructure. It is an emerging paradigm for distributed computing intended to enable systematic application-to-application interaction. Services are the basic units of a service-oriented computing platform. They are autonomous, platform-independent software components that can be described, published, discovered, invoked, and composed using standard protocols within and across organizational boundaries. A service is a piece of work done by a service provider in order to deliver desired results to a service requester. Service providers and requesters are roles played by software agents on behalf of their owners. The goal of this new distributed computing architecture is to enable interaction among loosely coupled software agents in a flexible and effective way.

SOC has been adopted in portal design, e-commerce, e-science, legacy system integration, and grid computing. One example is the integration of engineering design processes, such as automobile and aircraft design, which typically involve several partners located at different sites. These partners may be both cooperative and competitive. Successful engineering design requires well-coordinated interactions between individuals or teams in specialized knowledge domains, information exchange, models, and integration to achieve an optimal goal. However, a significant part of the design models and tools may contain proprietary information that cannot be disclosed. Also, these models and tools are normally written in a variety of programming languages and run on different platforms. With service-oriented computing technologies, these models and tools can be treated as black boxes and run at their original locations [5] [43]. Reusability, interoperability, security, and easy maintenance are major potential benefits of SOC. Reusability: services provide a higher-level standard abstraction that allows the reuse of existing software. Interoperability: the standard abstraction of services enables the interoperation of software produced by different programmers and improves productivity. Security: with the standard abstraction of services, software can be viewed as a black box; the internal implementations or algorithms are not accessible to competitive partners. Maintenance: with the standard abstraction of services, changes to the underlying implementation will not adversely impact the use of the services.

While the potential benefits of SOC are compelling, successful service-oriented implementation requires solving several issues and challenges arising from these promising features. These issues and challenges include service discovery, service composition, and service invocation; monitoring the execution of services; methodologies supporting service development, evaluation, and life-cycle management; and approaches to guarantee the quality, security, and reliability of services. These challenges attract researchers from diverse research areas such as information retrieval, database systems, artificial intelligence, software engineering, and distributed computing. In this chapter 1, we introduce several concepts and technologies related to SOC and discuss related research issues and challenges.

2.2 Overview of related concepts and technologies Several definitions of SOA are available; the W3C defines SOA as a form of distributed systems architecture with the following properties [105]: The service is an abstracted, logical view of actual programs, databases, and business processes. A service or a function is described using a description language. Services tend to use a small number of operations with relatively large and complex messages. Services tend to be oriented toward use over a network. 1 Portions of this chapter appear in A semantic web services enabled web portal architecture, International Conference on Web Services (ICWS 2004) [108]

Messages are sent in a platform-neutral, standardized format, such as XML, through the interface. XML is the most obvious format. The service is implemented as a software agent. The service is formally defined in terms of the messages exchanged between provider agents and requester agents, not in terms of the properties of the agents themselves. By avoiding any knowledge of the internal structure of an agent, one can incorporate any software component or application that can be wrapped in message-handling code that allows it to adhere to the formal service definition. There are two fundamental components in a basic service-oriented architecture, as shown in Figure 2.1. A service requester on the right sends a service request message to a service provider on the left. The service provider returns a response message to the service requester. The request and subsequent response connections are defined in a way that is understandable to both the service requester and the service provider.

2.2.1 Web services Although there is no standard definition of web services, a web service is generally considered one type of realization of SOA. Among various definitions, we refer to the definition from the W3C: A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards 2. 2 http://www.w3.org/tr/ws-arch/
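As a concrete illustration of the definition above, the sketch below builds a SOAP 1.1 request envelope and posts it over HTTP using only the Python standard library. The endpoint URL, namespace, and operation name (GetSequence) are hypothetical placeholders rather than a real service; in practice such client code would be generated from the provider's WSDL description.

```python
# Minimal sketch of a SOAP 1.1 call. The endpoint, namespace, and the
# GetSequence operation are hypothetical examples, not a real service.
import urllib.request

ENDPOINT = "http://example.org/soap/sequence"  # hypothetical service address

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetSequence xmlns="http://example.org/ns">
      <accession>AB123456</accession>
    </GetSequence>
  </soap:Body>
</soap:Envelope>"""

def call_service(endpoint, body):
    """POST the XML envelope and return the provider's raw XML response."""
    request = urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": '"http://example.org/ns/GetSequence"'},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Because the service is defined by the messages exchanged rather than by the implementation, the same envelope could be answered by a provider written in any language on any platform.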

Figure 2.1. Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester.

Concrete software agents that implement an abstract service interface can be written in different programming languages and can run on different platforms. Since these concrete agents implement the same function defined in the abstract interface, any change of the underlying implementation will not affect the use of the service. A web service architecture is based upon many layered and interrelated open standards and web technologies, as shown in Figure 2.2. The Web Service Description Language (WSDL) defines the abstract interface of services. The Simple Object Access Protocol (SOAP) is a protocol for exchanging messages between requesters and providers. Universal Description, Discovery and Integration (UDDI) provides a standard registry for the publishing, discovery, and reuse of web services. WSDL, SOAP, and UDDI are core standards built on fundamental web technologies such as XML, TCP/IP, and FTP. There are also emerging standards proposed for defining business, scientific, or engineering processes,

transactions, and security, e.g., BPEL4WS, WS-I. Two main styles of web services are available: SOAP web services and REST (Representational State Transfer) 3 web services. In this dissertation, we use the term web services to mean SOAP-style web services. 3 http://www.xfront.com/rest-web-services.html

Figure 2.2. The web services standards stack includes multiple layered and interrelated open standards.

2.2.2 Semantic web The vision of the semantic web is to represent units of web-based information with well-defined and machine-understandable semantics so that intelligent software agents can autonomously process them [7]. This information, including the

abstract descriptions of services, must be defined and linked in such a way that it can be used for automation, sharing, integration, and reuse even when these software agents are designed, developed, and owned by different groups or individuals. SOA, and more specifically web services, becomes a key component in realizing the vision of the semantic web, since most web sites on today's web do not merely provide static information but allow users to interact and generate dynamic information through services. To make use of a web service, a software agent needs a computer-interpretable description of the service. Adding meaningful descriptions to the interface using semantic web technology can avoid ambiguous interpretations of information and service descriptions and increase the soundness of the results provided by service providers. The combination of these two technologies results in the emergence of a new generation of web services called semantic web services [54]. The proposed standards for knowledge sharing and reuse in the semantic web range from the Resource Description Framework (RDF) to the Web Ontology Language (OWL) [67]. Both standards have become W3C recommendations. The appearance of open source tools that support creation, parsing, and reasoning with these standards makes the addition of semantic web technology to SOC feasible.

2.2.3 Grid computing Grid computing [32] is a computing platform intended to integrate resources (both data and computational resources) from different organizations, called virtual organizations, in a shared, coordinated, and collaborative way to solve large-scale science and engineering problems. The Globus toolkit [97] is one implementation of the specifications for grid computing. It has become the

standard for grid middleware. The Open Grid Services Architecture (OGSA), built upon the service-oriented architecture, describes a service-oriented grid computing environment for business and scientific use, developed within the Global Grid Forum (GGF). OGSA is based on several other web service technologies, notably WSDL and SOAP. It is a distributed interaction and computing architecture based around services, assuring interoperability on heterogeneous systems so that different types of resources can communicate and share information. The major goal of the grid computing platform is to provide an easy-to-use and flexible computing infrastructure for supporting e-science. The goal of e-science is to offer scientists and engineers an effective way to generate, analyze, and share their experiments, data, instruments, computational tools, and results. Seamless automation of the scientific process remains a major gap between the vision and reality. Grid computing shares some of the problems and technical challenges of service-oriented computing in general. Incorporating semantic web technologies into grid computing brings us a new concept, the semantic grid [21], which intends to minimize this gap and solve the problem of achieving seamless integration and automation of scientific and engineering workflows.

2.2.4 Peer-to-peer computing Peer-to-Peer (P2P) computing has received significant attention due to the popularity of P2P file sharing systems such as Napster, Gnutella, Freenet, Morpheus, BitTorrent, and KaZaa. Peers are autonomous agents that exchange information in a completely decentralized manner. A P2P architecture does not have

a single point of failure. Since nodes contact each other directly, the information they receive is up to date. The P2P model can provide an alternative for discovering services dynamically without relying on centralized registries. The P2P model also provides an alternative for interaction between web services. We discuss the research done in this direction in the following sections. Semantic web technology enhances the capability for automation in SOA and grid computing. Grid computing built upon SOA increases flexibility. The P2P computing model increases scalability and reliability. Figure 2.3 gives an overview of current research trends that intend to use these technologies together.

2.3 Issues in service-oriented computing Figure 2.4 shows the service publication, service discovery, and service invocation stages in the life cycle of a service. This process involves three roles in the SOA: service provider, service requester, and service discovery system. Service providers create services and provide platforms to execute these services. Service requesters query the service discovery system to find appropriate services. To enable service requesters to find services, service providers need to publish their service interfaces in a publicly available location. Specifying the capability and quality of services and finding a matched service based on these descriptions are usually done as two separate activities. The more information that is given for describing services, the more accurate are the matched results that are returned. Services can be categorized into simple services (atomic services) and complex services (composite services). Generating and executing a composite service to solve a complicated problem is an important feature leading to the adoption of SOA.
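The interplay between providers and requesters mediated by a discovery system can be sketched with a toy keyword-based registry: providers publish, requesters discover. The service names and keyword sets below are invented for illustration; a production registry would follow the UDDI standard rather than this in-memory dictionary.

```python
# Toy in-memory service registry illustrating the publish/discover roles.
# Service names and keyword sets are invented for this sketch.
registry = {}  # service name -> set of advertised keywords

def publish(name, keywords):
    """A provider advertises its service under a set of keywords."""
    registry[name] = set(keywords)

def discover(query_keywords):
    """Return service names ranked by the number of matching keywords."""
    query = set(query_keywords)
    hits = [(len(query & kws), name) for name, kws in registry.items()
            if query & kws]
    return [name for _, name in sorted(hits, reverse=True)]

publish("BlastService", ["sequence", "alignment", "blast"])
publish("ForecastService", ["weather", "forecast"])
print(discover(["sequence", "alignment"]))  # ['BlastService']
```

The limitations of exactly this kind of keyword matching, and the semantic alternatives to it, are taken up in Section 2.3.2.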

Figure 2.3. Venn diagram representation of the integration of web services, grid computing, the semantic web, and peer-to-peer technology into the realization of a service-oriented architecture.

In the following sections, we discuss several active research issues in SOA: service description, service discovery, service composition, and service execution.

2.3.1 Service description One requirement of the service-oriented architecture is to provide meaningful descriptions for services so that software agents can understand their features and learn how to interact with them. A service description gives a formal representation of the properties of a service. These properties can be classified into functional and non-functional properties. Functional properties contain the details of a service interface and service

behavior, including data types, operations, transport protocol information, and binding address. WSDL is the first W3C standard that is widely used for service descriptions.

Figure 2.4. A common service life cycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes.

There may be multiple service providers who offer the same functionality defined in a service interface. Determining and choosing the best service therefore becomes important for service requesters. The information in WSDL descriptions is not sufficient for ranking the best services. Non-functional properties, including specifications of the cost, performance, security, and trustworthiness of a service, are introduced for measuring the Quality of Service (QoS). There are many aspects of QoS, which can be organized into categories with a set of quantifiable parameters [75]. The best service may have different meanings for different requesters. One may prefer security over cost while another may prefer lower cost over performance. Measurements of these non-functional properties can be achieved using statistical analysis, data mining, and text mining technologies. This is normally done by a third party through the collection of subjective evaluations from requesters. This information changes dynamically over time.
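Requester-specific ranking over non-functional properties can be sketched as follows. The candidate services and their QoS values are invented for illustration, and the linear scoring function is only one of many possible aggregation schemes.

```python
# Hypothetical QoS data for two functionally equivalent forecast services.
candidates = {
    "ForecastA": {"cost": 0.10, "response_time": 2.0, "reliability": 0.99},
    "ForecastB": {"cost": 0.02, "response_time": 5.0, "reliability": 0.95},
}

def score(qos, weights):
    """Simple linear utility: reward reliability, penalize cost and latency."""
    return (weights["reliability"] * qos["reliability"]
            - weights["cost"] * qos["cost"]
            - weights["response_time"] * qos["response_time"])

# A requester who values fast responses more than low cost:
speed_first = {"cost": 0.1, "response_time": 1.0, "reliability": 1.0}
best = max(candidates, key=lambda name: score(candidates[name], speed_first))
print(best)  # ForecastA
```

A cost-sensitive requester would simply supply a different weight vector, so the "best" service changes per requester, as the text observes.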

Pure syntactic descriptions of services require requesters to fully understand the capability of a service before using it. The selection of a web service from among several with similar WSDL descriptions requires more information than WSDL actually defines. The semantic web, supported by the use of ontologies, is likely to provide better qualitative and scalable solutions to overcome these issues. There are two directions for enhancing the semantics of web service descriptions (see Table 2.1). 1) Enhance the WSDL description. The Semantic Annotations for Web Services Description Language Working Group [81] has the objective of developing a mechanism to enable annotation of web service descriptions. This mechanism takes advantage of the existing WSDL standard (WSDL 2.0) to build simple and generic support for semantics in web services. Some systems [54] [55] define an ontology for web services using emerging languages such as DAML+OIL and OWL. 2) The W3C recently proposed OWL-S to provide an ontological description of web services using OWL. OWL-S enables description not only of the functional properties of a service but also of the non-functional properties. This domain-independent service ontology is augmented by domain-specific ontologies in real applications. Enhancing service descriptions with ontological representations increases the cost and complexity of service annotation in several respects. Creation of domain ontologies: The use of ontologies is considered the most promising basis for defining the semantics of objects and allowing meaningful information exchange among machines and humans. A commonly used definition of an ontology is a specification of a conceptualization [40]. An ontology is intended to give a concise, uniform, and declarative description of information and knowledge that is interesting and useful to a community

of users, using a common vocabulary and language.

TABLE 2.1 SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES

Description method | Representation | Challenges
syntactic | WSDL | No representation of non-functional properties; not sufficient for representing meaningful descriptions; no representation for processes; supports only keyword search
semantic | domain ontology + WSDL | No representation for processes; complexity of service annotation
semantic | domain ontology + OWL-S | Complexity of service annotation

Construction of a knowledge base involves investigating a particular domain, determining important concepts in that domain, and creating a formal representation (ontology) of the objects and relations in the domain. A general ontology represents a broad selection of objects and relations at a higher level of abstraction [79]. Miller et al. [59] investigate ontologies for simulation modeling. Christley et al. [15] present an ontology for agent-based modeling and simulation. An ontology is normally defined and revised (if needed) by an authority. Usually the authority needs to collaborate with experts in the domain before or during the process of creating formal representations. Large-scale ontologies can be constructed by publishing a prototype ontology for the research community. The Gene Ontology (GO) Consortium produces

a controlled vocabulary for classifying gene product attributes, molecular functions, cellular components, and biological processes [35] in the biological sciences. It contained 17838 terms as of September 27, 2004, and 22742 terms as of March 11, 2007. Integration of ontologies: Vast amounts of information may come from many different ontologies. For this reason, and because many heterogeneous data repositories are developed by different research groups and reside at different research institutes and organizations, it is impossible to process this information and data without knowledge of the semantic mappings between them. Much research has explored the mapping and matching of concepts and the integration of different ontologies using sophisticated algorithms and AI techniques such as machine learning [25] [62]. There are two approaches to ontology integration. One approach integrates the different ontologies developed by different groups for data representation into a common global ontology. While this approach makes information correlation in query processing easier, it increases the complexity of integrating the ontologies and maintaining consistency among concepts. The other approach is interoperation across different ontologies via terminological relationships between terms instead of integration into a global ontology [56] [66]. Interontological relationships are specified using description logics in an interontological relationships manager to handle vocabulary heterogeneity between ontologies. Although this approach increases scalability, extensibility, and maintainability, it shifts the burden to the interoperation mechanisms. Annotation of services: The annotation of services using ontologies is generally done manually. It is a complex process since there may be multiple ontologies related to a single service. These ontologies may be developed by different groups. Each group may represent the same concept using different vocabulary, or different concepts may be represented using the same vocabulary. Some systems, such as MWSAF (the Meteor-S web service annotation framework) [71], provide graphical tools that enable users to annotate existing web service descriptions with ontologies in a semi-automatic way using AI technologies such as machine learning. The IBM ETTK [30] technology provides a set of toolkits, including a graphical editor, for annotating services compatible with WSDL-S.

2.3.2 Service discovery Without prior knowledge of a service, service requesters may not know the location or even the existence of the services they desire. A goal of the service discovery process is to find the services that are best suited to the requirements of the requester. A basic service discovery process can be described as follows. 1. Service providers provide descriptions of their services and advertise these services in a service registry. A service registry is a service discovery system that consists of mechanisms for efficiently searching for appropriate services and physical space for storing the characteristics of services. UDDI is a registry standard. 2. Service requesters request desired services using keywords or complicated query languages.

3. A service discovery system accepts requests from requesters. It searches the service descriptions in its database and tries to find services that match the requests. This process is also called matchmaking. As the number of web services grows, new registries appear as needed. A service may be registered in several registries. A service discovery broker accepts requests from service requesters, translates the requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters' needs (see Figure 2.5). In this mechanism, the broker may issue the request to multiple registries in parallel; however, there is still a communication bottleneck at the broker, and a single point of failure may occur. An alternative to the centralized discovery mechanism is a P2P-based discovery mechanism. In this approach, each service provider acts as a peer in the P2P network. Each provider has its own way to store information about other service providers, called neighbors, and provides the resources to relay or pass information through. A network resembling a social network is eventually formed. In the discovery process, a requester queries its neighbors in search of a desired service, and the query propagates through the network until a suitable service is found or the query terminates [105]. This approach provides higher reliability than a centralized approach. It avoids the single point of failure and the latency of providing up-to-date descriptions for updated services. However, since each service provider is a peer, a huge peer community may result in inefficient search. Instead of treating every provider as a peer, each registry can act as a peer in the network to overcome this problem. Much research has been done toward realizing a P2P discovery mechanism. Schmidt and Parashar present a P2P-based keyword web service discovery system on the

Chord overlay network [82] 4.

Figure 2.5. Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates them into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters' needs.

In this system, a set of keywords is extracted from the web service descriptions. These web service descriptions are indexed using these keywords. The index is stored at the peers in the P2P system. Each web service description is mapped to the index space. The underlying node joins, departures, failures, and data lookups are built upon the Chord network's lookup protocol. Speed-R [88] is a JXTA-based P2P network system supporting semantic publication and discovery of web services. In this system, each service registry is controlled by a peer. Dogac et al. [26] describe a way to expose the semantics of web service registries and connect the service registries through a P2P network for the travel industry. 4 http://en.wikipedia.org/wiki/chord project A general P2P discovery system (see Figure 2.6) contains

a data layer, a communication layer, and peers that control registries or service providers. The data layer can be formed by registries or service providers. Communication layers are implementations of P2P networks, such as JXTA and Chord. Semantically enriched services and registries make possible the automation of service discovery and of the discovery of service registries.

Figure 2.6. P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers.

The traditional service discovery method, static or manual discovery, relies on human intervention: a person uses a discovery system to locate and select a service description that meets the desired criteria at design time. The dynamically changing service environment requires service discovery that a software agent can perform at run time. The realization of such a dynamic discovery mechanism needs machine-processable semantics to describe services. The implementation and performance of a service discovery system depend on the information available in service descriptions. The more information the

system can gather, the more accurate are the results the system can give back to the requester. The implementation also depends on the kind of query that can be given by the requester. Two examples are: "give me a forecast service" and "give me a forecast service with the fastest response time". For the first query, a simple keyword-based discovery system is sufficient. For the second query, the discovery system needs to gather quality-of-service information, find several forecast services in registries, and rank them by response time. The service discovery problem is related to the information retrieval problem. Two key quality measurements in information retrieval are also applicable when evaluating the performance of service discovery systems [45]. Recall is the number of relevant items retrieved, divided by the total number of relevant items in the collection. Precision is the number of relevant items retrieved, divided by the total number of items retrieved. The discovery mechanism in the traditional UDDI standard, which supports only static service discovery, has been recognized as insufficient. This mechanism often returns no result at all, or many irrelevant results, because keywords are a poor method for capturing the semantics of a request. Synonyms (syntactically different words that have the same meaning) and homonyms (the same word having different meanings in different domains) cannot be distinguished in keyword-based retrieval. Also, relationships between different keywords in a request cannot be captured. This mechanism therefore offers low retrieval precision and recall. WordNet [102] has been used to handle synonyms and to employ an information retrieval model in the service retrieval process [99] so as to improve precision and recall. WordNet is a lexical reference system developed by the Cognitive Science Laboratory at Princeton University. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept, and the synonym sets are linked with different relations. WordNet is distributed as a data set. However, WordNet only supports queries over common words; vocabularies for a particular domain are most likely not included in WordNet. With rich formal semantic descriptions added to web services, a service discovery system can provide more accurate results with high precision and recall. Such descriptions also reduce human interference in the discovery process and make dynamic discovery possible. Therefore, semantic web technologies have become a solution for this matchmaking process [47] [26] [28]. In the meantime, quality of service has become an interesting criterion for selecting optimal services from a subset of services offering the same functionality the requester asked for [10] [53] [48] [19]. Two types of semantic description result in two types of semantic discovery system: (1) adding semantics to the current web services standards (UDDI and WSDL) [85]; (2) using DAML-S and OWL-S to represent both the functional and non-functional properties of web services, which enables software agents or search engines to automatically find appropriate web services via ontologies and reasoning-enriched methods. However, the high cost of formally defining heavyweight and complicated services makes adoption of this improvement unlikely at the current stage. Figure 2.7 shows, along three dimensions, the service discovery mechanisms currently used in implementations of service discovery systems. A is a keyword-based system, such as traditional UDDI. B is a semantically enriched UDDI system [85]. C is a keyword-based P2P system [82]. D is a semantic-based system with DAML-S or OWL-S [47] [26] [28]. E is a semantic-based system on a P2P network

[88] [26].

Figure 2.7. Summary of existing service discovery systems, with different discovery mechanisms mapped relative to three characteristics: degree of decentralization (centralized vs. decentralized/P2P), richness of service descriptions (keyword-based vs. semantic-based), and static vs. dynamic discovery.

The research challenges residing in the service discovery process suggest a way to integrate semantic and P2P technologies for building a discovery system. Such a system should allow automatic service discovery and provide high precision and recall at the same time; however, the cost of implementing this system makes it hard to adopt at this time.
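The recall and precision measures defined above can be made concrete with a short sketch. The service names below are hypothetical, chosen to mirror the forecast-service query used earlier in this section.

```python
def recall_and_precision(retrieved, relevant):
    """Compute recall and precision for one discovery query.

    retrieved: set of service ids returned by the discovery system
    relevant:  set of service ids that actually satisfy the request
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical keyword query for "forecast" services: one relevant
# service is missed, and one irrelevant service is returned.
relevant = {"WeatherForecast", "ClimateForecast", "StormForecast"}
retrieved = {"WeatherForecast", "StormForecast", "ForecastNewsFeed"}

r, p = recall_and_precision(retrieved, relevant)
print(r, p)  # recall = 2/3, precision = 2/3
```

A keyword-only system tends to lower both numbers at once: homonyms inflate the retrieved set (hurting precision) while unmatched synonyms shrink the hits (hurting recall).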

2.3.3 Service composition

One of the most attractive features of service-oriented computing is that atomic services can be combined into a larger application to solve complicated problems. The orchestration of a set of services to accomplish a larger and more sophisticated goal is called a workflow. In the business world, a workflow is referred to as a business process. In the scientific domain, a workflow is sometimes referred to as a scientific process. Several different approaches and platforms are being developed to achieve the common goal of web service composition. These approaches range from adoption of industry standards to adoption of semantic web technology, and from manual or static composition to automatic dynamic composition [90]. Since there is no standard service composition specification, each approach and platform defines its own way of composing services, provides its own specifications and languages, and executes the workflow on a specific workflow execution engine. Current solutions for web service composition include the adoption of industrial standards, semantic web technologies [86] [29] [41], web components [111], Petri nets [112], and so on. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58]. Adoption of industrial standards and adoption of semantic web technologies are the two most active research areas among current service composition mechanisms. Both of these mechanisms support complex process activities, such as sequences, branching, etc. Current industrial standards include WSDL, UDDI, SOAP, and a set of workflow specification languages (BPEL4WS, WSFL, BPML, WSCI, and XLANG)

used to support data flow and control flow representations [98]. Among all of these specifications, BPEL4WS is the most mature and the most widely supported by industry and the research community. Service compositions described in the BPEL4WS format may be deployed on execution engines such as BPWS4J [11] and the Collaxa BPEL server [17]. The other approach is based on semantic web technologies and AI planning techniques [84] [13]. In this model, services are semantically annotated with RDF/RDF Schema, DAML-S, or OWL-S. The objective is to enable automation of web service discovery, invocation, composition, and execution. However, at the current research stage there is limited implementation and product support for generating service descriptions automatically.

Most service composition models require application developers to possess complete knowledge of the available services and the exact process logic; it is up to the developer to choose a particular service at each step. Adoption of semantic web technologies makes automation of the composition process possible. There are two types of automation, semi-automatic and automatic, and both require the existence of a domain ontology. A typical system [84] using the semi-automatic method maintains a knowledge base containing an ontology of services, such as DAML-S or OWL-S. A matchmaker is used to find a service with the required functionality. At each step, all the candidate services that meet the requirement are presented to the user, ranked by quality; the user makes a choice and continues the process. A typical system using the automatic method is often built on AI planning technology [13]. The composition process starts from an explicitly defined goal. The workflow composition engine lets the service requester provide the input and output information. This information is

fed into an AI planner. The planner returns one plan, multiple plans, or no results to the end user for a further decision. Although the service composition problem is highly related to the AI planning problem, current planning technologies cannot be directly applied [90]. Services change dynamically and may fail during execution; a composed workflow that does the desired work at one time may not work at another time. Preventing run-time failure at design time is important. An issue in the automatic composition of web services is defining the compatibility [55] or connectivity [58] of services. It can be a time-consuming process to check whether the services to be composed can actually interact with each other; for example, the output of one service must be a required input of the subsequent service in a workflow. It also requires a way to verify the soundness and correctness of the composite services. Much research has been done to explore the use of AI planning techniques for automation of the composition process. It is still an open research problem whether it is possible to use or extend current planning techniques in the service composition process and in the modeling of services. The application used most often to motivate research in automatic service composition is a virtual travel agent example; the motivations typically lack a real world example. This approach may now be practically used in domains with well-defined ontologies and a small number of available services. We believe the semi-automatic approach is more practical when a large number of services exist in the domain.
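The connectivity check described above, verifying that each service's required inputs are produced by some upstream service, can be sketched as a simple forward-chaining search. The service descriptions here are hypothetical stand-ins for semantically annotated services, not any real registry's contents.

```python
# Hedged sketch of automatic composition by input/output matching.
# Each service is (name, required inputs, produced outputs); all hypothetical.
SERVICES = [
    ("FetchGenome",  {"taxon_id"},   {"genome_seq"}),
    ("ExtractGenes", {"genome_seq"}, {"gene_seqs"}),
    ("AlignSeqs",    {"gene_seqs"},  {"alignment"}),
    ("BuildTree",    {"alignment"},  {"phylo_tree"}),
]

def compose(available_inputs, goal_outputs, services):
    """Forward-chain services whose inputs are already satisfied,
    until the goal outputs are producible. Returns a plan or None."""
    known = set(available_inputs)
    plan = []
    progress = True
    while progress and not goal_outputs <= known:
        progress = False
        for name, ins, outs in services:
            if name not in plan and ins <= known:
                plan.append(name)   # service is connectable: add it to the plan
                known |= outs       # its outputs become available downstream
                progress = True
    return plan if goal_outputs <= known else None

print(compose({"taxon_id"}, {"phylo_tree"}, SERVICES))
# -> ['FetchGenome', 'ExtractGenes', 'AlignSeqs', 'BuildTree']
```

This only captures syntactic connectivity (matching names of inputs and outputs); the semantic matchmaking discussed in the text would additionally check that the matched types are compatible under a domain ontology.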

2.3.4 Service execution

Service execution is the process in which an atomic service or a composite service is invoked and results are returned to requesters. Atomic web services can be created in different languages and deployed on various platforms; two major platforms are J2EE and .NET. Since execution of an atomic service does not require results from other services, the technologies to support atomic services are relatively mature (see Table 2.2). Service execution for composite services depends on the composition model and the available execution engine support. The industrial standard based model can be translated to a particular workflow specification, such as BPEL4WS, and executed on a workflow engine. The semantic web based model can be represented using the DAML-S specification and executed on a DAML-S Virtual Machine [84] or an OWL-S execution engine. Since there is no standard service composition specification, each composition approach and platform provides its own specifications and languages for composite services and executes the workflow on a specific workflow execution engine. There are also composition toolkits that convert a visual graph composition of services into a language-specific workflow. Several issues exist in the service execution process.

Synchronous vs. asynchronous communication. Web service technology is message passing oriented; the architecture should be able to support different message passing methods. Most service-oriented frameworks, such as Axis [3], only provide support for synchronous invocation, which blocks the requesting process until the response from the service provider arrives. The loosely coupled nature of web services requires more flexible invocation methods. The requester should not be blocked while it is waiting for the response from

providers.

TABLE 2.2
EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES

Service type      | Specification | Execution Engine
------------------|---------------|-----------------------------------------------
Atomic service    | WSDL          | Implemented using Java, C++, Perl, Python on .NET, J2EE, gSOAP, SOAP::Lite
                  | OWL-S         | OWL-S execution engine
Composite service | BPEL          | BPWS4J
                  | OWL-S         | OWL-S execution engine
                  | DAML-S        | DAML-S virtual machine
                  | XScufl        | Freefluo

Various research has been done to support this asynchronous communication method [107] [113].

Centralized vs. decentralized execution of composite web services. Although most composite service execution engines invoke each individual atomic service on distributed service providers, the engine acts as a centralized coordinator for all interactions among these atomic services. Decentralized execution allows independent sub-workflows to interact with each other without any centralized control, which can reduce the amount of network traffic. Mangala Gowri Nanda et al. [60] present an algorithm that partitions a composite service in BPEL into independent sub-processes; each service provider hosts a BPEL engine. Their experimental results show that decentralized execution can increase throughput substantially. Roger Weber et al. [100] present a peer-to-peer based execution system. In this system,

when a node finishes its part of the work, the data is migrated to nodes offering a suitable service for one of the next steps in the process. Boualem Benatallah et al. [6] present an environment where a composite service can be executed in a decentralized way within a dynamic environment.

Monitoring service and workflow execution. One issue in service execution is that a selected service in the workflow may be unavailable or temporarily off-line. The execution engine then invokes an alternative service if one was defined in the workflow at the service composition stage. Service execution often takes time to complete, so service requesters may require a monitoring service with which they can query the status of their requested services. Monitoring the service execution status is thus another important issue. The experience gained in grid computing research may be adopted in SOA for building a reliable infrastructure for service execution.

2.4 Service-oriented computing in e-science

An individual life sciences researcher or research group starts a scientific project by developing hypotheses, designing experiments to test those hypotheses, collecting observational data, and publishing results. The published data allows other researchers to build upon or verify the results. With the assistance of computer software, users can import raw data, click buttons, and retrieve results. The analysis process, however, requires knowledge of how to use these toolkits and how to access data in different locations. Even for users who possess this knowledge, this manual analysis process is a bottleneck when large data sets are involved. As the World Wide Web becomes a platform for scientific study (e-science), research data can be published on the web to be shared with other

researchers. These data can be distributed in various formats (such as RDBMS tables, text files, or XML documents) depending on the preferences and needs of research groups. Manually accessing these data files becomes difficult because the data may come from different institutes and different research groups, and in different formats. There is a need for a methodology that frees users from having to locate the data sources, interact with each data source, and manually combine data in multiple formats from multiple sources. Applying semantic web and web services technologies to support life sciences research is a promising solution to this difficulty. As the adoption of web services in the life sciences field grows, many large public resource sites are publishing web service interfaces in WSDL format to make their data and analysis tools accessible to the research community; see Table 2.3.

TABLE 2.3
LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES

Service Provider | Description | Resources URL
NCBI (the National Center for Biotechnology Information) | Provides a variety of E-Utility web services to allow data retrieval against the NCBI database using WSDL and SOAP | http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html
EMBL-EBI (the European Bioinformatics Institute) | Provides a number of web services for data retrieval, data analysis tools, and ontology lookup using WSDL and SOAP | http://www.ebi.ac.uk/tools/webservices/
DDBJ (the DNA Database of Japan) | Provides web services for data retrieval and data analysis against the DDBJ database using WSDL and SOAP | http://xml.nig.ac.jp/index.html
KEGG (the Kyoto Encyclopedia of Genes and Genomes) | Provides web services for data retrieval and data analysis against the KEGG database | http://www.genome.ad.jp/kegg/soap/
SeqHound | Provides web services for data retrieval from the sequence and structure database | http://www.blueprint.org/seqhound/seqhound_documentation.html
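To give a flavor of what calling such a WSDL/SOAP service involves at the wire level, the sketch below builds a minimal SOAP 1.1 envelope by hand. The service namespace and the `getSequence` operation are hypothetical; a real client would normally generate the message from the provider's WSDL rather than construct it manually.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace and operation; in practice these come
# from the provider's WSDL document.
SVC_NS = "http://example.org/sequence-service"

def build_request(accession):
    """Build a SOAP 1.1 request envelope asking for one sequence record."""
    env = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(env, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SVC_NS}}}getSequence")
    acc = ET.SubElement(op, f"{{{SVC_NS}}}accession")
    acc.text = accession
    return ET.tostring(env, encoding="unicode")

msg = build_request("NC_000913")
print(msg)
```

The resulting XML would be POSTed to the service endpoint with an appropriate SOAPAction header; toolkits such as Axis or SOAP::Lite hide exactly this layer from the developer.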

In e-science, a number of legacy data analysis tools are designed as command-line applications. Soaplab (http://www.ebi.ac.uk/tools/webservices/soaplab/overview), developed by EBI, is a SOAP-based web service utility used to wrap such command-line applications into web services. Recently, service-oriented computing middleware capable of supporting life science experiments has been developed. We believe that an ideal service-oriented architecture should allow service and data providers to publish their information into registries with semantically defined properties using domain ontologies; it should allow not only experts but also end-users to define their workflows at a high level of abstraction using vocabulary provided in the domain ontology; it should allow the execution of a workflow and the monitoring of the workflow execution process; and it should allow full or partial reuse of existing workflows and support data provenance management. Several workflow management systems have been developed to meet this goal.

Discovery Net (http://www.discovery-on-the.net/) is a service-oriented computing system based on an open architecture re-using common protocols and common infrastructures, such as the Globus Toolkit, for knowledge discovery. It is a multidisciplinary project serving application scientists from various fields including biology, combinatorial chemistry, renewable energy research, and geology. The system allows service providers to publish data mining and data analysis software components as services. It allows data owners to provide interfaces and access to scientific databases, data stores, sensors, and experimental results as services. It also allows users (scientists) to plan, manage, share, and execute complex knowledge discovery and data analysis procedures. Besides re-using common protocols and infrastructure, Discovery Net defines its own protocol, DPML (Discovery Process Markup Language), for constructing and managing knowledge discovery procedures, as well as recording their history. A defined data analysis task (scientific workflow) can be executed on distributed resources, stored, shared, and re-executed.

Pegasus (http://pegasus.isi.edu/) [34] [23] [2] is a framework that enables the mapping of complex scientific workflows onto the Grid. In the Pegasus system, an abstract workflow is a workflow in which the workflow activities (software components) are independent of the Grid resources used to execute them. The abstract workflow depicts the main steps in the scientific analysis, including the data used and generated, but does not include information about the resources needed for execution. Abstract workflows can be constructed using Chimera VDS (the GriPhyN Virtual Data System, http://www.ci.uchicago.edu/wiki/bin/view/vds/vdsweb/webmain) or written by users from a workflow template. The concrete workflow represents an executable workflow that includes the details of the execution environment. It also includes the necessary data movement to stage data in and out of the computations. Other nodes in the concrete workflow may include data publication activities, where newly derived data products are published into the Grid environment. A major focus of research on the mapping of abstract workflows to concrete workflows in the Grid computing environment is how to find an appropriate currently registered resource for each step. Extra service components, such as data transfer and data registration in the grid environment, may have to be encapsulated in the workflow. This mapping process may be automated with algorithms and AI planning technologies if the resources are semantically well described. During the mapping process, the workflow may be restructured, reordered, and refined to improve the overall performance and to

adapt to dynamically changing execution environments. The concrete workflow can be given to Condor's DAGMan (http://www.cs.wisc.edu/condor/dagman/) for execution.

mygrid (http://www.mygrid.org.uk) is service-oriented computing middleware for supporting life sciences researchers in the construction, execution, and sharing of scientific workflows using the Taverna (http://taverna.sf.net) workbench. Researchers can use the graphic workbench to drag and drop service components into the model explorer. Recent mygrid developments focus on supporting users in the discovery and composition of services by using rich service annotations to make workflow design more accessible to non-expert users. With incorporated semantic web technology, services and workflows can be described using domain-specific ontologies. This is a valuable capability in a system potentially searching over thousands of services. Instead of locating available Grid resources, the semantic-web-enabled service annotation and discovery in mygrid is used to locate software components or data that are exposed as web services. The executable workflow is written in the XScufl language and executed in the Freefluo workflow engine. Life sciences researchers can monitor the execution status through the Taverna workbench. In the mygrid system, the Feta data model is used to represent the semantic description of available services [50]. Web services are annotated with terms from an OWL-based mygrid domain ontology [103] using a GUI-based interface, Pedro [33]. This approach is more lightweight than the OWL-S and WSMO ontologies, although it is less expressive of details that could better support the automation process. Although the description methods adopted in mygrid have limited expressivity, they are sufficient for describing most services, and their simplicity makes them more practical

for describing a large number of services.

The IRIS project [74] also targets the discovery, composition, and interoperability of services required within in silico life science experiments. The IRIS project uses a semi-automatic procedure for identifying and placing customizable adapters (mediators) into workflows built by service composition. In IRIS, the capabilities of a mediator are described using the Mediator Profile Language (MPL), developed as a top-level ontology using the Web Ontology Language (OWL).

BioMoby (http://biomoby.open-bio.org/) is an open source research project which aims to generate an architecture for the discovery and distribution of biological data through web services [101]. Decentralized data and services are registered at a centralized registry called MOBY Central. The BioMOBY project focuses on service description, discovery, transaction, and simple input/output object type definitions. This foundational set of functionality allows client programs to expand the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY). The BioMoby project recently integrated access to many BioMoby features into the Taverna workbench interface using a Taverna plug-in; users are guided through the construction of syntactically and semantically correct workflows from the graphic interface [44].

The Open Middleware Infrastructure Institute UK (OMII-UK, http://www.omii.ac.uk/) is a project that aims to provide software and support to enable the collaboration

of building infrastructure for the UK e-science community and its international collaborators. The OMII environment integrates other open-source software components to provide users with a secure web service hosting and execution environment. Users can deploy web services at different levels in the OMII server architecture: as a normal Axis web service or as a secure web service with WS-Security support. GridSAM provides a web service for submitting and monitoring jobs managed by a variety of Distributed Resource Managers (DRMs); its modular design allows third parties to provide submission and file-transfer plug-ins to GridSAM. OMII also integrates GRIMOIRES, a registry service, to provide descriptions of services and workflows. The GRIMOIRES implementation extends the UDDI specification and provides not only syntactic but also semantic descriptions. The OGSA-DAI middleware provides data integration and a secure infrastructure for exposing data resources as web services in a grid or any other context. WSRF::Lite, which follows on from OGSI::Lite, is a Perl implementation of the Web Services Resource Framework (WSRF), which was inspired by and supersedes the Open Grid Services Infrastructure (OGSI). WSRF::Lite provides support for the following web service specifications: WS-Addressing, WS-ResourceProperties, WS-ResourceLifetimes, WS-BaseFaults, WS-ServiceGroups, and WS-Security.

2.5 Conclusion

In this chapter, we introduced several concepts related to SOA and discussed the integration of these technologies to solve some open issues in SOA research. Applying semantic web technology is intended to automate the web service discovery and composition process with little or no human guidance. The challenges are: 1) defining a high quality domain ontology; 2) interoperability of the

ontology among different domains; 3) correct annotation of large numbers of web services and data using the ontology; and 4) an agreed-upon definition of service composability, soundness, and scalability. AI planning technologies for the service composition process have largely been studied at the theoretical level and are often demonstrated with a well-defined, small domain, such as a travel agency, instead of large real world applications. Services provided in the Grid architecture, in particular the Globus toolkit, can be exposed with a web services interface and composed into a workflow. When combined with Grid computing technology, this allows the creation of virtual organizations and groups, provides a service-oriented architecture that is more efficient and flexible in resource allocation and data transfer (such as the gftp tool), and enables an increased level of privacy inside and between virtual organizations. As Grid computing and service-oriented architecture converge, many standards and specifications are constantly being expanded, updated, refined, and made obsolete rather rapidly, and it is hard to keep up with these evolving standards and specifications. For example, the Open Grid Services Infrastructure (OGSI) was published by the Global Grid Forum (GGF) as a proposed recommendation in June 2003. It was intended to provide an infrastructure layer for the Open Grid Services Architecture (OGSA). OGSI is now obsolete and has been superseded by the Web Services Resource Framework (WSRF). With the release of GT4, the open source toolkit is migrating back to a pure web services implementation (rather than OGSI) via integration of WSRF. Applying peer-to-peer technology can help to avoid central points of failure and increase scalability during the service discovery and workflow execution processes. Service-oriented computing is a new research area, with many in-progress

frameworks and middleware, workflow specifications, WS-* standards, and ontological representations that have been presented without complete tool support. Many research areas still need to be addressed in order to build a complete, reliable, and ideal service-oriented architecture.

CHAPTER 3

A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS ENVIRONMENT FOR BIOINFORMATICS RESEARCH

In this chapter (portions of which appear in the 40th Annual Hawaii International Conference on System Sciences, HICSS-40, Hawaii, 2007 [110]), we present a practical experiment in building a service-oriented system upon current web services technologies and bioinformatics middleware. The system allows scientists to extract data from heterogeneous data sources and generate phylogenetic comparisons automatically. This can be difficult to accomplish using manual search tools, since sequence data is rapidly accumulating and those manual tools need to be repeatedly invoked as new data becomes available. A web-based environment enables scientists to more effectively define a task, perform the task at a desired time, monitor the execution status, and view the results. The first prototype of this system is evaluated on a phylogenetic research application, Mother of Green (MoG). Our evaluation demonstrates that a service-oriented architecture can accelerate scientific research, increase research productivity, and provide a new approach to doing science. We also discuss issues in the design and implementation of the system and identify future research directions for enhancing it.

3.1 Introduction

As biological research becomes increasingly data driven, scientists are conducting experiments using the cyberinfrastructure (in silico experiments) to gather information from public online databases and to test their hypotheses. These heterogeneous, independently developed data sources make traditional approaches insufficient for this type of research and experimentation. Complex queries against several of these databases may provide valuable new insights, but interoperability problems make this difficult. The researcher must often manually cut and paste data from one database resource to another and repeatedly use multiple tools to format and analyze the data, a process that may take days or weeks. In many investigations, the process stops once the scientist requires a workflow that is not feasible using manual retrieval and analysis. There is a demand for a methodology that frees users from having to locate the data sources, interact with each data source, and manually combine data in multiple formats from multiple sources. A promising solution for achieving seamless interoperability among these data sources and analysis tools relies on the emerging technology of service-oriented architecture (SOA). SOA has been recognized during the past few years as an approach to achieving interoperability among multiple data sources [91] [92]. Many large bioinformatics database providers, such as NCBI, EMBL, and DDBJ, already make their databases available via a SOA. Emerging toolkits and platforms, such as Soaplab [87], enable many data analysis tools to be wrapped as web services. These existing services permit software engineers to build unified interfaces for scientists to access heterogeneous data sources. The platform-independent nature of SOA makes it a feasible solution for integrating increasingly available data analysis tools.

While protocols, toolkits, and middleware are increasingly available to address the majority of the technical issues in building a data integration and data analysis environment, the question of how real world problems can be solved successfully using these technologies needs to be answered through practical implementations in a real world context. In this chapter, we describe the design and implementation of a web-based data integration and analysis environment. The underlying infrastructure is built upon current web service technologies and bioinformatics middleware to enable biologists to better utilize heterogeneous genomic data. The first prototype of the system is used in a phylogenetic research application, the Mother of Green (MoG). MoG is a collaborative research project on plastid phylogenetic analysis involving information technologists and biologists. Genomic sequence data is accumulating faster than scientists can find and analyze it using manual search tools. The SOA-based platform allows scientists to extract data and analyze phylogenetic comparisons automatically. The web-based environment enables scientists to more effectively define a task, perform the task at a desired time, monitor the execution status, and view the results. The overall aim of this project is to provide an easy-to-use environment for biologists to research the puzzle of plastid phylogeny and to answer an open question on the phylogenetic history of the plastid genome. In the rest of this chapter, we briefly review web service technologies and related work, followed by an overview of the MoG project and a description of the overall system architecture. We then describe a prototype implementation of the system, related issues, and extensions of the system.

3.2 Related work

The service-oriented architecture (SOA) was proposed initially as an emerging paradigm for business process integration inside or across organization boundaries. It is gaining significant attention from the scientific research community for use in building e-science infrastructures. The proposed standard in grid computing, the Open Grid Services Architecture (OGSA) [63], is built upon service-oriented architecture and demonstrates the convergence of the Grid with SOA. Three basic standards in SOA, the Simple Object Access Protocol (SOAP), the Web Services Description Language (WSDL), and Universal Description, Discovery and Integration (UDDI), are sufficient for providing simple atomic services. However, single atomic services are not adequate for developing complex applications. One of the most important features of SOA is that services developed by different groups can be combined into a workflow to solve complicated problems. This feature leads to several research issues and challenges, including service discovery, service composition, and service enactment. Semantic web technology [54] [7] and peer-to-peer technology are used in SOA to automate the service discovery process and make service enactment more reliable.

BioMOBY is an open source research project which aims to generate an architecture for the discovery and distribution of biological data through web services [101]. Decentralized data and services are registered at a centralized registry called MOBY Central. The BioMOBY project focuses on service description, discovery, transaction, and simple input/output object type definitions. This foundational set of functionality allows client programs to expand on the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY). REMORA [14] is a web server implementation based on the BioMOBY service specification. It provides life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows, and a survey system.

Another project, mygrid, provides e-science application developers with a toolkit based upon a high-level middleware layer. It builds on and extends the Grid framework of distributed computing through a SOA. It provides not only a semantic based service discovery system but also the Taverna workflow bench [65], personalized data repositories, provenance, and update notification. The direct users of mygrid are those who build applications using the mygrid toolkit [94]. Compared to the BioMOBY project, mygrid has more ambitious goals. Bioinformaticians, tool builders, and service providers can collectively or selectively employ these middleware services to produce applications that support research in the biological and life sciences [36].

The IRIS [74] project is another active project that targets the service discovery, composition, and interoperability of services required within in silico experiments. The IRIS project handles this problem through a semi-automatic procedure for identifying and placing customizable adapters into workflows built by service composition.

IBM alphaWorks provides two applications, the Web Service for Bioinformatic Analysis Workflow (WsBAW) [106] and the Bioinformatic Workflow Builder Interface (BioWBI) [9], that allow life science researchers to build and execute bioinformatics workflows and share their analysis processes. WsBAW automates bioinformatic workflows by deploying a web service; it consists of a client application through which users are able to send batch requests to a specific bioinformatic workflow execution engine, such as BioWBI, by using a web service. BioWBI is an easy-to-use, web-based working environment from which a life sciences researcher and/or a research community can build and execute bioinformatic workflows and share their analysis processes.

3.3 Motivation

The motivating application is the phylogenomics of the plastid. Named the Mother of Green (MoG) project by a multidisciplinary team of computer scientists and biologists, MoG aims to identify the most recent common ancestor of all plastids. While many biologists support the view that all plastids are descended from a single endosymbiont ancestor, the data are not conclusive due to missing information and inefficient use of existing information. Using the nucleotide and amino acid sequences of expressed genes to infer ancient ancestral relationships, MoG investigators hope to identify which of the ancestral plastid genes have traveled into the host nucleus and why some genes are more likely to be transferred than others. The rate of data accumulation, the rapid development of new phylogenetic analysis tools, and the refinement of existing tools simply overwhelm the

researchers. The biologists need a better approach than manual or ad-hoc scripting to accumulate and analyze enough relevant data to rigorously test the single-ancestor hypothesis.

3.3.1 Use case

A typical phylogenetic analysis process consisting of multiple manual data collection and data analysis steps is described below and shown in Figure 3.1.

Figure 3.1. A manual phylogenetic data collection and data analysis process (A: query complete genome sequences given a taxon; B: query protein coding genes for each genome sequence; C: eliminate vector sequences; D: sequence alignment; E: phylogenetic analysis)

A) Biologists send a query to a data provider, NCBI for example, through a web-based interface to retrieve the whole genome sequence of a specified taxon. After recording the query terms and results, the investigator must examine the list of sequences, delete inappropriate entries, and then add new entries based on their knowledge of plastid phylogenomics or from sequences generated in their own lab.

B) For each whole genome sequence, biologists need to find specific protein coding genes, specific subunits of protein coding genes, or specific active sites within a particular gene or subunit. This is an iterative process for each entry in the list.

C) Each nucleotide sequence must be checked for vector sequences, a common contaminant of nucleotide sequences in unvetted public databases, and any detected vector contaminants removed.

D) Biologists then choose a subset of these genes and use a sequence alignment program (e.g., ClustalW) to align the sequences. After viewing the results, biologists may decide to choose another subset for sequence alignment analysis or continue the comparison using phylogenetic tree building tools.

E) Once the initial sequence alignment results prove satisfactory, biologists convert the alignment output to the appropriate data format required by the phylogenetic analysis programs, such as PAUP or Phylip.

3.3.2 Operational barriers

The data retrieval and data analysis processes need to be repeated multiple times, as different hypotheses are evaluated and new data pours into the public databases. From an operational perspective, this repetition makes the research process time consuming or even impossible using manual approaches. Other barriers make this particular scientific research process even more difficult.

Data collection. The capabilities offered by a data retrieval system cannot always meet the requirements of scientists. Entrez [61] is a web-based data retrieval system available from NCBI that provides integrated access to multiple databases covering a variety of data domains, including complete genomes, nucleotide and protein sequences, gene sequences, three-dimensional molecular structures, literature, and more. However, scientists are sometimes unable to get the desired information with a simple query. For instance, finding all of the subunits of the plastid ATP synthase requires that the investigator first identify the official protein names of the many subunits (atp alpha, atp beta, atp gamma, atp delta, atp epsilon, and so on) of the plastid-specific ATP synthase. The next step is to retrieve these sequences for each new genome and to merge these data with the data previously retrieved.

Analysis tool usage. Each data analysis program may have different requirements for input data formats, even among programs providing similar functionality. Correct use of these programs and correct implementation of this workflow rely heavily on the researcher having detailed knowledge and understanding of each tool. A typical work unit might be: find all of the sequences for the ATP synthase alpha subunit that are most similar to the sequence found in Prochlorococcus, align the sequences using ClustalW, save that output, then reformat the data and submit the sequences to Phylip for phylogenetic analysis. The output from one data analysis program needs to be fed into the next program as its input, with appropriate conversion to the required data format. The rapid development of new data analysis tools and the refinement of existing tools make the manual data conversion process even more difficult.

Experimental record keeping. Accurate recording of an in silico investigation, including materials, methods, and results, is as important as accurate recording of bench top or field experiments. Keeping the provenance data, including the input, output, and intermediate data sets, is also critical. Manual organization of these metadata quickly approaches impossibility for anything but the most trivial of queries. An easy-to-use environment is essential to support the automation of deep phylogenetic analysis. For many years the data were sparse. Now mountains of data exist, but our limited 20th century tools do not properly equip us to mine for the gems within them. Automation has become necessary.

3.4 System architecture

The whole system, MoGServ, includes an underlying infrastructure, the MoGServ middle layer, and a web-based environment that provides an easy-to-use interface for scientists to access functions provided by the middle layer. The system acts as both service consumer and service provider in the context of SOA. While it consumes and aggregates services provided by other service providers, the system also provides services that can be used and integrated by other applications.

There are two roles in the design and implementation of the system: end-users and software developers. End-users are biologists who focus on the study of what information needs to be gathered and what data analysis needs to be performed. The software developers are responsible for several tasks based on end-user requirements: collecting and annotating available services; creating services to implement functions in the specific application; building workflows to automate a variety of tasks required by end-users; providing a flexible, high performance, fault-tolerant infrastructure to execute the workflows; providing a mechanism for end-users to keep track of the origin of the data (data provenance); and providing end-users with a web interface to configure a task, monitor the execution status, and

view results. An overview of the MoGServ system architecture is given in Figure 3.2.

Figure 3.2. The MoGServ system architecture includes a service access client (web interface), the MoGServ middle layer (data access and data analysis services, application server, local data storage, job manager and job launcher, service/workflow registry, metadata search, and workflow/SOAP engines), and external data and service providers such as NCBI, DDBJ, and EMBL

3.4.1 Data storage and access service

Data collection from multiple distributed data resources is one of the first steps of a bioinformatics research project. In the MoG project, an in silico experiment involves the collection of large data sets, a computational and memory intensive process that involves daily checking for new information and quality control for each new sequence detected. Some data service providers limit the number of connections to their data servers for performance reasons. The refresh rate of the

data in a data source is much lower than the rate of end-user requests for the data. Therefore, local data storage is required to hold biological data collected from remote data providers, to avoid repeated vetting of the same data, and to ensure access to the data for time sensitive projects. The biological data from remote data sources is gathered, aggregated, and integrated into the local database through a set of data access services. An in silico experiment also requires the integration of results from numerous data analysis tools. Recording the intermediate data in the local database allows MoGServ to preserve the data provenance and gives end-users the opportunity to keep track of where a piece of data has come from. The information stored in the local database can be accessed through a set of data access services.

3.4.2 Service and workflow registry

A service and workflow registry provides a repository to store descriptions of services and workflows that may be used in a phylogenetic study. These services and workflows include both locally constructed and preexisting services. The registry also provides functions to allow inquiries about services or workflows. In the first prototype, neither a UDDI-based registry nor semantic-based descriptions are employed. A UDDI-type registry is business-oriented and not a perfect fit for this application, while a semantic-based description requires the time-consuming definition of a commonly accepted ontology. The current registry is a simple table with a focus on capturing both functional and non-functional properties of services and workflows to support service selection, service and workflow enactment, and provenance data representation. Semantic-based description and inquiry provides the attractive capability of automating service discovery and will be used in the

next version of MoGServ. The description of a service or workflow includes the attributes shown in Table 3.1.

TABLE 3.1

ATTRIBUTES FOR SERVICE AND WORKFLOW DESCRIPTIONS

  id                  a unique sequence number assigned to the service/workflow during the registration process
  name                the name of the service or workflow
  text description    description of the functions provided by the service or workflow
  location            the URL of the workflow definition, or the WSDL location of the service
  input/output        description of input/output parameters
  provider            the name of the service or workflow provider
  version             the version of the service or workflow implementation
  algorithm           the algorithm used in the service or workflow implementation
  invocation method   the method used to execute the service or workflow

When end-users view results from their experiments, they may ask which algorithm was used to generate the data and what the source of the data is. Service consumers may also prefer a service or workflow based on their preference for a particular algorithm or provider. For example, a sequence alignment service can be implemented using either the Sequence Alignment and Modeling System (SAM) or ClustalW.

3.4.3 Indexing and querying metadata

The data is best managed with a relational database; however, for searching purposes, an indexer is more efficient. We identify and extract metadata about

the actual data: sequences, experiments, and service and workflow descriptions in the local database. For example, the metadata of a gene sequence includes its gid, accession number, sequence name, source organism, and taxonomy. An experiment can generate results that lead to new or more detailed information requirements and a new series of experiments. End-users may need to know the origin of a piece of data: which query was used to get this subset of sequences, when the data was generated, and what process was used to generate the results. This may lead to new experiments using different data sets or even different methods. These metadata are extracted and indexed by a metadata indexing service, which is triggered when new data is added to the database. A metadata searching service provides functions to query an index.

3.4.4 Service and workflow enactment

The system supports both synchronous and asynchronous invocation methods. Synchronous invocation is mostly used for services or workflows with short running times, e.g., querying sequence data or job information in the local database. Asynchronous invocation is used for executing long-running services and workflows. As shown in Figure 3.3, the job manager accepts the input parameters of the service/workflow, the service/workflow id, and a timer. The definition of the service or workflow is found in the registry. A job description, including the service or workflow URL, input parameters, timer, and other metadata about the job (such as when and by whom it was submitted), is stored in the database, and a job id identifying the job is generated. The job launcher periodically checks the database to retrieve a service or workflow which needs to be executed at a time

point. Multiple workflow engines are deployed on different nodes to prevent single-engine failure and achieve higher performance. A similar mechanism is used for deploying long-running services to prevent service failure. Each node hosts a service that is responsible for returning the current load information of the node. This information is used by the job launcher to dispatch a job to an optimal node. With the SOA, it is easy to distribute and invoke workflows and services remotely. The execution status of the workflow or service is recorded in the database as an attribute of the job description. This information can be used for implementing failure recovery functions, such as restart. The job information, accessed through data access services, allows end-users to monitor the execution and view the results.

Figure 3.3. Asynchronous service and workflow invocation model: the job manager looks up the service/workflow definition in the registry by task name, forms a job description from the input parameters, task name, and timer, and returns a job id; the job launcher later dispatches the job to an instance of a workflow/service engine
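The asynchronous enactment path described above can be sketched as follows. This is an illustrative, in-memory sketch, not the actual MoGServ implementation: the class, field, and status names are hypothetical, and a hash map stands in for the jobs table in the local database. The job manager records a job description and returns a job id immediately; the launcher later collects jobs whose timer has expired and dispatches each to the engine node reporting the lowest load.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class EnactmentSketch {

    static class JobDescription {
        String jobId;
        String serviceOrWorkflowUrl;  // looked up in the registry by task name
        String inputXml;              // single XML input parameter
        long dueTime;                 // timer: earliest execution time
        String status = "PENDING";    // recorded for monitoring and restart
    }

    // stands in for the jobs table in the local database
    private final Map<String, JobDescription> jobs = new HashMap<>();

    // Job manager: store the job description, hand back an id immediately.
    public String submit(String url, String inputXml, long dueTime) {
        JobDescription job = new JobDescription();
        job.jobId = UUID.randomUUID().toString();
        job.serviceOrWorkflowUrl = url;
        job.inputXml = inputXml;
        job.dueTime = dueTime;
        jobs.put(job.jobId, job);
        return job.jobId;
    }

    // Job launcher: collect jobs whose timer has expired and pick the
    // engine node reporting the lowest load for each of them.
    public List<String> dispatchDue(long now, Map<String, Integer> nodeLoad) {
        List<String> dispatched = new ArrayList<>();
        for (JobDescription job : jobs.values()) {
            if (job.status.equals("PENDING") && job.dueTime <= now) {
                String bestNode = null;
                for (Map.Entry<String, Integer> e : nodeLoad.entrySet()) {
                    if (bestNode == null || e.getValue() < nodeLoad.get(bestNode)) {
                        bestNode = e.getKey();
                    }
                }
                job.status = "RUNNING on " + bestNode;
                dispatched.add(job.jobId);
            }
        }
        return dispatched;
    }

    public String status(String jobId) {
        return jobs.get(jobId).status;
    }
}
```

The recorded status field is what allows the monitoring and restart functions mentioned above to be layered on afterwards.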

3.5 Implementation

3.5.1 Development and deployment tools

Among the large number of programming platforms for web services development and deployment, Microsoft's .NET and Sun's J2EE are typically the two main choices for application and middleware developers. In consideration of future extensions of the system, as well as our previous experience with Java, the J2EE-based platform appeared more suitable for MoGServ. In particular, Apache's open source tools Tomcat (5.0.18) and Axis (1.2RC2) are used. Tomcat and Axis are active projects with support from the open source community. Another open source tool, Eclipse, is used to develop the web interface for the system.

There are more than a dozen proposed languages to coordinate messaging and transactions among independent web services. The Business Process Execution Language for Web Services (BPEL4WS) is a promising workflow language since it has wide support from IBM, Microsoft, and BEA. Several workflow enactment engines, such as BPWS4J, Collaxa, and ActiveBPEL, are already in place to support the execution of workflows. While a business-oriented workflow language and corresponding execution engine can be used in the scientific domain [20], the Taverna [65] project possesses more attractive features and naturally fits the development of our system. The Taverna project is open source and a part of the mygrid project developed in the e-science community to support data-intensive in silico bioinformatics experiments. The Taverna workbench provides a graphical tool for building, editing, and browsing workflows and generates an XML-based Simple Conceptual Unified Flow Language (Scufl) document. The embedded workflow execution engine, Freefluo, facilitates testing during the development process.

Freefluo, a Java workflow enactment engine supporting the Scufl specification, coordinates the execution of the parallel and sequential activities in a workflow and supports data iteration and nested workflows. The enactor can invoke arbitrary WSDL-type service operations as well as more specific bioinformatics service operations such as Soaplab and BioMoby.

Apache Lucene [51] is used in our system to build a search engine supporting full-text search over sequence data, intermediate results, and job information stored in the local database. Since Lucene is a search engine library written entirely in Java rather than a command line toolkit, it provides the flexibility to write a variety of applications with rich search capabilities, including ranked searching, phrase queries, wildcard queries, proximity queries, fielded searching, and so on. PostgreSQL (8.0) is used to store all the intermediate results, job information, sequence data, and service/workflow descriptions.

3.5.2 Services provision

We create web services using the RPC style due to its easy implementation and full support from most tools. As most bioinformatics applications take a number of input parameters and produce a number of outputs, we use an XML document to represent the input/output of a service for which a large number of parameters are needed. The XML document is provided as a single input parameter to the service or workflow, and the output results are produced as a single XML document. Using this method, service consumers themselves create a valid and accurate XML document for input, while service providers parse the XML and extract the input parameters.
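The single-XML-parameter convention can be sketched with the standard JDK DOM APIs. This is an illustrative sketch only: the `<input>`/`<param>` element names are hypothetical, not the actual MoGServ schema. The consumer packs all parameters into one document, and the provider parses that document to extract them, so the RPC-style service needs only one string parameter regardless of how many inputs the underlying tool takes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlParameterSketch {

    // Service consumer side: pack all parameters into one XML document.
    public static String buildInput(Map<String, String> params) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("input");
        doc.appendChild(root);
        for (Map.Entry<String, String> e : params.entrySet()) {
            Element p = doc.createElement("param");
            p.setAttribute("name", e.getKey());
            p.setTextContent(e.getValue());
            root.appendChild(p);
        }
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Service provider side: parse the document and extract one parameter.
    public static String extractParam(String xml, String name) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = b.parse(new InputSource(new StringReader(xml)));
        NodeList params = doc.getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element p = (Element) params.item(i);
            if (p.getAttribute("name").equals(name)) {
                return p.getTextContent();
            }
        }
        return null;
    }
}
```

A service signature then stays stable as tools gain or lose parameters; only the document contents change.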

Multiple services are created and deployed on the Tomcat/Axis server using the Java2WSDL and WSDL2Java toolkits. Individual services can be invoked statically or dynamically through a client-side application. They can also be used as building blocks in the workflow creation process. We separate the services provided in the first prototype into the following categories.

Data collection. The original data source is NCBI. NCBI's Entrez Programming Utilities (eUtils) provide access to Entrez data outside of the regular web query interface and help with retrieving search results for later use in other environments. With the eUtils SOAP interface, we create services to get data such as complete genome sequences and specific genes of interest.

Query local database. All the intermediate data and job information are stored in the local database to help biologists keep track of data provenance and monitor job execution. In this particular application, biologists are also interested in selecting sequence subsets from the local database and using sequence alignment services to do preliminary comparisons. A set of services is implemented to query the desired information.

Indexing and querying metadata. The creation and update of each of these indices is done by a service operation. The index service is triggered whenever new data is stored in the database. The query service accepts a query string and an index name, searches the index, and returns the output.

Data format services. Each particular data analysis tool used in a bioinformatics study requires a specific data format as input. A set of data format services is implemented to convert data into the appropriate format. This type of service can be used in a workflow creation process or used explicitly.

Data analysis services. Many existing data analysis tools in bioinformatics research are available as command line applications. The creation of a data analysis service is a process of wrapping these toolkits as web services. JLaunch [42] is a lightweight Java library for launching command line applications from Java programs. With the JLaunch library, we can write Java programs to execute any type of command line program.

3.5.3 Workflow engine

The Freefluo workflow engine is deployed on an application server. The invocation of the workflow engine is done by generating a local stub specific to the Freefluo web services API. The local stub is implemented as part of the job launcher in our system. The execution of a workflow on the Freefluo engine follows these steps: 1) obtain a proxy to the remote Freefluo server; 2) create a Scufl model; 3) pass an XScufl workflow to the Scufl model and form the input using the Baclava data model, a representation of Taverna data types (see http://taverna.sourceforge.net/index.php?doc=usingbaclava.html); 4) compile the XScufl workflow as a workflow instance; 5) execute the workflow instance and obtain an ID from the server; 6) poll the Freefluo engine until the execution has completed; 7) retrieve a list of outputs from the server; 8) extract the required output from the Baclava data model; and 9) destroy the workflow instance.

3.5.4 Building workflows

A Scufl workflow represents a procedure as a set of processes and the relationships between these processes. Our workflow design uses available services as building blocks whenever possible and creates new ones when necessary. The

Taverna workbench provides a graphical tool to build and test workflows, as well as a number of integrated bioinformatics services. The Scufl language has some useful features, such as implicit iteration and conditional branching, that are important for building workflows in this application. During the construction of workflows, we often encounter the case that the output of one service does not completely fit the input of the next chosen service. One approach we take is to create a new service, such as the data format services described above, and expose it in the same way as other services. An alternative approach provided by the Taverna workbench is to use Beanshell scripts [4] to convert the output into the appropriate input.

We create a number of workflows using the Taverna workbench to support the research. One example is shown in Figure 3.4: a workflow used to retrieve a complete genome sequence and particular gene sequences from the NCBI site. The workflow accepts two inputs, the query term and the particular gene group. The service genome gids by terms returns a String of gids, and a Beanshell script converts the String into a list of gids. The service Get Nucleotide Fasta, a third-party service, accepts a gid and returns a sequence in FASTA format. The implicit iteration method in the XScufl workflow enables iteration over all the gids in the list. With the service-oriented architecture, the same services can be used in different workflows, minimizing the need to create new services.

3.5.5 Web interface

The web interface provides scientists a convenient way to configure their tasks, monitor job execution status, and view results. It is implemented with a number of server-side JSPs (Java Server Pages). The returned results

are transformed with appropriate XSLT stylesheets into HTML pages. The service-oriented architecture provides the flexibility of building the front-end web application in a different language, e.g., Perl, and deploying it on a different web service engine, e.g., SOAP::Lite.

3.6 Discussion

Although current development and deployment tools have not yet implemented all the features claimed in the service-oriented architecture specification, they are actively evolving to make that happen. In particular, Apache Tomcat/Axis, the Taverna workbench, and the Freefluo engine enabled the implementation of our first prototype. In general, SOA offers considerable benefits for building the system: 1) the loose coupling of SOA facilitates the distribution of computationally intensive processes across multiple nodes; 2) the platform independence of SOA facilitates the integration of data from heterogeneous data resources through distributed web services; 3) the composition of services allows reuse of a service in multiple workflows, minimizing the need to create new services; and 4) SOA provides flexibility for building the front-end web application in different languages and deploying it on different web service engines.

While we believe a simple SOA architecture is appropriate for the design and implementation of our system, there are various aspects of the system that need to be improved. We summarize these issues and the directions for enhancing the system in this section.

3.6.1 Issues with the first prototype

Security. Although security was not a major concern during the first prototype implementation, it is an important component of the next one. The services and workflows provided in the system currently allow users to access its computational and data resources with no restrictions. A certain level of security is required to prevent abuse of the system and to protect sensitive data and analysis results. An authorization component should be built into the system to restrict users to permitted services and to let them personalize their own workspaces. A web portal will be built to enable users to create an account and log in and out with a username and password. The user account information, including the access level, will be stored in a database. The GridSphere portal framework [39], an open-source portlet-based web portal framework, is one of the candidates.

Service and workflow description and selection. In the first prototype implementation, the same development group plays both the service provider role (service/workflow creation) and the service consumer role (building the web-based application using those services and workflows). There is thus little demand for supporting the selection of appropriate services/workflows, and the major purpose of the index-based service/workflow registry is to keep track of data provenance and to provide the definitions needed to execute services/workflows. However, the index-based syntactic descriptions provide limited flexibility for third-party service consumers who wish to choose appropriate services/workflows from the system and integrate them into their own applications without prior knowledge.

Failure tolerance and recovery. The execution of a workflow or service may fail at some point due to failure of the enactment engine, failure of the service,

and failure of the network fabric [64]. Our system handles these failures during the static workflow design stage and the service or workflow invocation stage. Multiple workflow engines and long-running services are deployed at different physical locations. This allows a submitted task to be invoked on the most idle site to achieve higher performance; more importantly, it prevents dispatching services/workflows to an engine with a physical failure. Recording the execution status of long-running services/workflows in the database allows us to add policies for determining whether a failed service/workflow should be restarted. The Taverna workbench and XScufl provide a capability that allows users to specify an alternate service and to configure basic fault tolerance mechanisms during the workflow design stage, which can prevent service failures to a certain degree. Another more promising, yet more complicated, approach to failure recovery is to support the dynamic selection of alternate services at execution time. However, the implementation of this feature requires services to be described in rich semantic formats using a widely accepted ontology.

Data provenance. In the system, the metadata descriptions of sequences, job information, and services/workflows are stored in the database. A set of indexing and querying services allows end-users to trace the origin of the data, which is a desired feature for scientists. Also, the workflow engine and XScufl provide mechanisms to record more detailed information, including the type of processor, status, start and end times, and a description of the service operation. A systems administrator may be interested in using this information to investigate how results, in particular erroneous or unexpected ones, were produced by workflow processes.
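A restart policy of the kind mentioned above, layered on the execution status recorded in the database, might look like the following sketch. All names and status strings here are hypothetical, not the actual MoGServ implementation; the policy merely illustrates distinguishing transient failures (engine, network), which are worth retrying, from failures inside the service itself, which are likely to repeat.

```java
public class RestartPolicySketch {

    static final int MAX_RETRIES = 3;

    // status as recorded by the enactment layer for a job, e.g.
    // "COMPLETED", "FAILED:engine", "FAILED:service", "FAILED:network"
    public static boolean shouldRestart(String status, int attemptsSoFar) {
        if (!status.startsWith("FAILED")) {
            return false;              // nothing to recover
        }
        if (attemptsSoFar >= MAX_RETRIES) {
            return false;              // give up; flag for a human
        }
        // A failure inside the service itself is likely deterministic,
        // so retry it only once; engine and network failures are
        // transient and can be retried, ideally on a different node.
        if (status.equals("FAILED:service")) {
            return attemptsSoFar < 1;
        }
        return true;
    }
}
```

The job launcher would consult such a policy before resubmitting, using the per-job status and attempt count it already records.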

3.6.2 Extension of the system

Although the first prototype of the system focuses on design and implementation based on relatively mature service-oriented architecture technologies, we are extending the system to address some of the issues described above using grid computing and semantic web technologies. Grid technologies specify mechanisms for distributed resource management, coordinated fail-over, and security. As Grid technologies, and the Globus Toolkit [97] Grid framework in particular, evolve towards the OGSA standard, their integration into the system can help address some of the issues discussed above. The convergence of service-oriented architecture and Grid technology allows us to enhance the system through the integration of existing components.

In a scientific domain, the process used to generate the output of a service or workflow is often as important as the result. As is the case with bench scientists, in silico investigators will decide for themselves which methods and which data will be used for their study, as well as what kind of outputs they expect. In the first prototype implementation, this requirement is satisfied through close collaboration among team members. As this system will be used by a phylogenomics research community that spans multiple disciplines, different investigators will have their own methods for approaching problems of common interest. A mechanism that allows end-users to define the workflow at a higher level of abstraction is required. Instead of choosing specific services to form a workflow, scientists would rather define a workflow by specifying the functions that a service should provide. Different levels of training and experience also require different levels of abstraction. For example, a

graduate student in a particular research domain may have limited knowledge of the methods available to perform an experiment, while an experienced investigator may know ahead of time which building blocks are required and which approach is most efficient for the scientific hypothesis to be tested.

We represent the different abstraction levels in Figure 3.6. End-users may need to define the workflow at any one of these four stages based on their knowledge of the provided services. A concrete workflow, which can be sent to a workflow engine, is represented at the fourth phase. The conversion from the third phase to the fourth phase involves choosing an instance of a service using Quality of Service (QoS) metrics. One service interface may have multiple implementations provided by different service providers, and these implementations have different quality properties such as trustworthiness, cost, execution time, and so on. An optimal service should be chosen during this conversion process. The conversion from the second phase to the third phase requires mapping a particular task to a service, or to a sequence of multiple services. This mapping process can be accomplished manually by software developers in an ad-hoc way, as in the approach we took in the implementation of the first prototype. That approach relies heavily on the developers' knowledge of the services and of the logical ordering in the workflow. Preferably, this process should be performed partially or wholly automatically. In order to support such a semi-automatic or automatic process, a complete representation of knowledge should be in place to allow software agents to substitute for the work of the human. Using semantic web technology, in particular OWL and OWL-S, for the ontological representation of domain knowledge and the semantic description of services is a promising approach. Semantic web technology offers promising features for supporting bioinformatics research [12]. Some bioinformatics middleware, such as the mygrid and BioMoby projects, have their own approaches to supporting automated discovery and composition of services using semantic web technology [49]. Much research has explored AI planning techniques for automating the composition process. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58]. Although there are still practical difficulties in developing semantic web services, we believe that the appearance of tools for creating ontologies and annotating services [89], together with the development of widely accepted domain ontologies, will allow us to add semantics to our system and support the automation of the mapping process.

3.7 Conclusion

As both data and tool providers begin to present their resources with web service interfaces, and as open source tools and middleware for supporting web services, workflow generation, and enactment become more available, biologists will begin to use those available services, as well as to provide service access to their own databases and programs for sharing within the bioinformatics community [65]. Our system is a demonstration of progress toward this goal. In summary, current SOA standards and toolkits are sufficient to build the first prototype of MoGServ. MoGServ is in an early stage of development, with limited services and workflows available. The basic implemented functionality enables users to collect data and do preliminary data analysis as well as metadata searching. By using the system, scientists are able to gain scientific insights about the

alpha subunit of ATP synthase, showing that it retains the signal of a very ancient line of descent while having enough polymorphism to infer phylogenetic relationships [78]. Building the system upon the SOA gives us the flexibility to integrate services, to build a variety of workflows, and to build a web portal through which scientists access the system via a web interface. New features and services are continuously being added to the system in response to scientists' feedback and requirements. The future direction of our research will focus on enhancing the system using semantic web and grid computing technologies.
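The QoS-driven conversion from the third to the fourth phase described in Section 3.6.2 amounts to scoring candidate implementations of a service interface and picking the best one. The following toy sketch illustrates the idea; the field names, weights, and candidate data are hypothetical, not part of MoGServ:

```python
# Toy illustration of QoS-based service selection: among several
# implementations of the same service interface, choose the one with
# the best weighted score. All field names and weights are hypothetical.

def score(impl, weights):
    # Higher trustworthiness is better; lower cost and execution time are better.
    return (weights["trust"] * impl["trust"]
            - weights["cost"] * impl["cost"]
            - weights["time"] * impl["exec_time"])

def select_service(implementations, weights):
    # Return the implementation with the highest weighted QoS score.
    return max(implementations, key=lambda impl: score(impl, weights))

candidates = [
    {"provider": "A", "trust": 0.9, "cost": 5.0, "exec_time": 12.0},
    {"provider": "B", "trust": 0.7, "cost": 1.0, "exec_time": 8.0},
    {"provider": "C", "trust": 0.8, "cost": 2.0, "exec_time": 30.0},
]
weights = {"trust": 10.0, "cost": 1.0, "time": 0.1}

best = select_service(candidates, weights)
print(best["provider"])
```

In practice the weights would encode a particular investigator's preferences, which is one reason the selection step belongs at workflow-concretization time rather than at design time.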

Figure 3.4. A workflow built using Taverna workbench to get complete genome sequences and specific gene sequences

Figure 3.5. A workflow for querying two subset sequences from the local database, filtering out sequences coming from the same organism, and doing sequence alignment analysis Figure 3.6. Abstraction of user-defined workflows

CHAPTER 4 EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH MOGSERV In this chapter, we illustrate a research application that uses MoGServ to investigate the deep phylogeny of the plastids and attempts to answer an open question on the phylogenetic history of the plastid genome. 4.1 Introduction Plastids are important organelles found only in plants and algae. Chloroplasts are the photosynthetic form of a plastid. Like mitochondria, plastids have their own DNA and are involved in energy metabolism. Other forms of a plastid may be responsible for the storage of products like starch and for the synthesis of many classes of molecules, such as fatty acids, which are needed as cellular building blocks and/or for the functioning of the plant. Phylogenetics is the study of the evolutionary relationships among various groups of organisms. The origin and evolution of a group of organisms is called its phylogeny or phylogenesis. The endosymbiont hypothesis suggests that mitochondria were free-living bacteria that were engulfed and subsequently enslaved by a primitive ancestor of all living eukaryotes [27] [69]. Between 1.2 and 1.5 Ga (billion years ago), one

or more of these early eukaryotic cell lineages captured a cyanobacterium and produced three primary plastid lineages: the green plant lineage (chlorophytes), the red algal lineage (rhodophytes), and the glaucophyte lineage (a group of freshwater algae) [69]. Surviving endosymbionts include the green algal and red algal photosynthetic chloroplasts and the cyanelle, the endosymbiont in the glaucophytes that retains more of the character of its cyanobacterial progenitor. Plastids have also spread by secondary endosymbiosis, in which a cell engulfs a cell already containing an endosymbiont. In secondary endosymbiosis, the nuclear genome of the engulfed cell usually disappears. Seven lineages arose from green algae and red algae through secondary endosymbiosis [69]; see Figure A.1. During the evolution of these secondary plastids, their genomes were reduced by gene transfer into the nucleus [77]. The red algal lineage also includes organisms that have lost the capacity to photosynthesize but still retain a degenerate plastid. Apicomplexans, derived from the red algal lineage, are non-photosynthetic intracellular parasites whose members include Toxoplasma gondii and Plasmodium falciparum. Plasmodium falciparum (P. falciparum) is a protozoan parasite, one type of apicomplexan, which causes malaria in humans. P. falciparum has three genomes: nuclear, mitochondrial, and plastid (apicoplast). Phylogenetic analysis of plastid genes provides a new avenue for targeted antiparasitic drug design [31]. Organisms generally inherit genes from their parents (Vertical Gene Transfer), or receive genes from other organisms through Horizontal Gene Transfer (HGT) and Lateral Gene Transfer (LGT). Most plastid genomes are circular, do not recombine, and are inherited through only one parent. The highly conservative character of the plastid genome makes phylogenetic analysis possible. However, HGT and LGT in multiple endosymbiotic events complicate the phylogeny

of plastids. Another complication is gene duplication and loss within the plastid itself. While there is a broad consensus that all plastids are descended from a single endosymbiont ancestor, some researchers also suggest an alternative hypothesis of multiple origins that is at least equally consistent in most cases [95]. Plastid phylogenetic analysis must account for multiple endosymbiotic events, superimposed upon a process of LGT that occurs throughout the conversion of a free-living cell into an endosymbiont. Accumulating and analyzing enough data to rigorously test the single-ancestor hypothesis is a promising research direction. The development of advanced sequencing techniques makes a large amount of DNA and amino acid sequence data available for phylogenetic analysis. The commonly used methods for inferring phylogenies include parsimony, maximum likelihood, and Bayesian inference. The rate of data accumulation, the rapid development of new phylogenetic analysis tools, and the refinement of existing tools, however, make manually collecting and analyzing these sequences difficult. For example, the number of cyanobacterial sequences in the NCBI database increased from 42 to 57 within about six months (June 2006 - December 2006). Figure 4.1 shows the growth of sequence databases over the last few years [57]. In this chapter, we describe a scientific application that uses the cyberinfrastructure to collect and analyze data and to gain biological meaning from this analysis. The use of a web-based system, MoGServ, shows the ability to significantly increase a scientist's productivity over a manual process.

Figure 4.1. The growth of sequence databases (NCBI GenBank and EBI Swiss-Prot) and annotations. This figure is from Folker Meyer [57] 4.2 System and methods MoGServ is a service-oriented environment described in detail in Chapter 3. It facilitates scientific research and discovery in several ways: easy and rapid extraction of DNA and protein sequences from public databases into a local database, which saves scientists months of repetitive searching, downloading, and data management; painless reformatting of the extracted data for commonly used analytical tools; preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates, enabling the investigator to rapidly determine the suitability of the

chosen gene for deep phylogenetic analysis; user-specified additions to the local database, which allow the upload of sequences into the local database; and user-specified additions to the automated queries, which provide a free-text search interface for constructing data sets of interest. Deep phylogenetic analyses are highly context-dependent. The addition of a single new cyanobacterial or algal genome could fundamentally change the result. MoGServ permits an investigator to address these hypotheses using the most current data available and to rapidly reanalyze data as more genomes and genes are sequenced. This enables rapid hypothesis testing and creates an environment in which genuine discovery is possible. The most exciting form of discovery is the surprise result, the result that leads to an entirely new hypothesis. 4.2.1 Data model As web service technologies have been adopted by several large data source providers, such as DDBJ, EMBL, and NCBI, to expose their data and computational services, accessing up-to-date sequences has become more flexible and feasible. Due to the nature of accessing data sources via the Internet, however, the requirement of efficient and reliable on-the-fly data retrieval cannot be fulfilled easily. Additional requirements for data manipulation and information management cannot be met without local database support. Also, incomplete annotation and misannotation in biological data require biologists' expertise to ensure accuracy before this data is used for analysis. In order to provide the capability of storing, integrating, and accessing sequences from diverse data sources, a data model needs to be developed to meet

the following requirements: store sequences from distributed data sources and provide general annotation to facilitate querying; ensure the integrity of sequences during the data collection process and the efficiency of updating the database periodically; and provide an easy way for scientists to manipulate their data sets and manage their scientific experiment records and data provenance. The custom data model of MoGServ consists of four modules: a sequence module, a set module, a user module, and a job module. Figure 4.2 shows the entity-relationship (ER) diagram. An alternative data model, the Chado database schema, one of the components of GMOD 1, is the foundation of the interoperability of GMOD applications. We did not use the Chado database schema because it contains a large number of modules and tables that are not necessary for our system; it also does not model some of the information we are trying to capture in this system. A sequence in the system is a biological sequence that comes either from a public data resource or from laboratory experiments uploaded by scientists. A sequence can be a nucleotide sequence or a protein sequence. It can also be a complete genome sequence or a gene product sequence. Each sequence can be classified using the taxonomy defined by the public database and the terms that scientists used to find it. A set is a group of sequences that scientists put together to support their research interests and that is usually used for subsequent data analysis. 1 Generic Model Organism Database http://www.gmod.org/, an open source project to develop a set of software for creating and administering a model organism database The properties

of a set include not only the sequences in the set but also the provenance of the set. For example, a set may be created by users querying the local database, or generated from a previous data analysis. A job is defined as a task involving data collection or data analysis. It contains input, output, execution status, and other properties. 4.2.2 Services MoGServ provides a number of services to support deep phylogenetic research. These services are integrated in the system and accessible from a web interface. They can also be used as components integrated into a workflow. 4.2.3 Data collection Data collection is a suite of services used to retrieve desired data from public data providers into the local database. The data collection service updates the local database periodically. Users can define the query term using any NCBI query format from the web interface provided in the MoGServ system. Users can also define a particular gene name or gene product, as shown in Figure B.2. The retrieved data is indexed using the Lucene indexer and search engine [51], which supports free-text search. The search syntax is shown in Figure D.4. The data collection suite consists of five components: retrieve genome sequence, retrieve gene sequence, convert genome sequence to file, convert sequence names, and index sequences. When combined with the database model, these services ensure data integrity, data accuracy, data consistency, and exception handling. Data integrity: Since users are allowed to use any appropriate query term to

search the public database (NCBI), it is highly likely that the same sequence will be retrieved by different query terms. The classification and taxonomy of sequences in the public data sources also introduce duplicated sequences. For example, both query terms chloroplast and cyanobacteria get the sequence >gi|72381840|ref|NC_007335.1| Prochlorococcus marinus str. NATL2A; chloroplast, cyanobacteria, and plastid get the sequence >gi|42592260|ref|NC_003070.5| Arabidopsis thaliana chromosome 1; and apicoplast, chloroplast, and plastid get the sequence >gi|31442363|ref|NC_004823.1| Eimeria tenella chloroplast. The design of the data model ensures that the same sequence cannot be inserted into the table twice. However, each query term used to get the sequence should be recorded in the query-by-term field. This information may help scientists better understand the relationships and discover new insights. Data accuracy: Since scientists are interested in particular gene sequences that reside within a range of a complete genome sequence, instead of searching the NCBI gene database, we choose to parse the XML file of a complete genome sequence to get particular gene sequences and gene products. In this way, the accuracy of the data can be better guaranteed. The NCBI service provides the search result in XML format; an example is shown in Figure D.1. The data collection service provided in MoGServ parses the XML file to find the INSDFeature_key tag for each INSDFeature and then checks whether it is a CDS (CoDing Sequence, i.e., a region of nucleotides that corresponds to the sequence of amino acids in the predicted protein). The next step is to find the INSDQualifier_name and INSDQualifier_value pair for

the gene name (e.g., atpd) or gene product description (e.g., ATP synthase subunit B). Gene names and gene product descriptions of scientific interest are defined by users through the web interface. In most cases, gene names are enough to get the desired CDS. However, due to incomplete annotation of the CDS, a gene name may not be available for a particular CDS in the nucleotide sequence. The gene product description then becomes another criterion for obtaining the correct CDS; an example of a gene sequence in fasta and TinySeq XML format is shown in Appendix D.2. Exception handling: Since the data is retrieved from a remote data source, NCBI, using web service interfaces, failures may occur because of the network, hardware, or the services themselves. Recording the execution status of a data collection service is important for detecting and recovering from failures. Since the data collection service normally runs periodically as a batch job, we record the status in a log file on the file system. In order to reduce repetitive work when a failure occurs, we treat retrieving a single sequence as a transaction. In other words, we sacrifice the database I/O performance that would be possible with batch-mode transactions. Data consistency: Data analysis is an important component provided by MoGServ. Different data analysis tools require different data formats of a set as their input. There are two ways to provide the desired data format: converting the data on-the-fly, or preparing the data and storing it in the database during the data collection process. The first approach is flexible; however, it may result in inconsistent naming of sequences. For example, the same sequence in set A may have a different name in set B. Therefore, we use an algorithm to map sequence names at the data collection

process. Each sequence has a fixed name for each format. Duplicate names are disambiguated by appending a numeric suffix to the name. 4.2.4 Local query After the desired sequences are stored in the local database, users need a way to find a subset of interest in order to perform further data analysis. The system provides an interface for users to query the local database using free-text search. The underlying search engine is built with the Lucene search library. The content in the index includes metadata used to describe a sequence, such as taxonomy, term, and name. For example, users can use the query atp synthase AND B AND plastid to get a number of sequences (see Figure B.4). Users can manipulate the returned sequences and group them as a set. Users can also download these sequences in a variety of formats. 4.2.5 Set management In order to help scientists prepare data sets for subsequent analysis, MoGServ provides set management services to: Create set: With an appropriate query to the local database, users can look through the list of sequences returned from the query and delete undesired sequences. Users can create a new set using these sequences. These sequences can also be added to an existing set. Upload set: Users can upload a set of sequences in fasta format into the local database. These sequences can be from users' own lab experiments, which may not be ready to submit to the public database. They can also be a

small number of sequences that are not in the local database at that time. These sequences are annotated using the appropriate metadata description. Show set: Users can query the information about a set, as shown in Figure B.6, such as the creation date, the origination of the set, etc. Download set: Users can download a set in a variety of formats, such as fasta and NEXUS. Set filter: This service provides the capability to find the intersection of the organisms (species) represented in a number of sets that contain gene or protein sequences from different species. The purpose of this service is to help scientists prepare data to determine whether the gene genealogies for the subunits are different. For example, scientists may be interested in determining whether the gene genealogies for the subunits α, β, γ, δ, ε of ATP synthase CF1 are different. The first step is to form five sets using queries such as ATP AND synthase AND delta AND CF1, and then to use the set filter service to find all organisms (species) that contain all of the gene or protein sequence types. These sequence sets are used in subsequent data analysis, such as using ClustalW to construct phylogenetic trees. While constructing a phylogenetic tree based on the analysis of a single gene or protein taken from a group of organisms (species) can be problematic, analysis based on multiple unrelated gene or protein sequences may increase the soundness of the results.
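The set filter described above amounts to an intersection over the organisms present in every set. The following minimal sketch illustrates the idea; the data structures and names are illustrative toys, not MoGServ's actual schema:

```python
# Illustrative set filter: given several sequence sets (one per subunit),
# keep only the organisms (species) that appear in every set, so that
# subsequent multi-gene analyses use a common sample of taxa.
# The dictionaries below are toy data, not MoGServ's schema.

def filter_common_organisms(sequence_sets):
    """Return the set of organism names present in all given sets.

    Each sequence set maps an organism name to its sequence for one subunit.
    """
    common = set(sequence_sets[0])
    for seq_set in sequence_sets[1:]:
        common &= set(seq_set)
    return common

alpha = {"Synechocystis": "MKT...", "Arabidopsis": "MAT...", "Eimeria": "MIL..."}
beta  = {"Synechocystis": "MVS...", "Arabidopsis": "MAS..."}
delta = {"Arabidopsis": "MDL...", "Synechocystis": "MDT...", "Plasmodium": "MNK..."}

common = filter_common_organisms([alpha, beta, delta])
print(sorted(common))  # organisms that have all three subunits
```

Only the organisms returned here would be carried forward into a combined multi-subunit alignment, which is what makes the downstream genealogy comparison meaningful.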

4.2.6 ClustalW Multiple alignments of sequences provide information to identify conserved sequence regions. ClustalW is a tool for global multiple alignment of DNA and protein sequences (across their entire length). EMBL-EBI provides a SOAP-based web service that allows programmatic access to the data analysis tool [72]. Two other services, T-Coffee and MUSCLE, are implemented using newer algorithms to improve accuracy and achieve higher performance. Based on users' preference, we integrated the ClustalW service into MoGServ. The integration of a service into the system is done by creating a new Java-based program that uses a web service interface to invoke the remote service. Instead of copying, pasting, or uploading a sequence file, users can set up the parameters from a web interface, as shown in Figure B.9. These parameters are combined into an XML file that is sent to the new program as input; an example file is shown in Appendix D.6. The input and output are stored in the database, so the information can be queried later; it is delivered and displayed with XML/XSLT. The output from ClustalW includes a phylogram tree, a cladogram tree, distances, and a ph file, depending on the parameter settings. The binary results can be viewed using a Java-based multiple alignment editor, Jalview [16]. 4.2.7 Blast The Basic Local Alignment Search Tool (BLAST) algorithm and its implementation at NCBI [1] constitute one of the most widely used bioinformatics programs. BLAST compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. With well-designed queries and alignments, the results of BLAST can be used to infer functional and evolutionary relationships between sequences and may provide important clues to the function of uncharacterized sequences. There are several alternative implementations, such as WU-BLAST 2, FSA-BLAST 3, and parallel BLAST 4, available for better performance with minimal loss of sensitivity. EBI and NCBI provide web-based WU-BLAST and/or NCBI-BLAST. However, these could not meet the requirements of this particular application in two respects: 1) a large number of sequences would need to be downloaded, copied, and pasted into the interface; 2) sequences can only be compared against databases at EBI or NCBI, so users cannot define their own data sets for comparison. BLAST requires two inputs: a query sequence (also called the target sequence) and a sequence database. BLAST finds subsequences in the query that are similar to subsequences in the database. Hosting the service on MoGServ eliminates these two limitations. Users can define the compare set and the database sets. The result is stored in the local database. The job information is accessible at any time. The service has two execution methods, synchronous and asynchronous, which are the same for every data analysis service provided in MoGServ. Similar to the ClustalW service, the Blast service accepts input in XML format, as shown in Appendix D.7. A tblastn web interface is shown in Figure B.8. 2 http://blast.wustl.edu/ 3 http://www.fsa-blast.org/ 4 http://www-users.cs.umn.edu/ rangwala/final bglblast.pdf
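BLAST's first step is to find short exact word matches ("seeds") shared between the query and the database sequences, which are then extended into scored local alignments. The following toy sketch illustrates only the seeding idea; it is not the NCBI implementation, and the sequences and k-mer length are made up for illustration:

```python
# Toy sketch of BLAST-style word seeding: index every k-mer of each
# database sequence, then report which database entries share a k-mer
# with the query, and where. Real BLAST extends such seeds into
# statistically scored local alignments; this sketch stops at seeding.

def kmer_index(seq, k):
    # Map each k-mer to the list of positions where it occurs in seq.
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index

def find_seeds(query, database, k=4):
    """Return {db_name: [(query_pos, db_pos), ...]} for shared k-mers."""
    hits = {}
    for name, seq in database.items():
        index = kmer_index(seq, k)
        for i in range(len(query) - k + 1):
            for j in index.get(query[i:i + k], []):
                hits.setdefault(name, []).append((i, j))
    return hits

db = {"seqA": "ATGGCGTACGTT", "seqB": "TTTTAAAACCCC"}
seeds = find_seeds("GCGTAC", db, k=4)
print(sorted(seeds))
```

The hosted service's value lies precisely in letting users supply both sides of this comparison, the query set and the database set, from their own local data.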

4.2.8 Phylip and Paup PAUP* 5 is a program for phylogenetic analysis using parsimony, maximum likelihood, and distance methods. The program features an extensive selection of analysis options and model choices, and accommodates DNA, RNA, protein, and general data types. Among the many strengths of the program is the rich array of options for dealing with phylogenetic trees, including importing, combining, comparing, constraining, rooting, and testing hypotheses. PAUP* uses the NEXUS file format, a modular format used by several programs. All versions require data and commands to be present in NEXUS format (with the exception that commands can additionally be executed interactively from the command prompt). PHYLIP 6 is a set of modular programs for performing numerous types of phylogenetic analysis. Individual programs are broadly grouped into several categories: molecular sequence methods; distance matrix methods; analyses of gene frequencies and continuous characters; discrete character methods; and tree drawing, consensus, tree editing, and tree distances. Together the programs accommodate a broad range of data types, including DNA, RNA, protein, restriction sites, and general data types. The programs encompass a broad variety of analysis types, including parsimony, compatibility, distance, invariants, and maximum likelihood, and also include both jackknife and bootstrap re-sampling methods. Therefore, for a typical analysis the user makes choices regarding each aspect of the analysis and chooses specific programs accordingly. Programs are run interactively via a text-based interface that provides a list of choices and prompts users for input. 5 http://paup.csit.fsu.edu/ 6 http://evolution.genetics.washington.edu/phylip.html

Phylogenetic trees generated by these phylogenetic analysis tools can be viewed using TreeView 7 [68]. TreeView is a simple program for displaying phylogenetic trees. Phylogenies may be displayed either as slanted or rectangular cladograms. TreeView provides a way to view the contents of NEXUS, PHYLIP, ClustalW or ClustalX, or other tree file formats. 4.2.9 Data conversion In a typical workflow, one program's output may be used as the next program's input. A data conversion step is therefore needed to make the output suitable as input for the next program. MoGServ provides a number of services to convert fasta format to ClustalW format, fasta format to NEXUS format, and so on. The program readseq 8, developed by D. Gilbert, reformats DNA or protein sequence data. It accepts single or multiple sequences in 18 different formats and converts them to a specified format. MoGServ integrates the readseq program as a service to convert the output from ClustalW to NEXUS format. 4.3 Results of case studies The evolution of ATP synthase is considered severely constrained; the structure of ATP synthase is shown in Figure A.1. This makes it a candidate for ascertaining deep phylogeny. The steps we use to test the hypothesis are first to identify individual subunit genealogies, and then to merge and reanalyze the data. 7 http://www.molecularevolution.org/software/treeview/ 8 http://iubio.bio.indiana.edu/soft/molbio/readseq/java/
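The kind of format conversion performed by the readseq service in Section 4.2.9 can be illustrated with a minimal fasta-to-NEXUS converter. This is a simplified sketch that handles only already-aligned DNA sequences; it is not a replacement for readseq and ignores the many edge cases of readseq's 18 supported formats:

```python
# Minimal sketch of fasta-to-NEXUS conversion for aligned DNA sequences.
# Illustrates the idea behind format conversion services such as readseq;
# the input data below is a toy example.

def parse_fasta(text):
    # Collect {name: sequence} from fasta text; the name is the first
    # whitespace-delimited token after ">".
    records, name = {}, None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.strip()
    return records

def to_nexus(records):
    # Emit a NEXUS data block; assumes all sequences have equal length.
    ntax = len(records)
    nchar = len(next(iter(records.values())))
    lines = ["#NEXUS", "begin data;",
             f"  dimensions ntax={ntax} nchar={nchar};",
             "  format datatype=dna gap=-;",
             "  matrix"]
    for name, seq in records.items():
        lines.append(f"    {name} {seq}")
    lines += ["  ;", "end;"]
    return "\n".join(lines)

fasta = """>taxon1
ACGT-ACG
>taxon2
ACGTTACG
"""
nexus = to_nexus(parse_fasta(fasta))
print(nexus)
```

Chaining such a converter between an alignment service and PAUP* is exactly the role the data conversion services play inside a MoGServ workflow.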

4.3.1 Case study: the rediscovery of Erythrobacter litoralis The MoGServ local database includes whole genome sequences from chloroplasts, cyanobacteria, plastids, and apicoplasts. The biological investigator hypothesized that the amino acid sequences of the chloroplast subunits of ATP synthase would be a good choice for a deep phylogenetic analysis, a departure from established procedures. DNA sequences from ribosomal genes, the protein-synthesizing machinery, are the traditional choice for deep phylogenetic analysis. Preliminary analyses of 33 taxa revealed that the α and β subunits of this enzyme have a stunningly high degree of amino acid sequence conservation across cyanobacterial genomes and chloroplast genomes from a wide array of algal taxa and green plants. As the nuclear genomes of the algal taxa are as phylogenetically distinct from one another as humans are from fungi, this result indicated that chloroplast ATP synthase was a suitable candidate enzyme and provided support for the single-ancestor hypothesis. The problem now was one of excessive conservation in the α and β subunits. Comparison of the most conserved region of the α subunit against all sequences at NCBI revealed that this region is so conserved that it matches that of an ATP synthase subunit in the mitochondrial genome. Phylogenetic evidence clearly indicates that mitochondria descend from a single bacterial ancestor and that this ancestor was related to the alpha proteobacteria, a group closely related to the cyanobacteria. MoGServ enabled the investigator to add to the already convincing evidence that the mitochondrial and chloroplast genomes are related. This was not the hypothesis of interest, but it led the investigator to try a different approach. The investigator then examined the amino acid sequence of the ɛ subunit of ATP synthase for the same 33 taxa examined previously. Sequence conservation

was evident but somewhat less than that seen in the α and β subunits. The local database query was relaxed to permit inclusion of the ATP synthase ɛ subunits of both cyanobacteria and alpha proteobacteria. More than a dozen proteobacteria were identified, all of which except one are nonphotosynthetic. The surprise bacterium was Erythrobacter litoralis. This organism is a facultative photoheterotroph, able to photosynthesize in the light and catabolize organic sources in the dark. It was found in the Sargasso Sea in 1994 and sequenced in 2005. This discovery suggests that Mother of Green may not be a cyanobacterium but an α proteobacterium. 4.4 Summary In this chapter, we detail the data and services integrated in the MoGServ system to support deep phylogenetic investigations. We describe one case study of a phylogenetic investigation 9. This case study shows that the investigator is able to gather data and perform more advanced data analysis, leading to the discovery of new knowledge, using the web-based environment and services provided by the MoGServ system. 9 The case study on the use of MoGServ for a phylogenetic investigation was conducted in collaboration with Professor Jeanne Romero-Severson [78], Department of Biological Sciences, University of Notre Dame, and partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund.

Figure 4.2. Entity relationship diagram of the data model in MoGServ created by SQL::Translator

CHAPTER 5 ONTOLOGICAL REPRESENTATION MODEL MoG (Mother of Green), a project involving the deep phylogeny of plastids, includes the development of a system (MoGServ) to enable life scientists to easily aggregate heterogeneous data and conduct data analysis using the growing array of web-based scientific databases and analysis tools. MoGServ, a SOA-based data integration environment, is built using current web service technology and existing middleware for life sciences research. Based on the successful design and implementation of this prototype, in this chapter we present an enhanced system with semantic annotation of services and data. The enhancement aims at allowing life science researchers to define their experiments at different levels, based on their knowledge of the tools, data, and the system. Semantically enriched data allows easier reuse, sharing, and searching of experiments. While service-oriented architecture is used in the implementation of e-Science infrastructure, semantic web technology is gaining increasing interest for annotating life science and medical information [12]. For example, the UniProt RDF project 1 provides all UniProt protein sequence and annotation data in RDF. These efforts make the vision of the semantic web [7] more 1 http://dev.isb-sib.ch/projects/uniprot-rdf/

practical. Other open source projects, such as Haystack 2 and SIMILE 3, aim at delivering these semantically annotated data to web browsers. The appearance of open source tools that support the semantic web and service-oriented computing encourages the life science community to provide their data and analysis tools, and to share scientific experiments, using these technologies. 5.1 The MoG life sciences project and biomedical application As part of the Mother-of-Green (MoG) project 4, we are developing scientific workflow tools (MoGServ) that enable end-user-composed semantic web services to increase the interoperability of the growing array of web-based life science databases and analysis tools. These workflow tools are built from available and emerging open-source, open-standards technology. The prototype problem domain that guides this project, the phylogenomics of the plastid, includes genomic, transcriptomic, and proteomic data. Plastids are hypothesized to be descendants of cyanobacterial ancestors captured by eukaryote hosts. As more cyanobacterial and plastid genomes are sequenced, information accumulates that could shed light on plastid genomics and phylogeny. One of the major plagues of humankind, malaria, is caused by a parasite containing a plastid: Plasmodium falciparum. A new pharmaceutical drug that disrupts the function of this plastid (the apicoplast) might be harmless to humans, who, like all animals, have no plastids. Examination of the genes, the linear order of the genes, the proteins, and the temporal order of protein expression of related organisms can suggest possible 2 http://haystack.lcs.mit.edu/ 3 http://simile.mit.edu/ 4 http://www.nd.edu/ mog/

apicoplast functions. The problem is the accurate identification of relatives or even closely related plastid genes of known function. At present, the phylogeny of the apicoplast is not clear. A phylogenomics approach requires the extraction and analysis of genomic information from diverse scientific disciplines: plant, algal and cyanobacterial systematics, plant biochemistry, animal parasitology, genetics and cell biology. This phylogenomics investigation provides software design use-cases, testing, and an opportunity for the evaluation of scientific workflow composition tools and technology. 5.2 Ontological representation model Metadata about services, sequences, and users' experimental results is captured in MoGServ in order to facilitate information inquiry, both from application developers searching for appropriate services and from end-users keeping track of their in-silico experiments. The inquiry system in the prototype is initially based on a keyword search method, for ease of implementation. With the prospect of hosting MoGServ at multiple sites in the phylogenetic research community, applying the semantic web approach to representing the metadata allows for much more focused and structured queries, and the possibility to answer questions based on logical inference rather than text associations. An ontology that describes the concepts relevant to a given domain, along with properties characterizing these concepts, can meet these requirements. By relying on shared ontologies and agreements on the definition of common concepts, data and information can be annotated using the shared vocabularies in these ontologies. Since most semantic web service standards are not yet fully mature and stable, we build an application-specific ontology using a distributed and modularized

ontology structure, and reuse some cross-domain ontologies such as the Dublin Core (http://dublincore.org) and other well-defined bioinformatics ontologies. The use of well-defined ontologies could potentially increase interoperability when information is published on the web. There are three ontology sets that are clearly differentiated in the system: the MoG application domain ontology, which is used to represent concepts and information unique to the MoGServ system, such as jobs, sequence collections, etc.; the generic service description ontology, such as OWL-S, which is used to specify generic web service concepts such as service inputs, outputs, preconditions, and effects; and the service domain ontology, which is designed and used for the semantic description of web services in the bioinformatics domain. 5.2.1 RDF, OWL, and DIG reasoner The Resource Description Framework (RDF) (http://www.w3.org/TR/rdf-primer/) has been proposed as a W3C standard to enable distributed knowledge representation on the Semantic Web. It is a graph model of statements that encode the metadata description of web resources, people, places, and other concepts. RDF is based on the idea of identifying things using Uniform Resource Identifiers (URIs), and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, their properties, and values. An RDF graph is a set of triples. Each triple consists of a subject (start node), a predicate (edge), and an object (end node). A fact is expressed as a Subject-Predicate-Object triple, also known as a statement. A triple can be written as P(S, O), that is, a subject

S has P (predicate or property) with value O. RDF/XML and Notation 3 (N3) are two formats for representing RDF models. Figure 5.1 is an RDF graph model that represents some information describing the MoG project web site. Facts are expressed as subject-predicate-object triples:

<http://www.nd.edu/~mog> <#hasCreator> <#gmadey>
<#gmadey> <#hasFullName> "Gregory Madey"

Note: # represents some URI prefix. The RDF/XML representation:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://someexample.org#">
  <rdf:Description rdf:about="http://www.nd.edu/~mog">
    <ex:hasCreator rdf:resource="ex:gmadey"/>
  </rdf:Description>
  <rdf:Description rdf:about="ex:gmadey">
    <ex:hasFullName>Gregory Madey</ex:hasFullName>
  </rdf:Description>
</rdf:RDF>

[Figure 5.1 shows the graph: the resource http://www.nd.edu/~mog has the properties #hasTextDescription ("MoG is a project"), #hasFundedBy (#foundation), #hasResearchTopic (#bioinformatics), and #hasCreator (#gmadey); #gmadey in turn has #hasFullName ("Gregory Madey"), #hasTitle (#professor), and #hasPersonalSite (http://www.nd.edu/~gmadey). Literals are distinguished from resources, and # URIs provide the definitions of the vocabulary.]

Figure 5.1. An RDF graph model representing some information describing the MoG project web site
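The triple model P(S, O) can be mirrored with a minimal sketch in plain Python (no RDF library); the resources and properties below are those of the Figure 5.1 example, and the `match` helper is an illustrative stand-in for a real triple store's pattern query:

```python
# An RDF graph as a set of (subject, predicate, object) triples.
graph = {
    ("http://www.nd.edu/~mog", "#hasCreator", "#gmadey"),
    ("#gmadey", "#hasFullName", "Gregory Madey"),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {(gs, gp, go) for (gs, gp, go) in graph
            if (s is None or gs == s)
            and (p is None or gp == p)
            and (o is None or go == o)}

# Who created the MoG site?
creators = match(graph, s="http://www.nd.edu/~mog", p="#hasCreator")
print(creators)  # {('http://www.nd.edu/~mog', '#hasCreator', '#gmadey')}
```

This wildcard-pattern query is the same idea that RDF query languages express declaratively over much larger graphs.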

RDF Schema is a mechanism that allows developers to define a particular vocabulary specifying the kinds of objects to which predicates can be applied. Pre-defined terms such as Class, subClassOf, and Property establish an agreement on the semantics of specified terms and the interpretation of given statements. The Web Ontology Language (OWL) (http://www.w3.org/TR/owl-ref/) is one type of ontology language available for describing semantic web information, and is more complex and powerful than RDF Schema. It is built on top of the RDF graph model, with better capabilities for describing the relationships among resources and their properties (http://jena.sourceforge.net/ontology/index.html). The OWL language is divided into three syntax classes: OWL Lite, OWL DL and OWL Full. Classes (concepts), properties (roles, relationships), and individuals (instances) are the three components of the OWL language. Consider the interpretation of a domain of knowledge using an interpretation function I. The domain knowledge is represented with a number of concepts C^I ⊆ D^I. Each concept may contain a number of individuals, and an individual i^I ∈ D^I may belong to different concepts. The relationship between two individuals is represented as a role R^I ⊆ D^I × D^I. Web data can be related by using the definitions of these concepts. Jena (http://jena.sourceforge.net) is a semantic web framework for the creation of RDF and OWL models, as well as a common interface for parsing and reasoning. Protege (http://protege.stanford.edu/) is a free, open source ontology editor and knowledge-base framework that supports two main approaches to modeling ontologies, via the Protege-Frames and Protege-OWL editors. An OWL DL ontology can be translated into a description logic

representation, which is a decidable fragment of First Order Logic (FOL); a logic is decidable if computations or algorithms based on it terminate in finite time. A Description Logic reasoner can perform automated reasoning over an ontology, such as computing the inferred superclasses of a class, determining whether or not a class is consistent, and deciding whether or not one class is subsumed by another (subsumption reasoning). Pellet (http://pellet.owldl.com/), FaCT/FaCT++ (http://owl.man.ac.uk/factplusplus/), Racer/RacerPro (http://www.racer-systems.com/), and KAON2 (http://kaon2.semanticweb.org/) are four popular DL reasoners. The DIG interface (http://dig.sourceforge.net/) specifies a common interface for DL reasoners. A DIG-compliant reasoner is a DL reasoner that provides a standard access interface (the DIG interface), which enables the reasoner to be accessed over HTTP using the DIG language. Jena and Protege-OWL provide APIs that can be used to interact with any external DIG-compliant reasoner without requiring developers to have detailed knowledge of the reasoner. 5.2.2 Generic service description ontology OWL-S (http://www.w3.org/Submission/OWL-S/) is an OWL-based ontology for the semantic representation of services. It is a complex and rich model that includes the representation of both atomic services and composite services, as well as complicated control flow and data flow. Most of the current open-source APIs, editors, and annotation tools at this stage only partially support the OWL-S service model, having primary focus on the

OWL-S service profile and service grounding. Annotating a service with the OWL-S model is a non-trivial task, even with support from annotation tools such as the SRI OWL-S editor (http://owlseditor.semwebcentral.org/). The Feta [50] data model is used for the semantic description of services in the myGrid project. Web services can be annotated using terms in an OWL-based myGrid domain ontology [103] with a GUI-based interface, Pedro [33]. This approach is more lightweight than the OWL-S approach. Although OWL-S provides more support for the automation process, especially since its definition of preconditions and effects allows the possible application of AI planning technologies, it is difficult to utilize its full functionality. The Feta data model has limited expressivity, but it is sufficient for describing most services, and its simplicity makes it more practical for describing a large number of services. We believe it is more practical to use the Feta model for service and workflow description at this stage. Since the semantic representation model in the system is modularized, it is easy to convert to an OWL-S representation when the tools and APIs that support OWL-S become more stable and mature. 5.2.3 Service domain ontology The service domain ontology should be generic enough to provide the concepts needed by any web service in a certain domain, and rich enough to represent the available knowledge for performing complex reasoning. The service domain ontology plays an important role in the automation of service discovery. However, building such a quality domain ontology is a challenging task. Sabou et al. [80] present an automatic method that learns a domain ontology, for the purpose of

web service description, from the natural language documentation of web services. It provides a guideline and tool for domain experts to inspect a large number of web services in a certain domain in order to build a high quality generic ontology. BioMOBY's object ontology, MOBY-S (http://biomoby.org/resources/moby-s/objects), contains concepts related to the data formats and data types usually used in bioinformatics. There are no restrictions on complex relationship definitions in the ontology. It serves as a common vocabulary collection that can be used to define services that accept a particular type of data, in a certain format, as their input/output. The myGrid ontology (http://www.mygrid.org.uk/ontology) describes the bioinformatics research domain and the dimensions along which a service can be characterised from the perspective of the scientist. The scope of the ontology is limited to supporting service discovery. Descriptions of services are constructed to present their properties, such as what the service does, what data sources it accesses, and what domain specific methods the analysis involves. Each hierarchy contains abstract concepts to describe the bioinformatics domain at a high level of abstraction. By describing the domain of interest in this way, users should be able to find appropriate services for their experiments from a high level view of the biological processes they wish to perform on their data. 5.2.4 MoG application domain ontology The MoG application domain ontology augments the two ontology sets described above, representing concepts that only exist in the MoGServ system, including jobs, collections of sequences, etc. The ontology definition provides vocabulary to annotate services that use data types and data formats not available

elsewhere. It also allows the annotation of experimental data, permitting users to keep track of their data. The MoG application domain ontology also represents the interactions between end-users and the system. Sequence, SequenceSet, and Job are the three main concepts in a MoG application. The MoGServ system contains a local database that stores integrated sequences of scientific interest from multiple public databases, along with private data from the life scientists' own laboratory experiments. One activity a scientist may often need to perform is to query the local MoGServ database to get a collection of sequences supporting a particular research investigation, and use this collection for subsequent data analysis. We also define other concepts, User, Input, Output, and Privacy, to annotate the access permissions for data sets. For example, if a piece of data uploaded from a scientist's lab experiment is not intended to be published at a given point, it should be restricted to use by authorized persons only. The ontology is defined in OWL using Protege. Each concept has two main types of properties: object properties and datatype properties. An object property represents a relationship between two individuals in the domain. A datatype property links an individual to an XML Schema data type or an RDF literal. Figure 5.2 demonstrates the main concepts and relationships defined in the MoG application domain ontology. The Sequence class has multiple properties: 1) hasSequenceId, a unique identifier of the sequence (an identifier may be in the Life Science Identifier (LSID) format); 2) hasSequenceName, a string of XML data type; 3) hasTaxonomy, a datatype property with a string of XML typed data. Each individual of the Sequence class may either be retrieved from a public database or uploaded by

Figure 5.2: Main concepts and partial relationships defined in the MoG application domain ontology

scientists from their own laboratory experiments. The SequenceSet class has the property isChildOf, a functional property, which means there can be at most one individual related to a given individual via the property. A sequence set can only be a child of one sequence set, no matter how the sequence set was created. A sequence set can have multiple child sequence sets. A sequence set can be a sibling of another sequence set only when the other sequence sets are also generated from the setfilter service. The property isSiblingOf is a symmetric property. The existential restriction (hasSequence some Sequence) indicates a necessary condition for an individual if it belongs to the class SequenceSet. The Job class has execution time properties such as submittedAt, startedAt, and finishedAt. This information provides data for measuring Quality of Service (QoS). It also provides information for end-users to monitor their job execution.
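The property characteristics just described can be checked mechanically. The following sketch treats isChildOf as a set of (child, parent) pairs and isSiblingOf as a symmetric relation; the sequence set identifiers are hypothetical, not actual MoGServ data:

```python
# isChildOf is functional: each sequence set has at most one parent,
# i.e., no subject appears twice in the (child, parent) pair set.
def is_functional(pairs):
    subjects = [s for s, _ in pairs]
    return len(subjects) == len(set(subjects))

# isSiblingOf is symmetric: (a, b) in the relation implies (b, a).
def symmetric_closure(pairs):
    return set(pairs) | {(o, s) for s, o in pairs}

is_child_of = {("set2", "set1"), ("set3", "set1")}  # two children, one parent each
print(is_functional(is_child_of))                   # True

# Asserting that set2 is a sibling of set3 lets the reverse pair be inferred.
is_sibling_of = symmetric_closure({("set2", "set3")})
print(("set3", "set2") in is_sibling_of)            # True
```

A DL reasoner performs exactly these kinds of checks (and the corresponding inferences) from the OWL property declarations, without such hand-written code.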

5.3 Implementation Given a well-defined domain ontology, associated services, workflows, and the data products generated from these, the services and workflows can be annotated using a common vocabulary. The metadata with semantic annotation is stored in an RDF repository. From a number of RDF storage packages, we chose Sesame 1.2.6 (http://www.openrdf.org/) as the repository. Sesame is an open source Java framework for storing, querying and reasoning with RDF and RDF Schema. Using RDF as the main storage and exchange method makes knowledge in the field portable to other applications and readable by machines as well as by humans. The annotation, in RDF/XML format, of one service provided in the MoGServ system is shown below and displayed in Figure E.5. It is a service that accepts a sequence set id and sequence type as input parameters, executes a ClustalW sequence analysis, and returns the result.

<rdf:RDF xmlns:mygrid="http://www.mygrid.org.uk/ontology#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:mog="http://almond.cse.nd.edu:10000/mog#">
  <mygrid:service>
    <mygrid:hasOperation>
      <mygrid:operation>
        <mygrid:isFunctionOf>
          <mygrid:operationApplication>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#aligning"/>
          </mygrid:operationApplication>
        </mygrid:isFunctionOf>
        <mygrid:outputParameter>
          <mygrid:sequence_alignment_report>
            <mygrid:mygInstance rdf:resource=
              "http://www.mygrid.org.uk/ontology#sequence_alignment_report"/>
            <mygrid:hasParameterDescriptionText>ClustalW alignment file
            </mygrid:hasParameterDescriptionText>
            <mygrid:hasParameterNameText>filename</mygrid:hasParameterNameText>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#parameter"/>
          </mygrid:sequence_alignment_report>
        </mygrid:outputParameter>
        <mygrid:usesResource>
          <mygrid:operationResource>
            <rdf:type rdf:resource=
              "http://www.mygrid.org.uk/ontology#sequence_database"/>
          </mygrid:operationResource>
        </mygrid:usesResource>
        <mygrid:performsTask>
          <mygrid:aligning>
            <rdf:type rdf:resource=
              "http://www.mygrid.org.uk/ontology#operationTask"/>
          </mygrid:aligning>
        </mygrid:performsTask>
        <mygrid:hasOperationNameText>runClustalWDF</mygrid:hasOperationNameText>
        <mygrid:inputParameter>
          <mog:set>
            <mygrid:mygInstance rdf:resource=
              "http://almond.cse.nd.edu:10000/mog#set"/>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#parameter"/>
            <mygrid:hasParameterNameText>setId</mygrid:hasParameterNameText>
          </mog:set>
        </mygrid:inputParameter>
        <mygrid:inputParameter>
          <mygrid:parameter>
            <mygrid:mygInstance rdf:resource=
              "http://www.mygrid.org.uk/ontology#biological_sequence"/>
            <rdf:type rdf:resource=
              "http://www.mygrid.org.uk/ontology#biological_sequence"/>
            <mygrid:hasParameterNameText>sequenceType</mygrid:hasParameterNameText>
          </mygrid:parameter>
        </mygrid:inputParameter>
      </mygrid:operation>
    </mygrid:hasOperation>
    <mygrid:hasServiceNameText>mog:service:clustalw</mygrid:hasServiceNameText>
    <mygrid:locationURI rdf:resource=
      "http://almond.cse.nd.edu:10000/axis/services/clustalw?wsdl"/>
    <mygrid:hasServiceType>WSDL</mygrid:hasServiceType>
    <mygrid:publishedBy>
      <mygrid:organisation>
        <mygrid:hasOrganisationNameText>MoG</mygrid:hasOrganisationNameText>
        <mygrid:hasOrganisationDescriptionText>MoG
        </mygrid:hasOrganisationDescriptionText>
      </mygrid:organisation>
    </mygrid:publishedBy>
    <mygrid:hasServiceDescriptionText>This is a service that accepts setId and
      sequenceType as parameters and returns the name of the alignment report
      stored in the local database</mygrid:hasServiceDescriptionText>
    <mygrid:hasServiceDescriptionLocation>http://almond.cse.nd.edu:10000
      /axis/services/clustalw?wsdl</mygrid:hasServiceDescriptionLocation>
  </mygrid:service>
</rdf:RDF>

All the data sets stored in the local database are generated by a service or a workflow. The annotation of experimental data is performed through services provided in the MoGServ system. These services are invoked automatically when an individual is created. Each sequence, set of sequences, and job is identified with an LSID. The Life Science Identifier (LSID) (http://lsid.sourceforge.net/) is a special kind of Uniform Resource Name (URN) for biological entities. The LSID concept defines an approach for naming and identifying data resources stored in multiple, distributed data stores. Since adoption of the LSID in the life sciences is increasing, using it as an identifier for experimental data gives our system the extensibility to publish those data. We implement a number of software components to annotate and query metadata, including job information and service/workflow descriptions (see Figure 5.3). The query components embed queries in the SeRQL (Sesame RDF Query Language) format supported by Sesame. 5.4 Conclusion In this chapter, we present an ontological model that is used to semantically annotate data and services in the MoGServ system. This ontological model contains three ontology sets: the MoG application domain ontology, the generic service description

[Figure 5.3 shows the ontological modules (the MoGServ application domain ontology, the service domain ontology (myGrid), and the generic service description ontology (myGrid/Feta model)), whose vocabularies are used for annotation and querying: annotation components, driven by annotation templates for service workflows and for data, populate the RDF store; query components, driven by query templates, retrieve results from it.]

Figure 5.3. The software components implementation of annotating and querying metadata

ontology, and the service domain ontology. Using a distributed and modularized ontology structure and reusing well-defined ontologies could potentially increase interoperability when the data generated from MoGServ is shared with other researchers. At this stage, the developed MoG application domain ontology simply serves as a common vocabulary definition to capture the relationships among data sets, sequences, jobs, and other properties related to these three concepts. The Feta data model is used to annotate services in MoGServ. Compared to the table and index based metadata search method, the semantically annotated experimental data provides a better, more flexible approach for users to search and share their experiments. However, annotating the metadata accurately and efficiently remains the major difficulty in applying the ontological model.

CHAPTER 6 IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW Most current practical methodologies and workflow systems for service composition and workflow creation in e-science pursue a semi-automatic way to allow users to discover and select appropriate services to include in a workflow, based on semantic and conceptual service definitions. This effort reduces the load on users of needing detailed knowledge and understanding of each tool, service, and data type. However, few of these approaches consider the potential for reuse: to share the knowledge gained during the service composition process, and to completely or partially reuse existing workflows. We believe that providing a capability for reuse of this knowledge and these workflows could be an important component in a workflow system. In this chapter [109], we present a methodology and an enhanced system design to facilitate the reuse of knowledge and workflows. It contains 1) a hierarchical workflow structure representation, 2) knowledge management and knowledge discovery components to capture and manage the reusable knowledge in a system, and 3) an approach for using a graph matching algorithm to discover similar workflows. 6.1 Introduction As more data, analysis tools, and other resources are delivered as services on the web, the major benefit of adopting service-oriented architecture in e-science

is that of allowing scientists to describe and enact their experimental processes by orchestrating distributed and local services into a workflow. Service orchestration, also called service composition, is a difficult and complex task. It often involves choosing a set of appropriate services based on the functional and non-functional properties of the services, ordering them in sequence, resolving connectivity between the services, and converting the complex process into a target workflow language that can be deployed and invoked on a platform. Over the past several years, much research has been done on approaches for service discovery and composition in order to achieve the goal of seamless web service composition [58]. These approaches range from the adoption of industry standards to the adoption of semantic web technology, and from manual or static composition to automatic dynamic composition [90]. A significant portion of the work aims at automating discovery and composition by combining ontological annotation of services and AI planning technology. In the literature, the demonstration of these approaches is largely applied to virtual travel agencies or small, well-defined domains. Applying these approaches to larger, more complex and less well-defined applications can be difficult, especially before a complete, strong ontological agreement is established in the application domain or across multiple domains. Most current practical methodologies for service composition or workflow creation employ a semi-automatic design that allows users to discover and select appropriate services to include in a workflow based on semantic and conceptual service definitions. This partially lifts from users the load of requiring detailed knowledge and understanding of each tool, service, and data type. In the meantime, it increases the complexity of building such middleware to support workflow

creation at a higher level of abstraction. Mediator, shim, and adaptor technologies [74] are applied to resolve the connectivity between services. Several workflow management systems and service-oriented middleware, such as Pegasus [34], myGrid/Taverna [65], Kepler [52], and Triana [96], have been developed with the intent of streamlining workflow design, execution, monitoring, and re-running. Most of these systems and approaches provide users an environment to compose services from scratch, with the aim of more accurately choosing appropriate services in consideration of semantic matching and quality of service (QoS). Fewer of them consider the potential for reuse and sharing of the knowledge gained during the service composition process, or the reuse of complete or partial existing workflows. We believe that providing a capability to reuse this knowledge and these workflows is an important component in such a system. This reusability will lead to a more efficient and more structured composition process that will accelerate rapid application development. It will provide more valuable guidelines to assist users with their workflow creation, using knowledge that has been gained and verified by others. Reuse of the verified knowledge will potentially increase the correctness of composed workflows and reduce the errors that may be caused by misannotation, inaccurate annotation, and incomplete annotation of services. The requirement of complete information about the world poses challenges to applying traditional AI planning technologies in the service composition process, since it is neither feasible nor possible to collect all the information needed to form a complete initial state of the world [46, 58]. The knowledge gradually gathered in the system during the service composition process may help accumulate more complete information for an AI planner.

In this chapter, we present a methodology and an enhanced system design to facilitate the reuse of knowledge and workflows. It contains a hierarchical workflow structure, knowledge management and knowledge discovery components to capture and manage the reusable knowledge in a workflow system, and an approach for using a graph matching algorithm to discover similar workflows. The methodology proposed is being used in the design and implementation of a service-oriented system for supporting bioinformatics research. 6.2 A hierarchical workflow structure We define a hierarchical workflow structure that contains four levels of representation (see Figure 6.1): abstract workflow, concrete workflow, optimal workflow, and workflow instance.

[Figure 6.1 depicts the four levels: an abstract workflow (Task A, Task B) is encoded and converted from its high-level definition into a low-level executable concrete workflow (Services A-D); individual services are then replaced with their optimal alternatives to form an optimal workflow; finally, a workflow instance is an invocation with specific input data, recording the data provenance and the performance of services and workflows.]

Figure 6.1. A four level hierarchical workflow structure representation and transformation of scientific processes

Abstract workflow is a definition of a scientific process with emphasis on the analytical operations or functions to be performed, rather than on the mechanisms for performing these operations. Concrete workflow is a definition of a number of tasks represented as actual executable services. A concrete workflow can be converted to a specific workflow language and sent to a workflow engine to be executed. Optimal workflow is a concrete workflow in which individual executable services are replaced by the alternatives with the highest quality. Workflow instance is an actual run of a concrete workflow or optimal workflow, with input data and generated output data. Users can use a GUI-based interface to define an abstract workflow by dragging and dropping high level abstracted components provided in the system. An alternative way is to define an abstract workflow using the standardized syntax, vocabularies, and semantics developed in their scientific communities. Users logically create each task in terms of the functions they wish the task to accomplish. The translation of an abstract workflow into a concrete workflow is a process of discovering suitable services that implement these functions and resolving the connectivity between services. The optimization of a concrete workflow into an optimal workflow is a process of ranking services based on a set of metrics and selecting an optimal service to replace each service in the workflow. A concrete workflow can be invoked repeatedly with different input parameters. Since a scientific process is a process for discovering new knowledge, keeping track of the source of a workflow result can be as important as the result itself.
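These successive transformations can be sketched in a few lines. The registry contents and quality scores below are hypothetical: querygene and clustalw are services that appear elsewhere in this work, while muscle is an invented alternative aligner used only to show the optimization step:

```python
# Abstract -> concrete: bind each task (a function name) to one executable service.
# Concrete -> optimal: replace each bound service with its highest-quality alternative.
abstract_workflow = ["retrieving", "aligning"]       # analytical operations only

registry = {                                         # function -> candidate services
    "retrieving": ["querygene"],
    "aligning":   ["clustalw", "muscle"],
}
# Quality scores, e.g. derived from provenance records of past runs (hypothetical).
quality = {"querygene": 0.9, "clustalw": 0.7, "muscle": 0.8}

concrete_workflow = [registry[task][0] for task in abstract_workflow]
optimal_workflow = [max(registry[task], key=quality.get) for task in abstract_workflow]

print(concrete_workflow)  # ['querygene', 'clustalw']
print(optimal_workflow)   # ['querygene', 'muscle']
```

A workflow instance would then be one invocation of the optimal workflow with concrete input data, with its inputs, outputs, and timings recorded as provenance.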

Data provenance is metadata recording the process of experiment workflows, annotations, and notes about experiments. It provides significant added value in such data intensive e-science [83]. Many data provenance systems in e-science have focused on recording the data from which a data product evolved and the process of transformation of these data, i.e., input data, output data, and process. This may include information on the running time and failure rate of each running instance of a workflow; these can provide measurements for profiling the quality of services and workflows. This information can be used to assist the workflow optimization process. Several benefits are provided by this hierarchical workflow structure definition: It allows users to define workflows at different levels of abstraction. Less experienced users may define a workflow in terms of the functions they wish a task to perform. Intermediate users may define a workflow with more detailed properties of each task, such as the algorithm and data source they may want to use. Expert users may be able to define a workflow in an ad-hoc manner by choosing appropriate executable services and forming a workflow with appropriate logic. An example is shown in Figure 6.2. Users would like to conduct an experiment to determine if the gene genealogies for ATP subunits α, β, γ are different. A less experienced user may define a workflow with two tasks, retrieving and aligning. An intermediate user may have knowledge of two particular services (querygene, clustalw) that should be used in the workflow in order to perform each task. An expert bioinformatician may know that in order to get more accurate results, it is necessary to encapsulate a service (setfilter) to compute the intersection of all the organisms in the sequence sets.
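The intermediate and expert versions of this example also illustrate why a graph matching algorithm is needed for workflow reuse: the two workflows use overlapping services but different wiring. A minimal node/edge overlap score makes this visible; it is a simplified stand-in for full graph matching, and the single-branch edge sets below are an assumption, not the exact figure topology:

```python
# Workflows represented as sets of directed (service, service) edges.
def jaccard(a, b):
    """Set-overlap score: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b) if a | b else 1.0

workflow_intermediate = {("querygene", "clustalw")}
workflow_expert = {("querygene", "setfilter"), ("setfilter", "clustalw")}

nodes_i = {n for edge in workflow_intermediate for n in edge}
nodes_e = {n for edge in workflow_expert for n in edge}

print(jaccard(nodes_i, nodes_e))                       # 2/3: mostly the same services
print(jaccard(workflow_intermediate, workflow_expert)) # 0.0: wired differently
```

A repository search combining both scores could return the expert workflow as a close match for the intermediate one, even though no edge is shared.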

[Figure 6.2 shows three user-defined workflows, from different views, for the question "are gene genealogies for ATP subunits different?": Workflow A, defined by a less experienced user using the functional definition of services (retrieving, aligning); Workflow B, defined by an intermediate user with executable services (querygene feeding clustalw for each subunit); and Workflow C, defined by an expert user with two extra executable services (setids, setfilter) to ensure the accurate output of the biological process.]

Figure 6.2. An example illustrating the user-oriented workflow definition with different levels of knowledge

It allows the transformation of workflows in semi-automatic or automatic ways. The transformation from an abstract workflow to a concrete workflow can be completed by an expert bioinformatician with assistance from a service discovery agent provided in the system. The myGrid/Taverna [65] workbench provides users not only a visual workflow building tool but also supports the annotation and discovery of services using an ontology. IRIS [74] provides an approach to create, discover, and manage adapters (mediators) that are intended to glue two bioinformatics services together with appropriate data transformation, identifier mapping, and so forth. The BioMoby [44] project integrates access to many of BioMoby's features into the Taverna interface in the form of a Taverna plugin. Users are guided through the construction of syntactically and semantically correct workflows through plug-in calls to the Moby Central registry.
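A shim of the kind IRIS manages can be as small as a representational conversion between two services' interfaces. The sketch below is hypothetical, not drawn from IRIS or MoGServ: it assumes an upstream service emits a list of sequence identifiers while the downstream service expects one comma-separated string:

```python
# A shim (adapter) bridging a format mismatch between two services:
#   upstream output:   ["AT1G01010", "AT1G01020"]   (hypothetical identifiers)
#   downstream input:  "AT1G01010,AT1G01020"
def id_list_to_csv_shim(ids):
    """Convert a list of sequence identifiers to the comma-separated form."""
    return ",".join(ids)

upstream_output = ["AT1G01010", "AT1G01020"]
downstream_input = id_list_to_csv_shim(upstream_output)
print(downstream_input)  # AT1G01010,AT1G01020
```

In a workflow system such a shim is inserted automatically between two services whose annotated output and input types are semantically compatible but syntactically different.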

The transformation from the concrete workflow to the executable workflow can be completed automatically by ranking services and choosing optimal ones. Most previous work binds quality-of-service information to the translation process, resulting in more sophisticated and complex composition methods. Since most measurements of service quality change dynamically, such a tightly coupled representation and composition method does not adapt easily to these changes. Separating the optimal workflow from the concrete workflow allows the easy integration of Grid computing technology to address resource allocation and security issues for data and computation resources.

Allows the full or partial reuse of workflows defined at different levels. Reuse of a workflow may occur when users need to replicate their data sets or rerun the same workflow with different input data. For example, consider a scientist who is interested in a data set generated by a given workflow. Using the recorded data provenance, the corresponding concrete workflow that was used to generate this data set can be discovered. The concrete workflow can then be reoptimized and invoked with different input data. Reuse of a workflow may also occur during workflow design. For example, a scientist may have a high-level or partial representation of a workflow; searching the workflow repository may return a number of similar workflows at the abstract and/or concrete level. The scientist may choose a candidate to reuse, or modify it to meet the goal.

6.3 An enhanced workflow system

A general workflow system contains most of the components illustrated in Figure 6.3 to support the semi-automated workflow composition process.

Ontologies serve as a common vocabulary for the semantic annotation of services and data in the system. The semantics-enabled service registry is responsible for storing the semantic and syntactic information of services and for answering inquiries; the semantic information can be provided by service providers or by third-party annotation. The workflow composer discovers appropriate services and resolves the connectivity between services; it is also responsible for converting the workflow into a workflow language that can be executed on a workflow engine. Data provenance management keeps track of the origination of data products.

Few workflow systems have the capability to reuse the knowledge gained during the service discovery, service composition, and service invocation processes. We add two components, knowledge discovery and knowledge management, to the workflow system and discuss how this knowledge can be used over time to provide more accurate guidelines to users.

As most current semantic web services standards are relatively mature and stable, the ontology model used in a system is built upon a distributed and modularized ontology structure and reuses cross-domain ontologies such as Dublin Core (http://dublincore.org). The use of a well-defined ontology could potentially increase the interoperability of information published on the web. The ontology model used in a system normally contains two modules:

generic service description ontology, such as OWL-S, is an ontology module used to specify generic web service concepts including service inputs, outputs, preconditions, and effects;

[Figure 6.3. An enhanced workflow system with two added components, knowledge management and knowledge discovery. A user creates an abstract workflow using the ontology; the workflow composer (a software agent or experienced users) finds appropriate services via semantics-enabled service discovery and service matchmaking against the semantics-enabled service registry, supported by a DL reasoner, the knowledge discovery component, and the knowledge base management component; the resulting concrete workflow runs on the workflow execution engine; data provenance management collects and manages information about data origination; a service annotator annotates services using the ontology.]

service domain ontology is an ontology module designed and used for the semantic description of web services in a particular domain, normally represented in OWL-Lite or OWL-DL.

We define a service in our system as a tuple with several important attributes:

service_i(description_i, operation_i, ...) — a service contains a text description of its features, a set of operations (must not be empty), and other attributes;

operation_ij(description_ij, input_ij, output_ij, quality_ij, performtask_ij, ...) — an operation in a service contains a text description of its features, a set of input parameters (may be empty), a set of output parameters (may be empty), a set of quality metrics, a semantic description of its features using vocabulary from the service domain ontology, and others;

parameter_k(semantic_k, datatype_k) — a parameter contains a semantic description using vocabulary from the service domain ontology, and the data type.

The semantic annotations of services and workflows can be represented as an RDF model and stored in an RDF repository.

6.3.1 Knowledge management

The knowledge management component is responsible for collecting, analyzing, and handling inquiries on the knowledge base. The knowledge base holds information gathered incrementally during the workflow translation and service composition processes. This information provides increasingly accurate guidelines for users over time. Four types of information are classified:

- Connectivity of services. A concrete workflow can be viewed as a graph with a number of services linked in a certain order and logic. Each node in the workflow is an operation of an executable service. In a simple case, two nodes are connected if an output parameter of one operation maps to an input parameter of another operation based on their syntactic and semantic descriptions.

Rule 1: if parameter_k ∈ output_ij and parameter_o ∈ input_mn and datatype(parameter_o) = datatype(parameter_k) and semantics(parameter_o) = semantics(parameter_k), then operation_ij → operation_mn.

Rule 2: if operation_ij → operation_mn, then service_i → service_m.
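The two connectivity rules above can be sketched as a simple matching procedure. The following is a minimal illustration, not the system's actual implementation; the record types and the example services (querygene, clustalw) with their parameters are hypothetical stand-ins for annotated registry entries.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Parameter:
    semantics: str   # concept from the service domain ontology
    datatype: str

@dataclass
class Operation:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

def operations_connect(op_a: Operation, op_b: Operation) -> bool:
    """Rule 1: op_a -> op_b if some output of op_a matches some input
    of op_b on both data type and semantic annotation."""
    return any(out.datatype == inp.datatype and out.semantics == inp.semantics
               for out in op_a.outputs for inp in op_b.inputs)

def services_connect(svc_a: list, svc_b: list) -> bool:
    """Rule 2: service_i -> service_m if any pair of their operations connects."""
    return any(operations_connect(a, b) for a in svc_a for b in svc_b)

# Hypothetical example: a gene-retrieval operation feeding an alignment operation.
seq_set = Parameter(semantics="sequence_set", datatype="FASTA")
querygene = Operation("querygene", inputs=[Parameter("gene_name", "string")],
                      outputs=[seq_set])
clustalw = Operation("clustalw", inputs=[seq_set],
                     outputs=[Parameter("alignment", "ClustalW_report")])

print(operations_connect(querygene, clustalw))  # True
```

Note that the match is directional: clustalw's output does not match querygene's input, so only the edge querygene → clustalw would be added to the connectivity graph.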

While the composability of services can be determined by the simple rules above, it can also be identified using more complex models [55]. The connectivity between two services can be identified automatically, based on the rules defined above, when a new service is added to the system. This is a computationally intensive process when the number of services in the system and the number of parameters per operation are large. Also, incorrect identification of the connectivity between two services is most likely caused by misannotation of services or an incomplete ontological model. Therefore, during the translation process, the connectivity structure should be refined and updated based on human judgment. After a concrete workflow is created and verified, the connectivity of the services in the workflow can be added to the system. As time goes by, the connectivity of services in the system forms a graph of the knowledge space. A vertex in the graph can be represented as (service_i, operation_ij, parameter_ijk), or as (service_i, operation_ij) if an operation has no parameters, and an edge represents the connectivity of two vertices.

- Alternativity of services. In the context of our research, we define service_i as an alternative of service_m if for every operation_ij ∈ service_i there is an operation_mn ∈ service_m whose syntactic and semantic descriptions are the same except for the quality properties. For example, two services that implement the same WSDL interface are alternatives of each other. These two services may implement the WSDL interface using different underlying technologies, charging different fees, and exhibiting different performance.

The execution of workflows and services takes place in a distributed computing environment. The execution may fail at some point due to the failure of

the workflow engine, failure of the service, or failure of the network fabric [64]. The capability to dynamically select alternative services ensures recovery from service failure. The myGrid/Taverna project provides users a way to encapsulate alternative services into the workflow at design time. Another approach is to find an alternative service at run time using general semantic service discovery technologies. We believe that identifying and storing the alternatives of a service ahead of time can increase performance by eliminating this semantic service discovery process. The method can also improve the correctness of finding alternative services. The alternativity of services can be automatically identified when a new service is added to the system and refined during the workflow translation process. The alternatives of a service_i can be represented as a named service property, alternativeOf. The alternativeOf property is transitive: if service_i is an alternative of service_m and service_m is an alternative of service_x, then service_i is an alternative of service_x.

- Quality profile of services. As more services with similar functionalities are published, it is important to define qualitative metrics that help in the selection of optimal services. Modeling the quality of service and approaches for choosing optimal services have been well studied for several years [10]. While there are a number of quality criteria that can be used for ranking services, different systems choose different sets of metrics and quality models for computing the overall quality of a service. We define quality with four attributes:

Quality(cost, trustness, executiontime, failurerate)

cost is the fee needed to execute an operation_ij; it is provided by the service provider;

trustness defines users' preference for using the operation_ij based on their experiences; it is annotated by users;

executiontime and failurerate define the performance of an operation_ij; they are collected and calculated from each run of a workflow or service.

Other QoS properties, such as security, may also be added when needed. The overall quality of each service can be computed periodically or during the optimization process using a QoS computation model similar to the algorithm defined in [48].

- Mapping between abstract workflow and concrete workflow. The construction of the abstract workflow represents the knowledge that scientists have about their domain and the services/tools provided in the system. The abstract workflow and the semantic annotation of the concrete workflow are represented using the ontology; the concrete workflow is also represented using a particular workflow specification that can be invoked on the workflow engine. Recording the mapping between abstract and concrete workflows enables finding similar workflows in the system given a workflow in a different representation format.

The knowledge about the connectivity of services, alternativity of services, quality of services, and workflow representations is typically stored in tables.
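As a sketch of how an overall quality score might be computed from these four attributes, the following ranks candidate operations by a weighted sum of normalized metrics. The equal weights, min-max normalization, and the two candidate services are illustrative assumptions, not the specific model of [48].

```python
def overall_quality(candidates, weights=None):
    """Rank service operations by a weighted sum of normalized QoS values.
    cost, executiontime, and failurerate are 'lower is better'; trustness
    is 'higher is better'.  Each candidate is a tuple
    (name, cost, trustness, executiontime, failurerate)."""
    weights = weights or {"cost": 0.25, "trustness": 0.25,
                          "executiontime": 0.25, "failurerate": 0.25}

    def norm(values, lower_is_better):
        # Min-max normalize to [0, 1], oriented so that 1 is always best.
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0
        return [(hi - v) / span if lower_is_better else (v - lo) / span
                for v in values]

    names = [c[0] for c in candidates]
    cols = list(zip(*[c[1:] for c in candidates]))
    scores = [0.0] * len(candidates)
    for col, key, lower in zip(cols,
                               ("cost", "trustness", "executiontime", "failurerate"),
                               (True, False, True, True)):
        for i, s in enumerate(norm(col, lower)):
            scores[i] += weights[key] * s
    return sorted(zip(names, scores), key=lambda p: -p[1])

# Two hypothetical alternative services implementing the same interface.
ranked = overall_quality([("clustalw_at_siteA", 1.0, 0.9, 20.0, 0.02),
                          ("clustalw_at_siteB", 2.0, 0.7, 40.0, 0.05)])
print(ranked[0][0])  # clustalw_at_siteA
```

Because the quality metrics change dynamically, such a score would be recomputed during the optimization step rather than stored with the concrete workflow.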

6.3.2 Knowledge discovery

The knowledge discovery component resides in the workflow composer. It is responsible for communicating between the workflow composer and the knowledge management component during the workflow translation process to find appropriate knowledge in the system. It is also responsible for selecting and replacing services with their optimal alternatives during the optimization process, and for finding a replacement at run time. The knowledge discovery component exchanges requests with the knowledge management component.

6.4 Translation process

The process of translating an abstract workflow into a concrete workflow involves the discovery of appropriate services and the resolution of connectivity between services in order to accomplish the tasks defined in the abstract workflow.

6.4.1 Service discovery and matchmaking process

During the translation process, the workflow composer issues a query to find appropriate services that can be used to accomplish the defined task. For example, the composer may be interested in finding an operation that performs the task aligning. We assume that one property of an operation is annotated using #performtask, a vocabulary term defined in the OWL-based bioinformatics ontology of the myGrid project (http://www.mygrid.org.uk/ontology). A general query returns all services whose #performtask property equals #aligning. More sophisticated discovery processes use reasoning capabilities to infer a subsumption relationship between the requested service and the services described using the ontology. For example, suppose there is an operation that has been annotated with

the property #performtask using the vocabulary #pairwise local aligning. In the ontology definition, the class #pairwise local aligning is not an asserted subclass but an inferred subclass of #aligning. With subsumption reasoning, not only services annotated with #aligning but also services annotated with #pairwise local aligning should be returned.

The general translation from an abstract workflow to a concrete workflow requires solving the connectivity between two executable services with mismatched or inappropriate input and output. The mismatching problem may be introduced by inaccurate semantic annotations, incomplete semantic annotations, or inaccurate ontological reasoning (see Figure 6.4). One false positive example: a DDBJ-XML service whose output carries the semantic annotation Sequence Data Record actually returns a document in a self-defined format, while the NCBI blast service whose input carries the semantic annotation Sequence Data Record requires FASTA-formatted sequence data. The connectivity of these two services is identified as a match but in fact is not. This type of error can be detected by experts at design time or after the formed workflow runs and returns incorrect results. The true negative case can be detected automatically at translation time. Adaptor, shim, or mediator [74] technologies are used to align or modify poorly typed inputs and outputs of consecutive services in a workflow. These mediators are stored in mediator pools, and discovering such a mediator is achieved with ontologies and machine reasoning, just as for the discovery of normal services. Most research has focused on how to discover these mediators using semantic web technology and machine reasoning. A general mediation process comes down to methods for translating the output of one web service into the input of the next.
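The subsumption-based lookup described above can be sketched as follows. The tiny class hierarchy and service annotations are hypothetical stand-ins for the myGrid ontology, and a real system would delegate the subclass inference to a DL reasoner rather than a hand-written closure.

```python
# Minimal subsumption-aware service discovery sketch (hypothetical data).
subclass_of = {                      # asserted child -> parent edges
    "pairwise_local_aligning": "local_aligning",
    "local_aligning": "aligning",
    "multiple_aligning": "aligning",
}

def subsumed_by(concept, target):
    """True if `concept` equals `target` or is a (transitive) subclass of it."""
    while concept is not None:
        if concept == target:
            return True
        concept = subclass_of.get(concept)
    return False

services = {                         # operation name -> its performtask annotation
    "blastp": "pairwise_local_aligning",
    "clustalw": "multiple_aligning",
    "querygene": "retrieving",
}

def discover(task):
    """Return all operations whose performtask is subsumed by `task`."""
    return sorted(op for op, t in services.items() if subsumed_by(t, task))

print(discover("aligning"))  # ['blastp', 'clustalw']
```

A plain equality query for #aligning would return nothing here; it is the subsumption closure that surfaces the operations annotated with the more specific concepts.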

[Figure 6.4. The mismatching problem may be introduced by inaccurate annotation, incomplete semantic annotation, or inaccurate ontological reasoning during the translation process. Detected matches are compared against real matches, giving true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). FP and FN cases, caused by inaccurate annotation, missing semantic annotation, or inaccurate ontological reasoning, may be detected by experts at design time or after a run; TN cases can be detected automatically. FP example: DDBJ-XML (output annotated as sequence data record, but in a self-defined format) connected to NCBI blast (input annotated as sequence data record, FASTA format). TN example: GenBankService (output: GenBank record) connected to Blastp (input: protein sequence) through a mediator, adaptor, or shim.]

6.4.2 Knowledge reuse

With the incrementally added information in the knowledge base, solving connectivity can be done entirely at the syntax level without consulting the domain ontology. As time goes by, converting an abstract workflow to a concrete workflow may be achieved by finding a mediator between two services in the knowledge base. Thus, ontologies need to be consulted only for those parts of a workflow that have never been used before. The manual translation process will be

required just once for every new element of the set of components in a workflow, and when a new service is added to the registry. The problem of solving the connectivity between two services can be converted into the problem of finding a path between two nodes in a connectivity graph. During the translation process, instead of resolving the connectivity from scratch using semantic reasoning technology, the composer can reuse stored knowledge to support semi-automatic and automatic composition.

1. Given a service or operation, all services or operations connected to it can be found by table lookup and presented to the users, who can choose one based on their expertise. Since the connectivity stored in the table was verified during previous workflow creation, we expect finding an accurate match to be both more likely and faster than applying semantic reasoning techniques from scratch.

2. Given two services or operations, find one or a sequence of services or operations between them (mediators) that can connect the two together. This can be converted into the problem of finding a path from service or operation A to service or operation B. Since the connectivity structure of services and operations in the knowledge base is a graph, a shortest path algorithm (Dijkstra's) is applicable.

3. This concept can be extended to a wider use case in which users know the exact input they can provide and the output they are trying to get. A general planning technique tries to find a service or operation that accepts this input and a service or operation that generates this output. Using the

connectivity structure, the path between the input and output can be found, if one exists.

6.4.3 Implementation and evaluation

The connectivity between two services is identified automatically when a new service is registered in the semantics-enabled registry, using the matching rules defined in Section 6.3. As more services are registered, the connectivity graph is formed. Since the automatic identification process may introduce mismatches, the mismatched cases can be corrected during the workflow translation process with knowledge from experts (see Figure 6.5). During the workflow translation process, the knowledge discovery component can find the path between two services/operations and suggest the next available services/operations by searching the knowledge base at the syntactic level. The search function is implemented with Dijkstra's algorithm.

[Figure 6.5. The creation process of the connectivity graph: during the registration process, the connectivity of a newly added service is identified automatically and stored in the knowledge base; during the workflow translation / service composition process, the connectivity is refined and updated and the workflow decomposed.]
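The path-search step can be sketched with a textbook Dijkstra implementation over the connectivity graph. The graph below is a hypothetical fragment (including an invented `xml2fasta` mediator node); a real deployment might weight edges with the quality metrics above rather than the uniform cost used here.

```python
import heapq

def shortest_service_path(graph, source, target):
    """Dijkstra over the service connectivity graph.  `graph` maps a node
    to {neighbor: edge_cost}.  Returns the node list from source to
    target, or None if no mediator chain exists."""
    dist, prev = {source: 0.0}, {}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:                 # reconstruct the path
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist.get(node, float("inf")):
            continue                       # stale queue entry
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(queue, (nd, nbr))
    return None

# Hypothetical fragment: a format-converting mediator links two services.
graph = {
    "DDBJ-XML": {"xml2fasta": 1.0},
    "xml2fasta": {"NCBI-blast": 1.0},
    "GenBankService": {"gb2fasta": 1.0},
    "gb2fasta": {"blastp": 1.0},
}
print(shortest_service_path(graph, "DDBJ-XML", "NCBI-blast"))
# ['DDBJ-XML', 'xml2fasta', 'NCBI-blast']
```

The intermediate nodes on the returned path are exactly the mediator chain needed to glue the two end services together.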

TABLE 6.1

PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS

Number of    Number of        Load RDF repository    Average match detection time
services     matched pairs    (milliseconds)         per service (milliseconds)
200          10               1547                   12.02
400          34               2346                   13.01
600          84               2600                   12.31
800          138              3015                   12.35
1000         225              3325                   12.51

The connectivity graph approach is evaluated on a Dell laptop with a 1.5 GHz Pentium M CPU and 512 MB of RAM. Service descriptions are randomly generated using 418 concepts from domain ontologies (myGrid and MoGServ) for semantic types and 10 defined concepts for data types. Each service contains one operation, and each operation has one input and one output. The measured performance of the match detection process during service registration is reported in Table 6.1. The number of matched pairs is the number of identified pairs of services in which one service's output can be fed as input to the other service. Although the time to load the RDF repository (Sesame) increases with the number of generated services, loading is typically done once. The average matching time when a new service is registered in the repository is about 12-13 milliseconds. The search function using the shortest path algorithm is evaluated on the connectivity graph created from 1000 randomly generated semantic web services. The

TABLE 6.2

PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS

Number of    Number of    Average path search     Connectivity graph load
nodes        arcs         time (milliseconds)     time (milliseconds)
724          587          Less than 1             220

graph is formed from the matched pairs and the input/output of each service in a matched pair. The measured performance of the path finding process is reported in Table 6.2. Loading the connectivity graph is typically done once. The average path search time is less than 1 millisecond, and the longest path between two nodes has 9 intermediate nodes. These preliminary results suggest that the performance of our implementation is acceptable; further testing with real services and workflows is needed to fully validate our approach.

6.5 Workflow reuse

Both abstract and concrete workflows can be viewed as graphs. With this graph representation, graph matching techniques can be applied to find similar workflows in the system. Although in-depth graph-theoretic research is not the main focus of this investigation, we are interested in applying an efficient algorithm to find similar workflows in the system given the graph representation of an abstract or concrete workflow.

SUBDUE (available at http://cygnus.uta.edu/subdue/) is a graph-based knowledge discovery system that finds structural and relational patterns in data representing entities and relationships. SUBDUE represents data using a labeled, directed graph in which entities are represented by labeled vertices or subgraphs, and relationships are represented by labeled edges between the entities. The SUBDUE graph match utility [18] is part of the SUBDUE data mining system. The graph match utility can perform exact and inexact graph matches on directed or undirected graphs with labeled vertices and edges. It solves the graph isomorphism problem, which is defined as: given two graphs G1 and G2, is it possible to permute (or relabel) the vertices of one graph so that it is equivalent to the other?

For example, a scientist may have a scientific process in mind such as: "I'd like to get all ATP alpha subunits of plastids in my MoG investigation, do multiple sequence alignments, and get an alignment report in a format that I can feed into my local PAUP program." A possible abstract workflow she may define is similar to Figure 6.6. The given workflow is converted to the graph representation that can be fed into the match algorithm. The match algorithm computes the similarity of the given workflow against all workflows stored in the knowledge base. The match cost returned by the SUBDUE algorithm is the measurement we use to rank the similarity of workflows; if two graphs are identical, the match cost is 0. The costs of the various graph match transformations affect the results and can be changed based on the importance of each transformation. For example, we might define the cost of substituting a vertex or edge label to be higher than the cost of deleting the vertex or edge; with this specified, the algorithm can find more suitable results. The threshold for returned workflows is defined based on the match cost. One or more of the most similar workflows are returned and presented to users. Users

[Figure 6.6. The graph representation of a workflow describing a scientific process. The workflow (input query_term → task retrieving → task aligning → output multiple_alignment_report) is shown both as a graph view and in SUBDUE input format:

v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_alignment_report
e 3 4 hasnext
e 3 1 hasinput
e 4 2 hasoutput
e 3 6 performtask
e 4 7 performtask
e 1 5 hasparameter
e 2 8 hasparameter]

may decide to use these workflows as templates to manipulate their workflow definition at the abstract or concrete workflow level. Alternatively, users may decide to use a returned workflow to conduct their experiments.

6.6 Related work

Abstract and concrete workflows have been introduced in various scientific workflow literature and systems [22, 24, 73]. These two representations create views of certain aspects of a workflow that meet the interests of users with different levels of knowledge of the services and of a particular domain. However, in these systems and literature, the notions of concrete workflow and optimal workflow are

combined and are often not distinguished as two separate representations. We believe that separating these two workflow representations provides the flexibility of dynamic binding, the ability to select optimal services, and easier integration of Grid resource management services.

The translation of an abstract workflow into a concrete workflow is a process of service discovery and service composition. It normally uses an ontology to annotate services and applies reasoning and matchmaking technologies to form a workflow. A number of research investigations focus on automating this process and assume that the ontological model is well defined and services are correctly annotated, which is not always the case. Rao et al. [76] present an approach that addresses the reality of incomplete annotation; the framework helps users become better at annotating composable functionality over time. The enhanced system and methodology proposed in this chapter are intended to reuse knowledge that has been verified by others, providing users more accurate guidance for service discovery using that stored knowledge. The importance of reuse and repurposing of workflows has been reported in [37, 104]. Goderis et al. [38] present a graph-based approach to finding similar concrete workflows on the web; this approach is similar to ours but uses a different graph matching algorithm and different graph representations.

6.7 Conclusion and future work

In this chapter, we present the importance of implementing workflow and knowledge reuse. In order to support that reuse, we propose and describe a methodology and an enhanced workflow system. It includes a hierarchical workflow structure consisting of four levels that allows users to specify workflows at different levels of abstraction, based on their knowledge and experience. Two components are added to the workflow system to collect and analyze the reusable products of the workflow translation process as new services are added to the system.

The proposed methodology is being used in the design and implementation of a service-oriented system supporting bioinformatics research. Based on the successful design and implementation of the system (MoGServ) [110], we developed an ontological model for data and service annotation in the system. At the current stage, the number of services, operations, and workflows in the system is relatively small, but it is expected to grow with usage. The future MoGServ is intended to support genomic research and provide a workbench for biologists in the Indiana Center for Insect Genomics (ICIG) 1, a research center composed of three academic institutional partners. Users can define a genomic research workflow through a web interface for a particular application. This may result in higher productivity for genomics researchers and in synergy resulting from the transparent integration of data and analysis tools from multiple locations. We believe that the enhanced workflow system with the knowledge reuse capability can provide more accurate guidelines during the workflow creation process and make the process more efficient. A systematic evaluation is being conducted.

1 http://ctdrt.bio.nd.edu/index.php?content=projectinfo.php&projectno=4

CHAPTER 7

SUMMARY AND FUTURE WORK

7.1 Summary

In this dissertation, we present a practical experiment in building a service-oriented system upon current web services technologies and bioinformatics middleware. The first prototype of this system integrates data and services from other service providers. It is being evaluated on a phylogenetic research application, Mother of Green (MoG). Our evaluation demonstrates that a service-oriented architecture can accelerate scientific research, increase research productivity, and provide a new approach to doing science.

Based on the successful design and implementation of this prototype, we present an enhanced system with semantic annotation of services and data. The enhancement aims at allowing life science researchers to define their experiments at different levels based on their knowledge of the tools, data, and the system. The semantically enriched data allows easier reuse and sharing and enables experiments involving search to be conducted.

Few current practical methodologies and workflow systems for service composition and workflow creation in e-science consider the potential for reuse: sharing the knowledge gained during the service composition process and reusing existing workflows in whole or in part. We believe that providing a capability

for reuse of this knowledge and these workflows could be an important component of a workflow system. We propose a methodology and an enhanced system design to facilitate the reuse of knowledge and workflows. It contains a hierarchical workflow structure representation, knowledge management and knowledge discovery components to capture and manage the reusable knowledge in a system, and an approach for using a graph matching algorithm to discover similar workflows.

7.2 Limitations and future work

The future MoGServ is intended to support genomic research and provide a workbench for biologists in the Indiana Center for Insect Genomics (ICIG). The ICIG includes three partners: the University of Notre Dame, Purdue University, and Indiana University. The future MoGServ can help a user at an ICIG site discover data or computational web services that are available at that site, at other ICIG partners' locations, or elsewhere on the web. Several limitations of the initial MoG implementation discussed in Chapter 3 must be addressed in order to use the system across multiple sites: security, resource management, and end-user-oriented workflow creation. Several improvements and theoretical approaches are described in Chapter 5 and Chapter 6; however, more work needs to be done. Future work may be conducted along several lines.

Integration of GridSAM. We will explore a way to integrate the MoG system with a grid computing architecture so that security, resource allocation, and resource management can be shifted to existing grid computing technologies. In the MoGServ implementation, we have a simple resource management mechanism implemented by two components, a job manager and a job launcher. A more sophisticated mechanism can be used

to integrate into the MoGServ system. The GridSAM 1 Web Service is a WS-I compliant Web Service implementation of the GridSAM service interface as well as the upcoming Global Grid Forum Basic Execution Service interface. It integrates with the GridSAM Core Engine to provide remote job launching and file staging capability as described in a Job Submission Description Language document. As a new feature introduced to GridSAM 2.0.0, a sophisticated authorization mechanism provides a powerful capability to control incoming service requests on a user/group basis. Enhancement of user interface. In the current MoGServ, the data model design has table to keep the personlize information for individual user. An authorization component should be built in the system to enable users to access the permitted services and to personalize their own workspace. A web portal will be built to enable users to create an account, login and logout with username and password. The user account information including the access level will be stored in a database. The GridSphere portal framework [39], an open-source portlet based web portal, is one of the candidates. Enhancement of data annotation and ontological model. The current ontological model captures the meta data that users need to query their data provenance. As the system is used more and more, the ontological model may need to be updated and add new properties and concepts in. Integration of presented new functionalities into the system. In this dissertation research work, we present a new approach to improve the reuse workflows and their by products as well as a heireachical workflow struc- 1 http://gridsam.sourceforge.net/2.0.0/index.html 133

ture. The future work is adding these funcationailities into the system by developing an easy-use interface for users to define workflows at multiple levels; allow users choose similar workflows and manipulate the workflow as desired. 134
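The hierarchical, multi-level workflow definition described above can be sketched with a small composite structure. This is a hypothetical illustration, not the MoGServ implementation: the `Workflow` class and the example service names are chosen only to show how a reusable sub-workflow nests inside a larger analysis.

```python
# Hypothetical sketch of a hierarchical workflow representation: a node is
# either a primitive service invocation or a composite that groups
# sub-workflows, so workflows can be defined and reused at multiple levels.

class Workflow:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # empty for a primitive service

    def is_primitive(self):
        return not self.children

    def flatten(self):
        """Expand the hierarchy into the flat sequence of primitive
        services, e.g. for feeding a graph-based similarity comparison."""
        if self.is_primitive():
            return [self.name]
        out = []
        for child in self.children:
            out.extend(child.flatten())
        return out

# A phylogenetic analysis built from a reusable alignment sub-workflow.
align = Workflow("alignment", [Workflow("QueryLocal"), Workflow("ClustalW")])
tree = Workflow("phylogeny", [align, Workflow("Convert2NEXUS"), Workflow("PAUP")])
print(tree.flatten())  # ['QueryLocal', 'ClustalW', 'Convert2NEXUS', 'PAUP']
```

Reusing the `alignment` composite inside a larger workflow is the kind of multi-level definition the proposed interface would expose to users.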

APPENDIX A

GLOSSARY

BPEL4WS Business Process Execution Language for Web Services provides a language for the formal specification of business processes and business interaction protocols.

BLAST Basic Local Alignment Search Tool, an algorithm used to compare nucleotide or protein sequences to sequence databases and calculate the statistical significance of matches.

ClustalW a tool for global multiple alignment of DNA and protein sequences.

FASTA a common sequence format that begins with a single-line description followed by lines of sequence data.

HGT Horizontal gene transfer is a process in which an organism transfers genetic material to another cell that is not its offspring. HGT occurs outside of the mechanisms of Mendelian genetics, crossing species, order, and family reproductive barriers.

J2EE Java 2 Platform, Enterprise Edition defines the standard for developing component-based multitier enterprise applications.

JXTA an open source peer-to-peer platform created by Sun Microsystems in 2001.

LGT Lateral gene transfer is a process in which an organism transfers genetic material to another cell that is not its offspring. LGT occurs within the cell, from endosymbiont genomes to the host cell nucleus.

MoG Mother of Green, a collaborative research project on plastid phylogenetic analysis involving information technologists and biologists.

MoGServ A service-oriented system for data integration and data analysis for phylogenetic analysis.

NEXUS The NEXUS format was designed by David Maddison, Wayne Maddison, and David Swofford to facilitate the interchange of input files between programs used in phylogeny and classification.

OGSA Open Grid Services Architecture

OWL Web Ontology Language

OWL-S the ontology description of web services using OWL.

PAUP a program for phylogenetic analysis using parsimony, maximum likelihood, and distance methods.

Phylip a set of modular programs for performing numerous types of phylogenetic analysis.

Phylogeny also called phylogenesis, the origin and evolution of a group of organisms.

Phylogenetics the study of evolutionary relationships among various groups of organisms.

RDF Resource Description Framework is the basic standard for knowledge sharing and reuse in the semantic web.

REST Representational State Transfer is a term coined by Roy Fielding to describe an architectural style for networked systems.

SAM Sequence Alignment and Modeling System.

SOA Service-oriented architecture

SOAP Simple Object Access Protocol is a protocol for exchanging messages among requesters and providers.

SOC Service-oriented computing

UDDI Universal Description, Discovery and Integration provides a standard registry for publishing, discovery, and reuse of web services.

WSDL Web Service Description Language defines the abstract interface of services.

WS-I an open industry organization chartered to promote Web services interoperability; it creates, promotes, and supports generic protocols for the interoperable exchange of messages between Web services.

WSRF Web Services Resource Framework

XML Extensible Markup Language

XSLT XSL Transformations is a language for transforming XML documents into other XML documents. XSL specifies the styling of an XML document by using XSLT to describe how the document is transformed into another XML document that uses the formatting vocabulary.

A.1 Pictures

Figure A.1. Time line for the origin of life and major invasions giving rise to mitochondria and plastids. [27]

Figure A.2. Gene transfer to the nucleus. [27] 138

Figure A.3. Symbioses process [69] 139

Figure A.4. ATP Synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny. 140

APPENDIX B

MOGSERV MANUAL

B.1 Main

MoGServ is accessible through the URL http://almond.cse.nd.edu:10000/bioinfor1. If you are inside the ND network, you may access another host of MoGServ at http://biocomp.science.nd.edu:8080/mog. (See Figure B.1.)

Figure B.1. The main menu of the MoGServ

B.2 Retrieve genome and gene data from the NCBI database

The data collection service retrieves complete genome sequences and gene sequences using terms defined by users. Retrieved sequences are stored in the local database. The service is executed weekly during the weekend, or daily during the night, to update the database. See Figure B.2.

Figure B.2. A web interface provides users a way to define the data of interest.

B.3 Query local database

This service allows users to create gene sequence sets or genome sequence sets by querying the local database. The metadata of these sequences are indexed using the Lucene index and search engine. Valid queries, such as chlo* or ATP and atp, follow the Lucene syntax. Users input their query and choose either gene or complete genome sequences. A set of sequences is returned. Users can examine the set and delete sequences from it. Then users can choose either create new set or add to an existing set. create new set puts these sequences together in order to do sequence alignment. A set id is returned to users for further reference. add to an existing set puts these sequences into an existing set (whose id is input by users). See Figures B.3 and B.4. Users can also download these sets in different formats.

Figure B.3. Input the query term from this interface and choose the gene or genome database

B.4 Set management

Users can upload a set of sequences in FASTA format to the local database. These sequences can come from users' own lab experiments, which may not yet be ready to submit to the public database. They can also be a small number of sequences not in the local database at that time. These sequences are annotated using the appropriate metadata description. See Figure B.5. Users can query the information of a set, such as the creation date and the origination of the set, as shown in Figure B.6. Users can also use the set filter service to find the intersection of organisms among multiple sets. See Figure B.7.

B.5 Data analysis services

The MoGServ system provides seven data analysis services: blastn, blastp, blastx, tblastn, tblastx, MegaBLAST, and ClustalW. In order to use blast and megablast to do sequence alignment, users need to input two sequence sets: a base set and a compare set. A base set is a set of sequences that corresponds to the database field on the NCBI blast search website. A compare set is a set of sequences that is compared against a base set; it corresponds to the search field on the NCBI blast search website. Base sets and compare sets need to be created using the Query Local or Set management services. Users can define a few parameters, such as e-value, window size, and so on. A job id will be returned and shown in the browser. Users should record this id number for further reference. When the task is executed, the required sequences are retrieved from the local database and input to the blast (megablast) program. Comparison results are stored in the local file system for downloading. Figure B.8 shows the tblastn service.

In order to use the ClustalW service, users need to define the set id and the sequence type. See Figure B.9. The job id is returned for further reference.

B.6 Job management

This service allows users to query job information and monitor the execution status of their submitted jobs. There are three execution statuses: submit, start, and finish. output becomes a hot link when the execution status turns to finish. Users can follow the link to view the input and output of each data analysis job. See Figures B.10, B.11, and B.12.
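The three execution statuses can be pictured as a tiny state machine. The sketch below is purely illustrative (the `Job` class is not MoGServ code); only the status names submit, start, and finish come from the manual above.

```python
# Illustrative sketch (not MoGServ code) of the job lifecycle described
# above: a job moves submit -> start -> finish, and its output link is
# usable only once the status turns to finish.

_NEXT_STATUS = {"submit": "start", "start": "finish"}

class Job:
    def __init__(self, job_id):
        self.job_id = job_id        # id returned to the user at submission
        self.status = "submit"

    def advance(self):
        """Move to the next execution status; finish is terminal."""
        self.status = _NEXT_STATUS.get(self.status, self.status)

    def output_available(self):
        return self.status == "finish"

job = Job(42)
job.advance()   # submit -> start
job.advance()   # start -> finish
print(job.status, job.output_available())  # finish True
```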

Figure B.4. The results from querying the local database

Figure B.5. Users may copy and paste particular sequences and upload them to the local database

Figure B.6. Set information 148

Figure B.7. The set filter service is used to find the intersection of organisms among multiple sets.

Figure B.8. tblastn interface in MoGServ 150

Figure B.9. ClustalW Interface in MoGServ 151

Figure B.10. Job management interface shows the status, input link, and output link of a job

Figure B.11. An example input of a ClustalW analysis; the set id is a hot link through which users can view the sequence information in this set.

Figure B.12. An example output of a ClustalW analysis; users can download, convert, and view the results.

APPENDIX C

DEVELOPMENT AND DEPLOYMENT TOOLKITS

Some development and deployment toolkits we used for the implementation are listed in Table C.1. All the software packages are open source and can be downloaded from the listed URLs.

TABLE C.1

OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT

Apache Axis (axis-1_2RC2): a SOAP engine for developing and hosting web services. http://ws.apache.org/axis/
Tomcat (jakarta-tomcat-5.0.18): a J2EE compliant servlet container. http://tomcat.apache.org/
Taverna (1.4): a GUI-based workbench for creating, executing, and monitoring workflows. http://taverna.sourceforge.net/
Apache Lucene (1.4.3): a high-performance, full-featured text search engine library written in Java. http://lucene.apache.org/java/docs
PostgreSQL (8.0.3): a relational database system. http://www.postgresql.org/
Protege (3.2): an ontology editor and knowledge-base framework with OWL support. http://protege.stanford.edu/
Pellet (1.3-beta2): an open-source Java-based OWL DL reasoner. http://pellet.owldl.com/
Sesame (1.2.6): an open source RDF framework with support for RDF Schema inferencing and querying. http://www.openrdf.org/about.jsp
SUBDUE (5.1.4): a graph-based data mining system. http://cygnus.uta.edu/subdue/

APPENDIX D

SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER 4

D.1 Complete genome sequence in XML

<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/insd_insdseq.dtd">
<INSDSeq>
<INSDSeq_locus>NC_005042</INSDSeq_locus>
<INSDSeq_length>1751080</INSDSeq_length>
<INSDSeq_strandedness>double</INSDSeq_strandedness>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>circular</INSDSeq_topology>
<INSDSeq_division>BCT</INSDSeq_division>
<INSDSeq_update-date>24-JUL-2006</INSDSeq_update-date>
<INSDSeq_create-date>25-JUL-2003</INSDSeq_create-date>
<INSDSeq_definition>Prochlorococcus marinus subsp. marinus str. CCMP1375, complete genome</INSDSeq_definition>
<INSDSeq_primary-accession>NC_005042</INSDSeq_primary-accession>
<INSDSeq_accession-version>NC_005042.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>ref|NC_005042.1|</INSDSeqid>
<INSDSeqid>gnl|NCBI_GENOMES|310</INSDSeqid>
<INSDSeqid>gi|33239452</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_project>419</INSDSeq_project>
<INSDSeq_source>Prochlorococcus marinus subsp. marinus str. CCMP1375 (Prochlorococcus marinus SS120)</INSDSeq_source>
<INSDSeq_organism>Prochlorococcus marinus subsp. marinus str. CCMP1375</INSDSeq_organism>
<INSDSeq_taxonomy>Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus</INSDSeq_taxonomy>
...
<INSDSeq_feature-table>
...
<INSDFeature>
<INSDFeature_key>CDS</INSDFeature_key>
<INSDFeature_location>1447640..1449106</INSDFeature_location>

<INSDFeature_intervals> <INSDInterval> <INSDInterval_from>1447640</INSDInterval_from> <INSDInterval_to>1449106</INSDInterval_to> <INSDInterval_accession>NC_005042.1</INSDInterval_accession> </INSDInterval> </INSDFeature_intervals> <INSDFeature_quals> <INSDQualifier> <INSDQualifier_name>gene</INSDQualifier_name> <INSDQualifier_value>atpD</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>locus_tag</INSDQualifier_name> <INSDQualifier_value>Pro1591</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>note</INSDQualifier_name> <INSDQualifier_value>Produces ATP from ADP in the presence of a proton gradient across the membrane. The beta chain is a regulatory subunit </INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>codon_start</INSDQualifier_name> <INSDQualifier_value>1</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>transl_table</INSDQualifier_name> <INSDQualifier_value>11</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>product</INSDQualifier_name> <INSDQualifier_value>ATP synthase subunit B</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>protein_id</INSDQualifier_name> <INSDQualifier_value>NP_875982.1</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>db_xref</INSDQualifier_name> <INSDQualifier_value>GI:33241040</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>db_xref</INSDQualifier_name> <INSDQualifier_value>GeneID:1462973</INSDQualifier_value> </INSDQualifier> <INSDQualifier> <INSDQualifier_name>translation</INSDQualifier_name> <INSDQualifier_value>MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGK NPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIF 158

NVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFG
GAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKV
ALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGR
MPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARA
LAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRR
TVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEV
KEKAQKISADAKK</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
...
</INSDSeq_feature-table>
<INSDSeq_sequence>... </INSDSeq_sequence>
</INSDSeq>

The size of this example XML file is about 7.7 MB, while the size of the complete genome sequence in FASTA format is about 1.7 MB. The actual length of this sequence is 1751080 nt.

D.2 Example of an ATP synthase subunit B sequence

Fasta format:

>gi|33241040|ref|NP_875982.1| ATP synthase subunit B [Prochlorococcus marinus subsp. marinus str. CCMP1375]
MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGKNPAGQDVALTAEVQQLLGDHRVRAVA
MSGTDGLVRGMEAIDTGSAISVPVGEATLGRIFNVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKV
FETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKE
SGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSAL
LGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAA
KGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIE
KFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK

TinySeq XML:

<?xml version="1.0"?>
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/ncbi_tseq.dtd">
<TSeq>
<TSeq_seqtype value="protein"/>
<TSeq_gi>33241040</TSeq_gi>
<TSeq_accver>NP_875982.1</TSeq_accver>
<TSeq_sid>gnl|REF_uproscoff|Pro1591</TSeq_sid>
<TSeq_taxid>167539</TSeq_taxid>
<TSeq_orgname>Prochlorococcus marinus subsp. marinus str. CCMP1375

</TSeq_orgname>
<TSeq_defline>ATP synthase subunit B [Prochlorococcus marinus subsp. marinus str. CCMP1375]</TSeq_defline>
<TSeq_length>488</TSeq_length>
<TSeq_sequence>MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGKNPAGQD
VALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIFNVLGEPVDE
QGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQ
ELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGA
RMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVG
ELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTS
TMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFF
VAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK</TSeq_sequence>
</TSeq>

There are in total 182 complete genome sequences in the database and 878 ATP gene sequences.

D.3 Protein name

For each whole genome sequence, find all of the proteins that make up ATP synthase (see Table D.1).

TABLE D.1
NAME OF ATP SYNTHASE

Protein name: description
atpc: gamma chain
atp1: protein 1
atpi: chain a
atph: subunit c
atpg: chain b
atpf: chain b
atpd: delta chain
atpa: alpha chain
atpb: beta subunit
atpe: epsilon subunit
N/A: ATP synthase
chlm: Mg-protoporphyrin IX methyl transferase
ftrc: ferredoxin-thioredoxin reductase, catalytic chain

D.4 Syntax for searching the local database

The local database is indexed using the Lucene search engine. Refer to the Lucene query parser documentation 1 for a complete syntax description. There are two tables that store complete genome sequences and gene sequences, respectively. Table D.2 lists the syntax and examples for searching these databases. Table D.3 summarizes the fields used when we create the index.

1 http://lucene.apache.org/java/docs/queryparsersyntax.html

TABLE D.2
SYNTAX OF SEARCHING LOCAL DATABASE

Query type: example
single words: cyanobacteria
phrase: "ATP synthase"
field: name:atp AND gamma AND plastid
boolean: atpa NOT bacteria
grouping: atpa AND (plastid or cyanobacteria)

TABLE D.3
INDEXING FIELDS OF LOCAL DATABASE

gi: gi number of the sequence
accver: accver number of the sequence
name: name of the sequence
term: query defined by users and used to get this sequence from NCBI
taxonomy: taxonomy of the sequence provided by NCBI
cds: name of the protein that makes up ATP synthase (only in the gene table)
nucleotide gi: gi number of the corresponding nucleotide, which is also the gi from the complete genome (only in the gene table)
nucleotide name: name of the corresponding nucleotide sequence (only in the gene table)
default: the default field contains all the information described above, without specifying the field name

D.5 Workflow of sequence retrieval

Since we use web services provided by NCBI to retrieve the sequences, failures may occur during the data collection process. Recording the status of data retrieval in the database enables us to examine the integrity of the data. Parsing the XML file requires a large amount of memory. Deal with the redundancy of sequences, but record the query term. Update the database weekly or daily.

Pseudo code for retrieving complete genome sequences:

get search term from ncbi_retrieve table
for each term
    get sequence in fasta format
    set retrieve_gene_status as ready

Pseudo code for retrieving gene sequences:

get acceid from ncbi_genomes table where retrieve_gene_status is ready
for each acceid
    update retrieve_gene_status as start in ncbi_genome table
    get sequence in GB XML format
    parse the XML to get particular protein sequence acceid
    use acceid to get protein sequence in fasta format
    compute the corresponding nucleotide sequence
    get taxonomy of the sequence

    update retrieve_gene_status as finish in ncbi_genome table
    update the taxonomy for the sequence in ncbi_genome table

D.6 ClustalW input

An example of the ClustalW input file:

<?xml version="1.0" encoding="utf-8"?>
<inputparams>
<setid>142</setid>
<sequencetype>nucleotid</sequencetype>
<title>sequence</title>
<topdiags></topdiags>
<alignment>full</alignment>
<window></window>
<gapext></gapext>
<outputtree></outputtree>
<output>aln1</output>
<tossgaps>true</tossgaps>
<ktup></ktup>
<kimura>true</kimura>
<matrix>blosum</matrix>
<scores>percent</scores>
<outorder>aligned</outorder>
<gapopen></gapopen>
<gapclose></gapclose>
<gapdist></gapdist>
<pairgap></pairgap>
</inputparams>

An example of the ClustalW output file:

<?xml version="1.0" encoding="utf-8"?>
<output>
<title>sequence</title>
<ebiid>clustalw-20060925-04170320</ebiid>
<file>clustalw-20060925-04170320.txt</file>
<file>clustalw-20060925-04170320.aln</file>
<file>clustalw-20060925-04170320.dnd</file>
</output>

D.7 Blast

An example blastn input:

<?xml version="1.0" encoding="utf-8"?>

<inputparams>
<expect>10</expect>
<wordsize>11</wordsize>
<matrix></matrix>
<opengap></opengap>
<extendgap></extendgap>
<searchsetid>130</searchsetid>
<searchsettype>gene</searchsettype>
<searchseqtype>nucleotide</searchseqtype>
<dbsetid>130</dbsetid>
<dbsettype>gene</dbsettype>
<dbseqtype>nucleotide</dbseqtype>
</inputparams>

D.8 PAUP

The result generated from the ClustalW program is converted to NEXUS format from the web interface (see Figure B.12). The data conversion is done with a service provided in the system. Here is a portion of the NEXUS file for all ATP beta subunits.

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=27 NCHAR=1503;
FORMAT DATATYPE=DNA INTERLEAVE MISSING=-;
[Name: Saccharum1 Len: 1503 Check: 0]
[Name: Saccharum2 Len: 1503 Check: 0]
[Name: Zea_mays Len: 1503 Check: 0]
[Name: Triticum_a Len: 1503 Check: 0]
...
MATRIX
Saccharum1 ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGGTTTCCA CAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
Saccharum2 ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGGTTTCCA CAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
Zea_mays ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGATTTCCA CAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
...
Calycanthu TGA
Pinus_kora ---
Pinus_thun ---
Marchantia ---
Physcomitr ---

Anthoceros ---
Huperzia_l ---
;
END;

Here is the configuration file used for PAUP to generate a phylogenetic tree from the NEXUS file:

#NEXUS
begin paup;
set autoclose=yes warntree=no warnreset=no;
log start file=thisfile.log replace;
execute atpb_27.nex;
Set criterion=distance;
dset dist = hky85;
showdist;
nj;
nj breakties = random;
bootstrap nreps=100 brlens=yes keepall=yes search=heuristic;
savetrees from=1 to=1 savebootp=both maxdecimals=0;
contree all/strict=no file=thisfilename.tre replace showtree=yes;
end;

Figures D.1 and D.2 show the generated tree results.

Figure D.1. Phylogenetic tree generated by PAUP

Figure D.2. The phylogenetic tree file generated by PAUP can be viewed by other programs

APPENDIX E

SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER 6

This is a sample output of comparing two workflows using SUBDUE. The inexact graph match program computes the cost of transforming the larger of the input graphs into the smaller one according to predefined transformation costs. The program returns this cost and the mapping of vertices in the larger graph to vertices in the smaller graph. A smaller match cost represents higher structural similarity between the two workflows.

// Costs of various graph match transformations
#define INSERT_VERTEX_COST 1.0 // insert vertex
#define DELETE_VERTEX_COST 1.0 // delete vertex
#define SUBSTITUTE_VERTEX_LABEL_COST 1.0 // substitute vertex label
#define INSERT_EDGE_COST 1.0 // insert edge
#define INSERT_EDGE_WITH_VERTEX_COST 1.0 // insert edge with vertex
#define DELETE_EDGE_COST 1.0 // delete edge
#define DELETE_EDGE_WITH_VERTEX_COST 1.0 // delete edge with vertex
#define SUBSTITUTE_EDGE_LABEL_COST 1.0 // substitute edge label
#define SUBSTITUTE_EDGE_DIRECTION_COST 1.0 // change directedness of edge
#define REVERSE_EDGE_DIRECTION_COST 1.0 // change direction of directed edge

[xxiang1@localhost subdue-5.1.4]$ bin/gm graphs/graph1.g graphs/mytest1.g
Match Cost = 15.000000
Mapping (vertices of larger graph to smaller):
1 -> deleted
2 -> 3
3 -> 1
4 -> 2
5 -> deleted
6 -> deleted
7 -> 4
8 -> deleted
[xxiang1@localhost subdue-5.1.4]$
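The unit costs above can be mirrored in a small sketch that scores a given vertex mapping. This is a simplified illustration of the idea, not SUBDUE's gm implementation: it charges only for vertex deletions, label substitutions, and edge insertions/deletions, and the two toy workflow graphs and their service labels are hypothetical.

```python
# Simplified sketch of the inexact-match cost: transform the larger graph
# into the smaller one under a given vertex mapping, paying unit cost for
# each vertex deletion, label substitution, and edge insertion/deletion.

def transform_cost(larger, smaller, mapping,
                   del_vertex=1.0, sub_label=1.0,
                   del_edge=1.0, ins_edge=1.0):
    """larger/smaller: (labels: dict vertex->label, edges: set of (u, v)).
    mapping: vertex of the larger graph -> vertex of the smaller graph,
    or None when the vertex is deleted."""
    l_labels, l_edges = larger
    s_labels, s_edges = smaller
    cost = 0.0
    for v, m in mapping.items():
        if m is None:
            cost += del_vertex            # vertex deleted
        elif l_labels[v] != s_labels[m]:
            cost += sub_label             # vertex label substituted
    mapped = set()
    for (u, v) in l_edges:
        mu, mv = mapping.get(u), mapping.get(v)
        if mu is None or mv is None or (mu, mv) not in s_edges:
            cost += del_edge              # edge has no counterpart
        else:
            mapped.add((mu, mv))
    cost += ins_edge * len(s_edges - mapped)  # edges missing in the larger graph
    return cost

# Two toy workflow graphs: query -> align -> tree vs. query -> tree.
larger = ({1: "QueryLocal", 2: "ClustalW", 3: "PAUP"}, {(1, 2), (2, 3)})
smaller = ({1: "QueryLocal", 2: "PAUP"}, {(1, 2)})
mapping = {1: 1, 2: None, 3: 2}
print(transform_cost(larger, smaller, mapping))  # 4.0
```

Here the cost is 4.0: one vertex deletion, two deleted edges, and one inserted edge; identical graphs under the identity mapping would score 0.0.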

An example WSDL description of a service provided in MoGServ (see Figure E.1). Creating a workflow using the Taverna workbench (see Figure E.2). The XScufl format (see Figure E.3). A sample data annotation in RDF format displayed with RDF Gravity 1 (see Figure E.4). A sample service annotation in RDF format displayed with RDF Gravity 2 (see Figure E.5).

1 http://semweb.salzburgresearch.at/apps/rdf-gravity/download.html
2 http://semweb.salzburgresearch.at/apps/rdf-gravity/download.html
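Annotations like those in Figures E.4 and E.5 are plain RDF/XML. As a minimal sketch of emitting such an annotation (the namespace, class, and property names below are hypothetical, not the actual MoGServ ontology):

```python
# Hypothetical sketch: serialize a small job annotation as RDF/XML using
# only the standard library. The mog: namespace and property names are
# illustrative placeholders, not the ontology defined in the dissertation.
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MOG = "http://example.org/mogserv#"   # placeholder namespace

ET.register_namespace("rdf", RDF)
ET.register_namespace("mog", MOG)

root = ET.Element(f"{{{RDF}}}RDF")
job = ET.SubElement(root, f"{{{RDF}}}Description",
                    {f"{{{RDF}}}about": MOG + "job42"})
ET.SubElement(job, f"{{{MOG}}}service").text = "ClustalW"   # which service ran
ET.SubElement(job, f"{{{MOG}}}inputSet").text = "142"       # input set id
ET.SubElement(job, f"{{{MOG}}}status").text = "finish"      # execution status

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

A file produced this way can be loaded into an RDF store such as Sesame (Table C.1) or inspected visually with RDF Gravity.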

Figure E.1. The WSDL description of the QueryLocal service hosted in MoGServ, which provides an operation to create a set in the local database. This operation accepts two parameters and returns the set id.

Figure E.2. One example of using the Taverna workbench to create, test, and run a workflow. This workflow accepts user input, searches the local database, creates a set, aligns the set using ClustalW, and converts the ClustalW result to NEXUS format, which can be fed to PAUP.

Figure E.3. XScufl workflow format represents the workflow created using the Taverna workbench. 172

Figure E.4. Annotation of job and set information using the defined ontological model. The sample RDF file is displayed using RDF Gravity.

Figure E.5. Annotation of a service using the defined ontological model. The sample RDF file is displayed using RDF Gravity.

BIBLIOGRAPHY

1. S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403-410, 1990.

2. K. Amin, G. von Laszewski, M. Hategan, N. J. Zaluzec, S. Hampton, and A. Rossi. GridAnt: A client-controllable grid workflow system. In Proceedings of the 37th Hawaii International Conference on System Science, 2004.

3. Axis. Apache Axis, Apache Software Foundation. URL http://ws.apache.org/axis.

4. BEANSHELL. Lightweight scripts for Java. URL http://www.beanshell.org/.

5. K. A. Beiter and K. Ishii. Integration of producibility and product performance tools within a web-service environment. In ASME 2003 Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2003.

6. B. Benatallah, M. Dumas, Q. Z. Sheng, and A. H. Ngu. Declarative composition and peer-to-peer provisioning of dynamic web services. In Proceedings of the 18th International Conference on Data Engineering (ICDE 02), 2002.

7. T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.

8. T. Berners-Lee, W. Hall, J. Hendler, N. Shadbolt, and D. J. Weitzner. Creating a science of the web. Science, 313(5788):769-771, August 2006.

9. BIOWBI. Bioinformatic Workflow Builder Interface (BioWBI). URL http://www.alphaworks.ibm.com/tech/biowbi.

10. P. A. Bonatti and P. Festa. On optimal service selection. In Proceedings of the 14th International Conference on World Wide Web, 2005.

11. BPWS4J. The IBM Business Process Execution Language for Web Services Java run time. URL http://www.alphaworks.ibm.com/tech/bpws4j.

12. D. Buttler, M. Coleman, T. Critchlow, R. Fileto, W. Han, C. Pu, D. Rocco, and L. Xiong. Querying multiple bioinformatics information sources: Can semantic web research help? SIGMOD Record, 31(4):59-64, 2002.

13. M. Carman, L. Serafini, and P. Traverso. Web service composition as planning. In ICAPS 2003 Workshop on Planning for Web Services, 2003.

14. S. Carrere and J. Gouzy. Remora: a pilot in the ocean of BioMoby web services. Bioinformatics, 22(7), 2006.

15. S. Christley, X. Xiang, and G. Madey. An ontology for agent-based modeling and simulation. In Agent 2004 Conference, 2004.

16. M. Clamp, J. Cuff, S. M. Searle, and G. J. Barton. The Jalview Java alignment editor. Bioinformatics, 20(3):426-427, 2004.

17. Collaxa. Collaxa BPEL server. URL http://www.collaxa.com/.

18. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32-41, 2000.

19. J. Day and R. Deters. Selecting the best web service. In Proceedings of the 2004 Conference of the Centre for Advanced Studies on Collaborative Research, pages 293-307, 2004.

20. R. de Knikker, Y. Guo, J. long Li, A. K. Kwan, K. Y. Yip, D. W. Cheung, and K.-H. Cheung. A web services choreography scenario for interoperating bioinformatics applications. BMC Bioinformatics, 5(25), 2004.

21. D. de Roure, N. R. Jennings, and N. Shadbolt. The semantic grid: Past, present and future. Proc. of the IEEE, 93(3), March 2005.

22. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1(1), 2003.

23. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, and M. Livny. Grid Computing, volume 3165/2004 of Lecture Notes in Computer Science, chapter Pegasus: Mapping Scientific Workflows onto the Grid, pages 11-20. Springer Berlin / Heidelberg, 2004.

24. L. A. Digiampietri, C. B. Medeiros, and J. C. Setubal. A framework based on web service orchestration for bioinformatics workflow management. Genetics and Molecular Research, 4(3):535-542, 2005.

25. A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learning to match ontologies on the semantic web. The VLDB Journal (The International Journal on Very Large Data Bases), 12, 2003.

26. A. Dogac, Y. Kabak, G. Laleci, S. Sinir, A. Yildiz, S. Kirbas, and Y. Gurcan. Semantically enriched web services for the travel industry. SIGMOD Record, 33(3), 2004.

27. S. D. Dyall, M. T. Brown, and P. J. Johnson. Ancient invasions: From endosymbionts to organelles. Science, 304(9), April 2004.

28. I. Elgedawy, Z. Tari, and M. Winikoff. Exact functional context matching for web services. In ICSOC 04, 2004.

29. V. Ermolayev, N. Keberle, O. Kononenko, S. Plaksin, and V. Terziyan. Towards a framework for agent-based semantic web service composition. International Journal of Web Service Research, 2004.

30. ETTK. Emerging Technologies Toolkit. URL http://www.alphaworks.ibm.com/tech/wssem.

31. N. M. Fast, J. C. Kissinger, D. S. Roos, and P. J. Keeling. Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids. Mol. Biol. Evol., 18(3):418-426, 2001.

32. I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Lecture Notes in Computer Science, 2150, 2001.

33. K. Garwood, P. Lord, H. Parkinson, N. Paton, and C. Goble. Pedro ontology services: A framework for rapid ontology markup. In Proc. of the 2nd European Semantic Web Conference, pages 578-591. Springer Verlag, 2005.

34. Y. Gil, E. Deelman, J. Blythe, C. Kesselman, and H. Tangmunarunkit. Artificial intelligence and grids: Workflow planning and beyond. IEEE Intelligent Systems, special issue on E-Science, Jan/Feb 2004.

35. GO. Gene Ontology Consortium. URL http://www.geneontology.org/.

36. C. Goble, C. Wroe, R. Stevens, and the myGrid consortium. The myGrid project: services, architecture and demonstrator. In UK e-Science AHM, September 2003.

37. A. Goderis, U. Sattler, P. Lord, and C. Goble. Seven bottlenecks to workflow reuse and repurposing. In Fourth International Semantic Web Conference (ISWC 2005), volume 3792, pages 323-337, Galway, Ireland, 2005.

38. A. Goderis, P. Li, and C. Goble. Workflow discovery: the problem, a case study from e-science and a graph-based solution. In IEEE International Conference on Web Services (ICWS 06), 2006.

39. GridSphere. GridSphere portal framework. URL http://www.gridsphere.org/gridsphere/gridsphere?cid=2.

40. T. Gruber. What is an ontology. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

41. A. Gómez-Pérez, R. González-Cabero, and M. Lama. A framework for design and composition of semantic web services. American Association for Artificial Intelligence, 2004.

42. JLaunch. JLaunch from the Duke Bioinformatics Shared Resource. URL http://dbsr.duke.edu/.

43. B. Johansson and P. Krus. A web service approach for model integration in computational design. In ASME 2003 Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2003.

44. E. Kawas, M. Senger, and M. D. Wilkinson. BioMoby extensions to the Taverna workflow management and enactment software. BMC Bioinformatics, Nov. 2006.

45. M. Klein and A. Bernstein. Toward high-precision service retrieval. Internet Computing, 8(1):30-36, January/February 2004.

46. U. Kuter, E. Sirin, D. Nau, B. Parsia, and J. Hendler. Information gathering during planning for web service composition. In The Third International Semantic Web Conference (ISWC2004), Hiroshima, Japan, 2004.

47. L. Li and I. Horrocks. A software framework for matchmaking based on semantic web technology. In Proceedings of the 12th International Conference on World Wide Web, 2003.

48. Y. Liu, A. H. Ngu, and L. Zeng. QoS computation and policing in dynamic web service selection. In WWW2004, 2004.

49. P. Lord, S. Bechhofer, M. Wilkinson, G. Schiltz, D. Gessler, D. Hull, C. Goble, and L. Stein. Applying semantic web services to bioinformatics: Experiences gained, lessons learnt. In Third International Semantic Web Conference (ISWC2004), 2004.

50. P. Lord, P. Alper, C. Wroe, and C. Goble. Feta: A light-weight architecture for user oriented semantic service discovery. In Proceedings of the Second European Semantic Web Conference, ESWC 2005, pages 17-31. Springer-Verlag LNCS 3532, May-June 2005.

51. Lucene. Apache Lucene. URL http://lucene.apache.org/java/docs/index.html.

52. B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039-1065, Dec 2005.

53. E. M. Maximilien and M. P. Singh. Toward autonomic web services trust and selection. In ICSOC 04, 2004.

54. S. A. McIlraith, T. C. Son, and H. Zeng. Semantic web services. IEEE Intelligent Systems, pages 46-53, March/April 2001.

55. B. Medjahed, A. Bouguettaya, and A. K. Elmagarmid. Composing web services on the semantic web. The VLDB Journal, 2003.

56. E. Mena, V. Kashyap, A. Sheth, and A. Illarramendi. Observer: An approach for query processing in global information systems based on interoperation across pre-existing ontologies. In Intl. Conf. on Cooperative Information Systems (CoopIS 96), 1996.

57. F. Meyer. Genome sequencing vs. Moore's law: Cyber challenges for the next decade. CTWatch Quarterly, 2(3), August 2006.

58. N. Milanovic and M. Malek. Current solutions for web service composition. IEEE Internet Computing, 8(6):51-59, November/December 2004.

59. J. A. Miller and P. A. Fishwick. Investigating ontologies for simulation modeling. In The 37th Annual Simulation Symposium, April 2004.

60. M. G. Nanda, S. Chandra, and V. Sarkar. Decentralizing execution of composite web services. In OOPSLA 04, 2004.

61. NCBI. Entrez: Making use of its power. Briefings in Bioinformatics, 4(2), June 2003. URL http://www.ncbi.nih.gov/.

62. N. F. Noy and M. A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2000.

63. OGSA. Links to Open Grid Services Architecture. URL http://www.globus.org/ogsa/.
64. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, A. Wipat, and P. Li. Taverna, lessons in creating a workflow environment for the life sciences. In GGF Workflow Workshop, 2004.
65. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17), 2004.
66. M. Ouzzani, B. Benatallah, and A. Bouguettaya. Ontological approach for information discovery in internet databases. Distributed and Parallel Databases, 8(3), 2000.
67. OWL. W3C OWL Web Ontology Language overview. URL http://www.w3.org/TR/owl-features/.
68. R. D. M. Page. TreeView: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences, 12:357–358, 1996.
69. J. D. Palmer. The symbiotic birth and spread of plastids: How many times and whodunit? J. Phycol., 39, 2003.
70. M. P. Papazoglou and D. Georgakopoulos. Service-oriented computing. Communications of the ACM, 46(10), 2003.
71. A. Patil, S. Oundhakar, A. Sheth, and K. Verma. METEOR-S web service annotation framework. In Proceedings of the World Wide Web Conference, July 2004.
72. S. Pillai, V. Silventoinen, K. Kallio, M. Senger, S. Sobhany, J. Tate, S. Velankar, A. Golovin, K. Henrick, P. Rice, P. Stoehr, and R. Lopez. SOAP-based services provided by the European Bioinformatics Institute (EBI). Nucleic Acids Res, 33(1):W25–W28, 2005. URL http://www.ebi.ac.uk/tools/webservices/wsclustalw.html.
73. U. Radetzki and A. B. Cremers. IRIS: A framework for mediator-based composition of service-oriented software. In 2004 IEEE International Conference on Web Services (ICWS 2004), July 2004.
74. U. Radetzki, U. Leser, S. Schulze-Rauschenbach, J. Zimmermann, J. Lussem, T. Bode, and A. Cremers. Adapters, shims, and glue: service interoperability for in silico experiments. Bioinformatics, 22(9):1137–1143, 2006.

75. S. Ran. A model for web services discovery with QoS. ACM SIGecom Exchanges, 4(1), 2003.
76. J. Rao, D. Dimitrov, P. Hofmann, and N. Sadeh. A mixed initiative approach to semantic web service discovery and composition: SAP's guided procedures framework. In Proceedings of the IEEE International Conference on Web Services (ICWS 06), pages 401–410, 2006.
77. J. A. Raven and J. F. Allen. Genomics and chloroplast evolution: what did cyanobacteria do for plants? Genome Biology, 4, 2003.
78. J. Romero-Severson. Use case: How MOG web services enable scientific discovery. Technical report, University of Notre Dame, August 2006.
79. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
80. M. Sabou, C. Wroe, C. Goble, and H. Stuckenschmidt. Learning domain ontologies for semantic web service descriptions. Journal of Web Semantics, 3(4), 2005. Accessible from http://www.websemanticsjournal.org/ps/pub/2005-28.
81. SAWSDL. Semantic Annotations for Web Services Description Language Working Group. URL http://www.w3.org/2002/ws/sawsdl/.
82. C. Schmidt and M. Parashar. A peer-to-peer approach to web service discovery. World Wide Web, 7(2), 2004.
83. Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3), Sept 2005.
84. E. Sirin, J. Hendler, and B. Parsia. Semi-automatic composition of web services using semantic descriptions. In Web Services: Modeling, Architecture and Infrastructure Workshop, in conjunction with ICEIS2003, 2003.
85. K. Sivashanmugam, K. Verma, A. Sheth, and J. Miller. Adding semantics to web services standards. In Proceedings of the 1st International Conference on Web Services (ICWS 03), 2003.
86. K. Sivashanmugam, J. Miller, A. Sheth, and K. Verma. Framework for semantic web process composition. Special Issue of the International Journal of Electronic Commerce (IJEC), 2004.
87. Soaplab. SOAP-based analysis web services developed at the European Bioinformatics Institute (EBI). URL http://www.ebi.ac.uk/soaplab/.

88. SpeedR. URL http://lsdis.cs.uga.edu/proj/meteor/mwsdi.html.
89. N. Srinivasan, M. Paolucci, and K. Sycara. Semantic web service discovery in the OWL-S IDE. In Proceedings of the 39th Hawaii International Conference on System Sciences, 2006.
90. B. Srivastava and J. Koehler. Web service composition: current solutions and open problems. In ICAPS2003, 2003.
91. L. Stein. Creating a bioinformatics nation. Nature, 417(9), 2002.
92. L. D. Stein. Integrating biological databases. Nature Reviews Genetics, 4, 2003.
93. R. Stevens. Trends in cyberinfrastructure for bioinformatics and computational biology. CTWatch Quarterly, 2(3), August 2006. URL http://www.ctwatch.org/quarterly/.
94. R. Stevens, K. Glover, C. Greenhalgh, C. Jennings, S. Pearce, P. Li, M. Radenkovic, and A. Wipat. Performing in silico experiments on the grid: A user's perspective. In Proc. UK e-Science Programme All Hands Conference, 2003.
95. J. W. Stiller and D. C. Reel. A single origin of plastids revisited: Convergent evolution in organellar genome content. J. Phycol., 39, 2003.
96. I. Taylor, M. Shields, I. Wang, and A. Harrison. Visual grid workflow in Triana. Journal of Grid Computing, 3(3-4):153–169, September 2005. URL http://www.springerlink.com/openurl.asp?genre=article&issn=1570-7873&volume=3&issue=3&spage=153.
97. The Globus Project. The Globus project. URL http://www.globus.org.
98. W. van der Aalst. Don't go with the flow: Web services composition standards exposed. IEEE Intelligent Systems, Jan/Feb 2003.
99. Y. Wang and E. Stroulia. Semantic structure matching for assessing web-service similarity. In M. E. Orlowska, S. Weerawarana, M. P. Papazoglou, and J. Yang, editors, Service-Oriented Computing - ICSOC 2003, 2003.
100. R. Weber, C. Schuler, P. Neukomm, H. Schuldt, and H.-J. Schek. Web service composition with O'GRAPE and OSIRIS. In Proceedings of the 29th VLDB Conference, 2003.
101. M. D. Wilkinson and M. Links. BioMoby: An open source biological web service proposal. Briefings in Bioinformatics, 3(4), 2002.

102. WordNet. WordNet: A large lexical database of English, developed under the direction of George A. Miller. URL http://wordnet.princeton.edu/.
103. C. Wroe, R. Stevens, C. Goble, A. Roberts, and M. Greenwood. A suite of DAML+OIL ontologies to describe bioinformatics web services and data. International Journal of Cooperative Information Systems, 12(4):197–224, June 2003.
104. C. Wroe, C. Goble, A. Goderis, P. Lord, S. Miles, J. Papay, P. Alper, and L. Moreau. Recycling workflows and services through discovery and reuse. Concurrency and Computation: Practice and Experience, 2007.
105. WS. Web services architecture. URL http://www.w3.org/TR/ws-arch/#service_oriented_architecture. W3C Working Group Note, 11 February 2004.
106. WsBAW. Bioinformatic analysis workflow (WsBAW). URL http://www.alphaworks.ibm.com/tech/wsbaw.
107. WSIF. Web Services Invocation Framework (WSIF), Apache Software Foundation. URL http://ws.apache.org/wsif/.
108. X. Xiang and G. Madey. A semantic web services enabled web portal architecture. In International Conference on Web Services (ICWS2004), 2004.
109. X. Xiang and G. Madey. Improving the reuse of scientific workflows and their by-products. Working paper, 2007. URL http://www.nd.edu/~mog/papers/papers.html.
110. X. Xiang, G. Madey, and J. Romero-Severson. A service-oriented data integration and analysis environment for in-silico experiments and bioinformatics research. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 2007.
111. J. Yang. Web service componentization. Communications of the ACM, October 2003.
112. X. Yi and K. J. Kochut. Process composition of web services with complex conversation protocols: A colored Petri nets based approach. In Design, Analysis and Simulation of Distributed Systems (DASD 2004), 2004.
113. U. Zdun, M. Voelter, and M. Kircher. Design and implementation of an asynchronous invocation framework for web services. In The International Conference on Web Services - Europe 2003 (ICWS-Europe 03), 2003.

This document was prepared and typeset with pdfLaTeX, and formatted with the nddiss2ε class file (v1.0 [2004/06/15]) provided by Sameer Vijay.