Bio-DSGS: An Automated Bioinformatics Data Service Generation System

Journal of Computational Information Systems 7:8 (2011) 2989-2996
Available at http://www.jofcis.com
1553-9105/ Copyright 2011 Binary Information Press, August 2011

Shuang QIU, Yadong WANG, Liang CHENG, Yongzhuang LIU
Center for Biomedical Informatics, Harbin Institute of Technology, Harbin 150001, China
Corresponding author. Email address: ydwang@hit.edu.cn (Yadong WANG).

Abstract

With the development of bioinformatics at the molecular level, a significant amount of data has been generated and stored in various databases. Bioinformatics research often requires data from multiple databases. However, because these databases are geographically distributed and heterogeneous, data sharing and access remain difficult. In this paper, based on service-oriented architecture (SOA) and metadata management techniques, we propose and implement an integrated system named Bio-DSGS. Bio-DSGS enables bioinformatics databases to be automatically encapsulated into data services and published on the Internet. As a result, heterogeneous biological data resources become easier to access, share and integrate.

Keywords: Bioinformatics Database; Data Service; Metadata; Service-oriented Architecture

1. Introduction

With the development of genomics and the popularity of computer networks, a large number of bioinformatics databases have been established to collect, analyze, sort and publish biological information on animals, plants and microorganisms [1]. Bioinformatics research often draws data from several of these databases. However, the existing databases pose several problems for data sharing and access. Firstly, it is not easy to share data held in geographically distributed databases. Secondly, because the data lack a standardized description, they are hard for users to understand. Thirdly, there is no uniform, user-friendly way to access the data.

If the owners of bioinformatics databases provide data services for data access, the difficulty of sharing data can be largely resolved. A data service [2] is built on Web services [3] and provides access to data: it queries the bioinformatics database and returns the result to the user according to the request. However, for a large number of bioinformatics databases, writing a data service for each one by hand is difficult for developers: the workload is heavy, the cost is high, the process is error-prone, and the resulting data access is not uniform.

In this paper, in response to the sharing and access requirements of bioinformatics data, and based on SOA [4], we propose and implement a bioinformatics data service generation system (Bio-DSGS). By defining metadata and object class files according to the characteristics of data services and bioinformatics databases, the system generates data services automatically and significantly reduces their development cost. By introducing metadata management [5,6], the system addresses the problems of hard-to-understand data, heterogeneity and other issues in bioinformatics databases; furthermore, it realizes data services with uniform access that are automatically generated from bioinformatics databases.

The remainder of the paper is organized as follows: Section 2 presents the functions of each part of the system; Section 3 illustrates the workflow; Section 4 shows the generation flow of data services and demonstrates the feasibility and effectiveness of the system through access tests on the generated data services. Finally, the conclusion and future work are given in Section 5.

2. System Architecture

The architecture of Bio-DSGS is shown in Figure 1. It consists of three parts: the data resource, the data management engine and the data service management engine. To perform uniform data access, publish data as services and secure data access, the system uses a series of management tools, including the metadata-extraction tool, the class information-extraction tool, the data service-generation tool, and so on.

Fig.1 The Architecture of Bio-DSGS

2.1. Data Resource

Data resources are geographically distributed and heterogeneous bioinformatics databases derived from bioinformatics research.
The bioinformatics databases can be classified into four major categories: genome databases [7]; primary-structure databases of protein and nucleic acid sequences [8]; three-dimensional structure databases of biological macromolecules [9]; and secondary databases built on the three preceding categories for information and documentation.

2.2. Data Management Engine

The data management engine is the core module of Bio-DSGS. It ensures uniform data access by extracting metadata from the database and mapping each metadata file to a data object class file. It includes the metadata management engine and the data object management engine.

1) Metadata management engine

Bio-DSGS establishes its metadata schema based on the Dublin Core standard [10]. With this schema, the metadata management engine uses the metadata-extraction tool to extract information (such as datasheets, table fields/properties, data structure and database security access) from the bioinformatics database and converts it into metadata files. The metadata files are managed by the metadata-vocabulary-management tool, the metadata-model-management tool, and others. The functions of these two tools are as follows:

Metadata-vocabulary-management tool: manages the descriptive vocabularies used for data in the metadata.
Metadata-model-management tool: manages the models of the metadata.

2) Data object management engine

A data service cannot access the database through the metadata file, which contains only abstract information about the database. The metadata file must therefore be converted into a data object class file, through which the data service can access the data resource. The data object management engine performs this conversion, manages the data object class files and controls data access operations. Based on the data object class files, the system can provide uniform data access.
Data resources are accessed by operating on the data object class files, so the data object management engine must control access to data resources and enforce security, which ultimately enhances the security and reliability of data resource access. To this end, two tools are provided:

Access-management tool for data object class files: transforms access to the data object class files into access to the database.
Security-management tool for data object class files: ensures the security of access to the data object class files.

2.3. Data Service Management Engine

The data service management engine consists of the data service generation engine and the data service container. Using the data service generation tool, the data service management engine encapsulates the general data service package and the data object class files into a bioinformatics data service, and then publishes the data service into the data service container, so that clients can conveniently call the data service and uniform data access is achieved. The data service container manages the published data services through an indexing service, which helps clients retrieve published data services.
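
The mapping from metadata to a data object class, and the translation of object-level access into database access, can be illustrated with a minimal Python sketch. It is not the Bio-DSGS implementation: the paper does not specify the format of the metadata files or class files, so the METADATA dictionary, the functions build_object_class and find, and the two demo rows are all hypothetical, and an in-memory SQLite database stands in for a bioinformatics database.

```python
import sqlite3
from dataclasses import make_dataclass

# Hypothetical metadata record for one table (assumption: the real
# metadata file format is not given in the paper).
METADATA = {
    "table": "mirna2disease",
    "fields": {"record_id": int, "mirna": str, "disease": str},
}


def build_object_class(metadata):
    """Map a metadata record to a data object class, one attribute per field."""
    cls = make_dataclass(
        metadata["table"].capitalize(),
        list(metadata["fields"].items()),
    )
    cls.table = metadata["table"]  # remember which table the class represents
    return cls


def find(conn, cls, **criteria):
    """Translate an object-level request into a parameterized SQL query."""
    columns = list(cls.__dataclass_fields__)
    unknown = set(criteria) - set(columns)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    sql = f"SELECT {', '.join(columns)} FROM {cls.table}"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in criteria)
    rows = conn.execute(sql, tuple(criteria.values())).fetchall()
    return [cls(*row) for row in rows]


# Demo with made-up rows (not taken from any real database).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mirna2disease "
    "(record_id INTEGER PRIMARY KEY, mirna TEXT, disease TEXT)"
)
conn.executemany(
    "INSERT INTO mirna2disease VALUES (?, ?, ?)",
    [(1, "hsa-let-7g", "disease A"), (2, "hsa-mir-125b-2", "disease B")],
)
Mirna2Disease = build_object_class(METADATA)
hits = find(conn, Mirna2Disease, mirna="hsa-let-7g")
print(hits)
```

In this sketch the caller never writes SQL: it names attributes of the data object class, and the access step checks them against the class definition before building the query, which is one simple way to keep access uniform across tables.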

3. System Workflow

Bio-DSGS converts bioinformatics databases into data services by using a series of tools, which facilitates access to and sharing of bioinformatics database resources. The data service generation process and the client access process through the data service are described below.

3.1. Data Service Generation Process

The data service generation process is shown in Figure 2. It consists of three steps:

1) Metadata extraction. The metadata file is constructed by the metadata-extraction tool.
2) Data object class information extraction. The metadata is mapped into the data object class file by the class information-extraction tool.
3) Publication. The class files and the general data service package are published as the data service by the data service generation tool.

Fig.2 Data Service Generation Process

3.2. Database Access Process in the Client by the Data Service

As shown in Figure 3, the process consists of four steps:

1) The user submits a data access request to the data service through the data service client.
2) The data service transforms the data access request into a data object class access request.
3) The access-management tool for the data object class files converts the data object class access request into a database access request, which performs the database access.
4) The data service returns the result of the data access to the user.

Fig.3 Database Access Process in the Client

4. Example

In this section, we illustrate how to construct a data service in the system and verify the reliability and validity of the generated data service through a database access performance test in the client.

4.1. miR2Disease Database

The miR2Disease database [11] is a manually curated database that aims to provide an integrated resource on microRNA deregulation in human diseases. It contains 1939 relationships between 299 microRNAs and more than 100 diseases, collected from more than 600 publications.

4.2. Generating a Data Service Based on miR2Disease

A data service based on miR2Disease is generated in the following steps:

Step 1. The datasheets in the miR2Disease database are listed in Table 1. The metadata-extraction tool extracts metadata from the miR2Disease database covering the datasheets, the properties of the datasheets and the access information of the database; the metadata for the mirna2disease sheet is shown in Table 2. The metadata files are then generated.
Step 2. The class information-extraction tool converts the metadata files into a series of data object class files.
Step 3. The general data service package and the data object class files are packaged as the data service and then published. The address of the miR2Disease data service is http://mlg.hit.edu.cn:8080/mir/services/mirservice?wsdl.

4.3. Database Access Performance Test in the Client

The system establishes a reusable layer of data services between the bioinformatics database and the application program. Users obtain data by accessing the data service instead of the database. Going through the data service slightly increases the data access time; in practice, however, network transmission and the database query take far longer. The only additional cost is the time the service needs to convert the user's request into a database access request. This additional time is negligible compared with the network transmission delay and the database query time and is acceptable to users, so the service is feasible.

Here we take queries on the miR2Disease database as an example. We measured three quantities separately: the switching interval (from the moment the system receives a request to the moment the request has been converted into a database access request), the query time in the database, and the network transmission time for the user's request and the results. The results are shown in Table 3.

The test results show that the network transmission delay and the database query time depend on a variety of factors, such as the size of the database, the complexity of the query statement and the amount of data transferred. Compared with these, the switching interval is so short that it can be ignored. The additional time introduced by the data service is much less than 1 s (below 1 ms in every test in Table 3), which is acceptable to users. Using the data service has little effect on database access performance for users. Therefore the data service can meet users' requests.
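
The size of this overhead can be bounded with a short calculation. The Python sketch below takes the query and network transmission times of the five tests in Table 3, in the order they are listed, and treats the reported "< 1ms" switching interval as an upper bound of 1 ms; the function name and variable names are illustrative only.

```python
# (query time in ms, network transmission time in ms) for the five
# tests in Table 3, in the order they appear there.
MEASUREMENTS = [(578, 2342), (26, 1325), (130, 2081), (678, 3109), (191, 2207)]

# Table 3 reports "< 1ms" for every switching interval, so 1 ms is
# used here as an upper bound (an assumption of this sketch).
SWITCHING_UPPER_MS = 1.0


def overhead_upper_bound(query_ms, network_ms, switching_ms=SWITCHING_UPPER_MS):
    """Upper bound on the share of total response time spent on switching."""
    return switching_ms / (switching_ms + query_ms + network_ms)


bounds = [overhead_upper_bound(q, n) for q, n in MEASUREMENTS]
for (q, n), bound in zip(MEASUREMENTS, bounds):
    print(f"query={q} ms, network={n} ms -> switching share < {bound:.3%}")

worst = max(bounds)
```

Under these figures the switching interval accounts for less than 0.1% of the total response time in every test, including the fastest one (26 ms query, 1325 ms transmission).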

Table 1 Names and Contents in the miR2Disease Database Sheet

| Table Name | Content Description |
| --- | --- |
| disease_mapping | Stores all of the standard disease names and identifiers in Disease Ontology. |
| mirna2disease | Stores the correspondence between microRNAs and diseases. |
| reference_information | Stores the references for the correspondence between microRNAs and diseases. |
| tarbaselist | Stores the names of the target genes of microRNAs. |
| targetlist | Also stores the names of the target genes of microRNAs, but from a different source than tarbaselist. |

5. Conclusion

We have analyzed the problems of data access and sharing in bioinformatics databases, and designed and implemented a data service generation system for bioinformatics databases. The system can establish data services for geographically distributed and heterogeneous bioinformatics databases and thus provide a unified way of accessing data. The next step is to construct a uniform ontology of data services in the field of bioinformatics and to add a semantic description to each data service. This will facilitate the integration of data services and lay the basis for integrating bioinformatics databases.
Table 2 Metadata Extracted from the mirna2disease Database Sheet

| Data Element | Definition Range | Conceptual Domain | Remarks |
| --- | --- | --- | --- |
| id | Integer, positive, unlimited length | Identifier; corresponds to the primary key of the database sheet, no practical significance | Corresponds to the record_id field in the sheet |
| mirna | String, such as hsa-mir-125b-2; only the standard name of the microRNA, no alias or abbreviation; the name is divided into 3 or 4 sections linked by "-" | The standard name of the microRNA | Corresponds to the mirna field in the sheet |
| doid | Integer, positive, unlimited length | The disease id provided by Disease Ontology | Corresponds to the DOID field in the sheet |
| disease | String, lowercase English; an abbreviation may be appended, for example follicular lymphoma (FL); length less than 255 | Name of the disease, using the generic name provided by Disease Ontology | Corresponds to the Disease field in the sheet |
| expression | String, enumerator, three values: normal, down-regulated, up-regulated | Expresses the amount of genes transcribed into microRNA | Corresponds to the Expression field in the sheet; the amount of transcription is 10 if normal, 5 if down-regulated, 15 if up-regulated |
| method | String, enumerator; values: microarray, Northern blot, qRT-PCR, etc. | The method for measuring the correspondence between miRNA and disease | Corresponds to the Method field in the sheet; microarray denotes the microarray method, Northern blot denotes RNA blotting, qRT-PCR denotes reverse transcription quantitative PCR |
| description | String, unlimited length, unlimited format | Detailed description of the association between disease and microRNA | Corresponds to the Description field in the sheet |
| target | String, length less than 255; names of target genes are joined by ","; use "unknown" for none or uncertain | The names of the target genes corresponding to the microRNA | Corresponds to the target field in the sheet; the result comes from experiments by the user |
| tarbase | String, length less than 255; names of target genes are joined by ","; use "unknown" for none or uncertain | The names of the target genes corresponding to the microRNA | Corresponds to the tarbase field in the sheet; the result comes from references |
| years | Integer, four digits, year | Time | Corresponds to the Years field in the sheet; the year in which the reference was published |
| pmid | Integer, usually eight digits | The id of the reference | Corresponds to the PMID field in the sheet; references on the correspondence between disease and microRNA |
| relationship_type | Integer, enumerator, three values: 0, 1, 2 | The type of relationship between disease and microRNA | Corresponds to the relationship_type field in the sheet; 0 means the relationship has not been determined, 1 means the microRNA results in the disease, 2 means the disease results in the microRNA |
| causal | String, enumerator, two values: Unspecified and Causal | Whether there is a causal influence between miRNA and disease | Corresponds to the Causal field in the sheet; Unspecified means the relationship is unspecified, Causal means a definite causal relationship |

Table 3 Performance Test Invoked by the Service

| Items in the Test | Switching Interval | Query Time | Time for Network Transmission |
| --- | --- | --- | --- |
| Obtain all information about hsa-let-7g in the mirna2disease table | < 1ms | 578ms | 2342ms |
| Inquire the publication date of a publication in the reference_information table | < 1ms | 26ms | 1325ms |
| Search all publications about lung disease | < 1ms | 130ms | 2081ms |
| Inquire the microRNA information associated with diseases in a given publication | < 1ms | 678ms | 3109ms |
| Obtain the first 200 records in the mirna2disease table | < 1ms | 191ms | 2207ms |
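
To make the definition ranges in Table 2 concrete, the following Python sketch checks a record against a subset of them (id, mirna, doid, disease, expression, years, relationship_type and causal). It is an illustration only: the paper does not describe how Bio-DSGS enforces these ranges, the function names are hypothetical, and the sample records are made up.

```python
EXPRESSION_VALUES = {"normal", "down-regulated", "up-regulated"}
RELATIONSHIP_TYPES = {0, 1, 2}
CAUSAL_VALUES = {"Unspecified", "Causal"}


def is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_record(record):
    """Return the list of Table 2 range violations found in a record."""
    errors = []
    for name in ("id", "doid"):
        if not is_positive_int(record.get(name)):
            errors.append(f"{name}: expected a positive integer")
    sections = str(record.get("mirna", "")).split("-")
    if len(sections) not in (3, 4) or not all(sections):
        errors.append("mirna: expected 3 or 4 sections linked by '-'")
    disease = record.get("disease")
    if not isinstance(disease, str) or len(disease) >= 255:
        errors.append("disease: expected a string shorter than 255 characters")
    if record.get("expression") not in EXPRESSION_VALUES:
        errors.append("expression: expected normal, down-regulated or up-regulated")
    years = record.get("years")
    if not is_positive_int(years) or not 1000 <= years <= 9999:
        errors.append("years: expected a four-digit year")
    if record.get("relationship_type") not in RELATIONSHIP_TYPES:
        errors.append("relationship_type: expected 0, 1 or 2")
    if record.get("causal") not in CAUSAL_VALUES:
        errors.append("causal: expected Unspecified or Causal")
    return errors


# Made-up sample records for demonstration.
good = {
    "id": 1, "mirna": "hsa-mir-125b-2", "doid": 1,
    "disease": "follicular lymphoma (FL)", "expression": "up-regulated",
    "years": 2008, "relationship_type": 0, "causal": "Unspecified",
}
bad = dict(good, mirna="let7g", years=98)

good_errors = validate_record(good)
bad_errors = validate_record(bad)
print(good_errors)
print(bad_errors)
```

Expressing each range as an explicit check shows how a standardized metadata description can double as a machine-checkable contract for the data a service returns.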

Acknowledgement

The research work is supported by the National Key Technology R&D Program (2008BAI64B03).

References

[1] M.Y. Galperin, G.R. Cochrane. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res, 37(Database issue):D1-4, 2009.
[2] B.G. Cui. A Data Service Virtualization Mechanism for Dynamic Data Integration. Journal of Computational Information Systems, 4(2):665-670, 2008.
[3] P. Muschamp. An Introduction to Web Service. BT Technology Journal, 22(1):9-18, 2004.
[4] S. Kleijnen, S. Raju. An Open Web Services Architecture. Queue, 1(1):38-46, 2006.
[5] G.K. Phokion. Schema Mappings, Data Exchange, and Metadata Management. Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Baltimore, USA, pages 61-75, 2005.
[6] M. Ji, X. Xu, H.C. Zhu, Z. Li, S.S. Wang. Design and Realization of the Mining Multi-scale Spatial Database based on Metadata Operation. The 3rd IEEE International Conference on Computer Science and Information Technology. Chengdu, China, pages 175-179, 2010.
[7] T. Hubbard et al. The Ensembl Genome Database Project. Nucleic Acids Res, 30(1):38-41, 2002.
[8] A. Bairoch, R. Apweiler. The SWISS-PROT Protein Sequence Database and its Supplement TrEMBL in 2000. Nucleic Acids Res, 28(1):45-48, 2000.
[9] A. Stein, R.B. Russell, P. Aloy. 3did: Interacting Protein Domains of Known Three-dimensional Structure. Nucleic Acids Res, 33(suppl 1):D413-D417, 2005.
[10] S. Bird, G. Simons. Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources. Computers and the Humanities, 37(4):375-388, 2003.
[11] Q.H. Jiang, Y.D. Wang, Y.Y. Hao, L.R. Juan, M.X. Teng, X.J. Zhang, M.M. Li, G.H. Wang and Y.L. Liu. miR2Disease: a Manually Curated Database for microRNA Deregulation in Human Disease. Nucleic Acids Res, 37(Database issue):D98-104, 2009.