Bio-DSGS: An Automated Bioinformatics Data Service Generation System




Journal of Computational Information Systems 7:8 (2011) 2989-2996
Available at http://www.jofcis.com

Shuang QIU, Yadong WANG, Liang CHENG, Yongzhuang LIU
Center for Biomedical Informatics, Harbin Institute of Technology, Harbin 150001, China

Corresponding author. Email address: ydwang@hit.edu.cn (Yadong WANG).
1553-9105/ Copyright 2011 Binary Information Press, August 2011

Abstract

With the development of bioinformatics at the molecular level, a large amount of data has been generated and stored in various databases. Bioinformatics research often requires data from several of these databases. However, because bioinformatics databases are geographically distributed and heterogeneous, data sharing and access remain difficult. In this paper, based on service-oriented architecture and metadata management techniques, we propose and implement an integrated system named Bio-DSGS. Bio-DSGS enables bioinformatics databases to be automatically encapsulated as data services and published on the Internet. As a result, heterogeneous biological data resources become easier to access, share and integrate.

Keywords: Bioinformatics Database; Data Service; Metadata; Service-oriented Architecture

1. Introduction

With the development of genomics and the popularity of computer networks, a large number of bioinformatics databases have been established to collect, analyze, sort and publish biological information on animals, plants and microorganisms [1]. In bioinformatics research, data is often drawn from multiple bioinformatics databases. However, the existing databases pose several problems for data sharing and access. Firstly, it is not easy to share data held in geographically distributed databases. Secondly, because the data lack a standardized description, users find the data hard to understand. Thirdly, there is no uniform, convenient way for users to access the data.

If the owners of bioinformatics databases provide data services for data access, the difficulty of sharing data can be largely resolved. A data service [2] is built on Web services [3] and provides access to data: it queries the bioinformatics database according to the user's request and returns the result to the user. However, given the large number of bioinformatics databases, preparing data services for all of them manually would burden developers with a heavy workload, high cost, a high risk of error and non-uniform data access.

In this paper, to meet the requirements of sharing and accessing bioinformatics data, and based on SOA [4], we propose and implement a bioinformatics data service generation system (Bio-DSGS). By defining metadata and data object class files according to the characteristics of data services and bioinformatics databases, the system generates data services automatically and significantly reduces their development cost. By introducing metadata management [5,6], the system addresses the problems of data that is hard to interpret, heterogeneity and related issues in bioinformatics databases; furthermore, the data services generated automatically from the databases offer uniform access.

The remainder of the paper is organized as follows: Section 2 presents the functions of each part of the system; Section 3 illustrates the workflow; Section 4 shows the generation of data services for an example database and demonstrates the feasibility and effectiveness of the system by testing access to the generated data services; Section 5 concludes and outlines future work.

2. System Architecture

The architecture of Bio-DSGS, shown in Figure 1, consists of three parts: the data resource, the data management engine and the data service management engine. To provide uniform data access, publish data as services and secure data access, the system uses a series of management tools, including the metadata-extraction tool, the class information-extraction tool and the data service-generation tool.

Fig.1 The Architecture of Bio-DSGS

2.1. Data Resource

Data resources are the geographically distributed and heterogeneous bioinformatics databases that arise from bioinformatics research.
The bioinformatics databases can be classified into four major categories: genome databases [7], primary-structure (sequence) databases of proteins and nucleic acids [8], three-dimensional structure databases of biological macromolecules [9], and secondary databases built on the three preceding categories for information and documentation.

2.2. Data Management Engine

The data management engine is the core module of Bio-DSGS. It ensures uniform data access by extracting metadata from the database and mapping the metadata file to a data object class file. It includes the metadata management engine and the data object management engine.

1) Metadata management engine

Bio-DSGS establishes its metadata schema based on the Dublin Core standard [10]. With this schema, the metadata management engine uses the metadata-extraction tool to extract information (such as datasheets, table fields/properties, data structure and database security access) from the bioinformatics database and converts it into metadata files. The metadata files are managed by several tools, including the following two:

Metadata-vocabulary-management tool: manages the descriptive vocabularies for data in the metadata.
Metadata-model-management tool: manages the models of the metadata.

2) Data object management engine

A data service cannot access the database through the metadata file, which holds only abstract information about the database. The metadata file must therefore be converted into a data object class file, through which the data service can access the data resource. The data object management engine converts the metadata file into the data object class file, manages the data object class files and controls data access operations. Based on the data object class files, the system provides uniform data access.
Because data resources are accessed by operating on the data object class files, the data object management engine must also control access to the data resources and enforce security, which improves the security and reliability of data resource access. Two tools serve this purpose:

Access-management tool for data object class files: transforms access to the data object class files into access to the database.
Security-management tool for data object class files: ensures the security of access to the data object class files.

2.3. Data Service Management Engine

The data service management engine consists of the data service generation engine and the data service container. Using the data service generation tool, it encapsulates the pre-built general data service package together with the data object class files into a bioinformatics data service, and then publishes the data service to the data service container, so that clients can conveniently call the data service and obtain uniform data access. The data service container manages the published data services through an indexing service, which helps clients retrieve them.
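The metadata extraction and the mapping from metadata to a data object class described in Section 2.2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Bio-DSGS implementation: the paper does not name the DBMS or the programming language, so an in-memory SQLite database stands in for a bioinformatics database, and the function names `extract_metadata` and `build_object_class`, the metadata dictionary layout and the three-column sample table are all hypothetical.

```python
import sqlite3
from dataclasses import fields, make_dataclass

# Assumption: a small mapping from SQLite declared types to Python types;
# any type not listed falls back to str.
SQL_TO_PY = {"INTEGER": int, "TEXT": str, "REAL": float}


def extract_metadata(conn, table):
    """Read column names, declared types and primary-key flags for one table.

    The table name is interpolated into the statement, so it must come from
    a trusted source (here: the database's own catalogue).
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {
        "table": table,
        "elements": [
            {"name": r[1], "type": r[2].upper(), "primary_key": bool(r[5])}
            for r in rows
        ],
    }


def build_object_class(metadata):
    """Map a metadata record to a data object class (a Python dataclass)."""
    class_name = metadata["table"].title().replace("_", "")
    columns = [
        (e["name"], SQL_TO_PY.get(e["type"], str)) for e in metadata["elements"]
    ]
    return make_dataclass(class_name, columns)


# Hypothetical stand-in for one sheet of a bioinformatics database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mirna2disease ("
    "record_id INTEGER PRIMARY KEY, mirna TEXT, disease TEXT)"
)

metadata = extract_metadata(conn, "mirna2disease")
Mirna2Disease = build_object_class(metadata)
print([f.name for f in fields(Mirna2Disease)])
```

The generated class carries one typed attribute per extracted column, which is the property a uniform access layer needs; a fuller metadata file would also hold the descriptive vocabulary and access information mentioned above.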

3. System Workflow

Bio-DSGS converts bioinformatics databases into data services using a series of tools, which facilitates access to and sharing of bioinformatics database resources. The data service generation process and the client's access process through the data service are described below.

3.1. Data Service Generation Process

The data service generation process is shown in Figure 2. It consists of three steps:

1) Metadata extraction. The metadata-extraction tool constructs the metadata file.
2) Data object class information extraction. The class information-extraction tool maps the metadata to the data object class file.
3) Publication. The data service generation tool publishes the class files and the pre-built general data service package as the data service.

Fig.2 Data Service Generation Process

3.2. Database Access Process in the Client by the Data Service

As shown in Figure 3, the process consists of four steps:

1) The user submits a data access request to the data service through the data service client.
2) The data service transforms the data access request into a data object class access request.
3) The access-management tool for the data object class files converts the data object class access request into a database access request, which performs the database access.
4) The data service returns the result of the data access to the user.

Fig.3 Database Access Process in the Client

4. Example

In this section, we illustrate how a data service is constructed in the system and verify the reliability and validity of the generated data service with a database access performance test in the client.

4.1. miR2Disease Database

The miR2Disease database [11] is a manually curated database that aims to provide an integrated resource on microRNA deregulation in human diseases. It contains 1939 relationships between 299 microRNAs and more than 100 diseases, collected from more than 600 publications.

4.2. Generation Data Service Based on the miR2Disease

The data service for miR2Disease is generated in the following steps:

Step 1. The datasheets in the miR2Disease database are listed in Table 1. The metadata-extraction tool extracts metadata from the miR2Disease database covering the datasheets, the properties of the datasheets and the access information of the database; the metadata for the mirna2disease sheet is shown in Table 2. The metadata files are then generated.

Step 2. The class information-extraction tool converts the metadata files into a series of data object class files.

Step 3. The pre-built general data service package and the data object class files are packaged as the data service and then published. The miR2Disease data service is available at http://mlg.hit.edu.cn:8080/mir/services/mirservice?wsdl.

4.3. Database Access Performance Test in the Client

The system establishes a reusable layer of data services between the bioinformatics database and the application program. Users obtain data by accessing the data service instead of the database. Going through the data service slightly increases the data access time; in practice, however, network transmission and the database query take far longer. The only additional cost is the time the service needs to convert the user's request into a database access request. Compared with the network transmission delay and the database query time, this additional time is negligible and acceptable to users, so the service is feasible.

Here we take queries on the miR2Disease database as an example. Three quantities were measured separately: the switching interval, from the moment the system receives a request until the request has been converted into a database access request; the query time in the database; and the network transmission time for the user's request and the results. The results are shown in Table 3.

The test shows that the network transmission delay and the database query time depend on several factors, such as the size of the database, the complexity of the query statement and the size of the transferred data. Compared with these two quantities, the switching interval is so short that it can be ignored: the additional time introduced by the data service is much less than 1 s, which is acceptable to users. Using the data service therefore has little effect on database access performance, and the data service can meet users' requests.
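The separation between the switching interval and the query time can be illustrated with a small timing sketch. It is an assumption-laden stand-in rather than the authors' test harness: the request format (a dictionary with `table` and `filters`), the helper names `to_sql` and `timed_access`, the in-memory SQLite database and its made-up rows are all hypothetical, and network transmission is not measured because everything runs in one process.

```python
import sqlite3
import time


def to_sql(request):
    """Convert a data access request into a parameterized SQL query.

    Table and column names are assumed to come from trusted metadata;
    only the filter values are passed as bound parameters.
    """
    where = " AND ".join(f"{column} = ?" for column in request["filters"])
    sql = f"SELECT * FROM {request['table']}"
    if where:
        sql += f" WHERE {where}"
    return sql, list(request["filters"].values())


def timed_access(conn, request):
    """Run one request and time the conversion and the query separately."""
    t0 = time.perf_counter()
    sql, params = to_sql(request)                # switching interval
    t1 = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()  # database query time
    t2 = time.perf_counter()
    return rows, {"switching_s": t1 - t0, "query_s": t2 - t1}


# Made-up rows for illustration only; they are not miR2Disease records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mirna2disease (mirna TEXT, disease TEXT)")
conn.executemany(
    "INSERT INTO mirna2disease VALUES (?, ?)",
    [
        ("hsa-let-7g", "disease A"),
        ("hsa-let-7g", "disease B"),
        ("hsa-mir-125b-2", "disease C"),
    ],
)

request = {"table": "mirna2disease", "filters": {"mirna": "hsa-let-7g"}}
rows, timing = timed_access(conn, request)
print(len(rows), timing)
```

Timing the conversion on its own, as `timed_access` does, is what allows the overhead of the service layer to be reported separately from costs that would be paid even with direct database access.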

Table 1 Names and Contents in the miR2Disease Database Sheet

| Table Name | Content Description |
|---|---|
| disease_mapping | Stores all of the standard disease names and identifiers in Disease Ontology. |
| mirna2disease | Stores the correspondence between microRNAs and diseases. |
| reference_information | Stores the references for the correspondence between microRNAs and diseases. |
| tarbaselist | Stores the names of the microRNA target genes. |
| targetlist | Also stores the names of the microRNA target genes, but from a different source than tarbaselist. |

5. Conclusion

In conclusion, we have analyzed the problems of data access and sharing in bioinformatics databases, and designed and implemented a data service generation system for bioinformatics databases. The system establishes data services for geographically distributed and heterogeneous bioinformatics databases and thereby provides a unified way of accessing data. As the next step, we will construct a uniform ontology of data services in the field of bioinformatics and describe each data service semantically, to facilitate the integration of data services and, in turn, to provide a basis for integrating bioinformatics databases.
Table 2 Metadata Extracted from the Mirna2disease Database Sheet

| Data Element | Definition Range | Conceptual Domain | Remarks |
|---|---|---|---|
| id | Integer, positive, unlimited length | Identifier; corresponds to the primary key of the database sheet; no practical significance | Corresponds to the record_id field in the sheet |
| mirna | String, such as hsa-mir-125b-2; only the standard name of the microRNA, no alias or abbreviation; the name is divided into 3 or 4 sections linked by "-" | The standard name of the microRNA | Corresponds to the mirna field in the sheet |
| doid | Integer, positive, unlimited length | The disease id provided by Disease Ontology | Corresponds to the DOID field in the sheet |
| disease | String, lowercased English; an abbreviation may be appended, for example, follicular lymphoma (FL); length less than 255 | Name of the disease, using the generic name provided by Disease Ontology | Corresponds to the Disease field in the sheet |
| expression | String, enumerator, three values: normal, down-regulated, up-regulated | The amount of genes transcribed into microRNA | Corresponds to the Expression field in the sheet; the amount of transcription is 10 if normal, 5 if down-regulated, 15 if up-regulated |
| method | String, enumerator; values include microarray, Northern blot, qRT-PCR, etc. | The method for measuring the correspondence between miRNA and disease | Corresponds to the Method field in the sheet; microarray represents the microarray method, Northern blot represents RNA blotting, qRT-PCR represents reverse transcription quantitative PCR |
| description | String, unlimited length, unlimited format | Detailed description of the association between disease and microRNA | Corresponds to the Description field in the sheet |
| target | String, length less than 255; names of target genes are connected by ","; use "unknown" for none or uncertain | The names of the target genes corresponding to the microRNA | Corresponds to the target field in the sheet; the result comes from experiments by the user |
| tarbase | String, length less than 255; names of target genes are connected by ","; use "unknown" for none or uncertain | The names of the target genes corresponding to the microRNA | Corresponds to the tarbase field in the sheet; the result comes from references |
| years | Integer, four digits, years | Time | Corresponds to the Years field in the sheet; the year in which the reference was published |
| pmid | Integer, usually eight digits | The id of the reference | Corresponds to the PMID field in the sheet; references on the correspondence between disease and microRNA |
| relationship_type | Integer, enumerator, three values: 0, 1, 2 | The type of relationship between disease and microRNA | Corresponds to the relationship_type field in the sheet; 0 represents a relationship that has not been determined, 1 represents microRNA results in disease, 2 represents disease results in microRNA |
| causal | String, enumerator, two values: Unspecified and Causal | Whether there is a causal influence between miRNA and disease | Corresponds to the Causal field in the sheet; Unspecified represents an unspecified relationship, Causal represents a definite causal relationship |

Table 3 Performance Test Invoked by the Service

| Items in the Test | Switching Interval | Query Time | Time for Network Transmission |
|---|---|---|---|
| Obtain all information about hsa-let-7g in the mirna2disease table | < 1 ms | 578 ms | 2342 ms |
| Inquire the published date of a literature in the reference_information table | < 1 ms | 26 ms | 1325 ms |
| Search all literatures about the lung disease | < 1 ms | 130 ms | 2081 ms |
| Inquire the microRNA information associated with diseases in some literature | < 1 ms | 678 ms | 3109 ms |
| Obtain the first 200 records in the mirna2disease table | < 1 ms | 191 ms | 2207 ms |
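Some of the value constraints listed in Table 2 lend themselves to an automatic check. The sketch below is only one possible reading of those constraints: the function name `validate_record`, the dictionary record format and the sample record are hypothetical (the sample values are made up and not taken from miR2Disease), and only a subset of the data elements is checked.

```python
# Enumerated value sets as listed in Table 2.
EXPRESSION_VALUES = {"normal", "down-regulated", "up-regulated"}
RELATIONSHIP_TYPES = {0, 1, 2}
CAUSAL_VALUES = {"Unspecified", "Causal"}


def is_positive_int(value):
    return isinstance(value, int) and value > 0


def validate_record(record):
    """Return the names of the data elements that violate their constraints."""
    errors = []
    if not is_positive_int(record.get("id")):
        errors.append("id")
    # The standard microRNA name has 3 or 4 non-empty sections linked by "-".
    sections = str(record.get("mirna", "")).split("-")
    if len(sections) not in (3, 4) or not all(sections):
        errors.append("mirna")
    if not is_positive_int(record.get("doid")):
        errors.append("doid")
    disease = record.get("disease")
    if not isinstance(disease, str) or len(disease) >= 255:
        errors.append("disease")
    if record.get("expression") not in EXPRESSION_VALUES:
        errors.append("expression")
    years = record.get("years")
    if not isinstance(years, int) or not 1000 <= years <= 9999:
        errors.append("years")
    if record.get("relationship_type") not in RELATIONSHIP_TYPES:
        errors.append("relationship_type")
    if record.get("causal") not in CAUSAL_VALUES:
        errors.append("causal")
    return errors


# Made-up record for illustration; the values are not real database content.
good = {
    "id": 1,
    "mirna": "hsa-mir-125b-2",
    "doid": 1,
    "disease": "follicular lymphoma (FL)",
    "expression": "up-regulated",
    "years": 2008,
    "relationship_type": 0,
    "causal": "Unspecified",
}
bad = dict(good, mirna="mir125b", expression="high", relationship_type=3)

print(validate_record(good))
print(validate_record(bad))
```

Keeping the enumerations next to the checks mirrors the role of the metadata file: the same description that documents a field for users can also reject values outside its definition range.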

Acknowledgement

The research work is supported by the National Key Technology R&D Program (2008BAI64B03).

References

[1] M.Y. Galperin, G.R. Cochrane. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res, 37(Database issue):D1-D4, 2009.
[2] B.G. Cui. A Data Service Virtualization Mechanism for Dynamic Data Integration. Journal of Computational Information Systems, 4(2):665-670, 2008.
[3] P. Muschamp. An Introduction to Web Service. BT Technology Journal, 22(1):9-18, 2004.
[4] S. Kleijnen, S. Raju. An Open Web Services Architecture. Queue, 1(1):38-46, 2006.
[5] P.G. Kolaitis. Schema Mappings, Data Exchange, and Metadata Management. Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Baltimore, USA, pages 61-75, 2005.
[6] M. Ji, X. Xu, H.C. Zhu, Z. Li, S.S. Wang. Design and Realization of the Mining Multi-scale Spatial Database based on Metadata Operation. The 3rd IEEE International Conference on Computer Science and Information Technology. Chengdu, China, pages 175-179, 2010.
[7] T. Hubbard et al. The Ensembl Genome Database Project. Nucleic Acids Res, 30(1):38-41, 2002.
[8] A. Bairoch, R. Apweiler. The SWISS-PROT Protein Sequence Database and its Supplement TrEMBL in 2000. Nucleic Acids Res, 28(1):45-48, 2000.
[9] A. Stein, R.B. Russell, P. Aloy. 3did: Interacting Protein Domains of Known Three-dimensional Structure. Nucleic Acids Res, 33(suppl 1):D413-D417, 2005.
[10] S. Bird, G. Simons. Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources. Computers and the Humanities, 37(4):375-388, 2003.
[11] Q.H. Jiang, Y.D. Wang, Y.Y. Hao, L.R. Juan, M.X. Teng, X.J. Zhang, M.M. Li, G.H. Wang and Y.L. Liu. miR2Disease: a Manually Curated Database for microRNA Deregulation in Human Disease. Nucleic Acids Res, 37(Database issue):D98-D104, 2009.