ORNL/CP-948 10 COAF-Yi7/w6--d

GRAIL and GenQuest Sequence Annotation Tools*

Ying Xu, Manesh B. Shah, J. Ralph Einstein, Morey Parang, Jay Snoddy, Sergey Petrov, Victor Olman, Ge Zhang, Richard J. Mural, and Edward C. Uberbacher

Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480

"The submitted manuscript has been authorized by a contractor of the U.S. Government under contract No. DE-AC05-96OR22464. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes."

To be presented at the Sixth Department of Energy Contractor and Grantee Workshop, Santa Fe, NM, November 9-13, 1997.

*Research was supported by the Office of Health and Environmental Research, U.S. Department of Energy under contract No. DE-AC05-96OR22464 with Lockheed Martin Energy Research Corporation.

DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

GRAIL and GenQuest Sequence Annotation Tools

Ying Xu, Manesh B. Shah, J. Ralph Einstein, Morey Parang, Jay Snoddy, Sergey Petrov, Victor Olman, Ge Zhang, Richard J. Mural and Edward C. Uberbacher

Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-1060
Email: GRAILMAIL@ornl.gov

Our goal is to develop and implement an integrated intelligent system which can recognize biologically significant features in DNA sequence and provide insight into the organization and function of regions of genomic DNA. GRAIL is a modular expert system which facilitates the recognition of gene features and provides an environment for the construction of sequence annotation. The last several years have seen a rapid evolution of the technology for analyzing genomic DNA sequences. The current GRAIL systems (including the e-mail, XGRAIL, JAVAGRAIL and GenQuest systems) are perhaps the most widely used, comprehensive, and user-friendly systems available for computational characterization of genomic DNA sequence. In the past 2 years of the project we have:

• Developed improved systems for recognition of exons, splice junctions, promoter elements and other features of biological importance, including greater sensitivity for exon prediction (especially in AT-rich regions) and robust indel error detection capability.
• Developed improved and more efficient algorithms for constructing models of the spliced mRNA products of human genes.
• Developed and implemented methods for the analysis and visualization of sequence features including poly-A addition sites, potential Pol II promoters, CpG islands and repetitive DNA elements.
• Designed and implemented new methods for detecting potential sequence errors which can be used to "correct" frameshifts, add quality assurance to sequencing operations, and better detect coding regions in low-pass sequences such as ESTs.
• Developed systems for a number of model organisms including mouse, Escherichia coli, Drosophila melanogaster, Arabidopsis thaliana, Saccharomyces cerevisiae and a number of microbial genomes.
• Implemented methods for the incorporation of protein, EST, and mRNA sequence evidence in the multiple gene modeling process.
• Constructed a powerful and intuitive graphical user interface and client-server architecture which supports Unix workstations and JAVA Web-based access from many platforms.
• Improved algorithms and infrastructure in the GenQuest server, allowing characterization of newly obtained sequences by homology-based methods using a number of protein, DNA, and motif databases and comparison methods such as FASTA, BLAST, parallel Smith-Waterman, and algorithms which consider potential frameshifts during sequence comparison. An improved "batch" GRAIL client allows users to analyze groups of short (300-400 bp) sequences for coding character (with frameshift compensation options) and automates database searches of translations of putative coding regions.
• Provided support for GRAIL use in more than a thousand laboratories and at a rate of over 4000 analysis requests per month.

The imminent wealth of genomic sequence data will present significant new challenges for sequence analysis systems. Our vision for the future entails incorporation of a more sophisticated view of biology into the GRAIL system. Computational systems for genome analysis have thus far focused on generic or textbook-like examples of single isolated genes which can be described fairly simply using the most usual assumptions, and fall far short of the intelligence necessary to interpret complex multiple gene domains. In its next phase the GRAIL project will involve the development of new pattern recognition methods and modeling algorithms for DNA sequence, expert systems for interpretation using experimental evidence and comparative genomics, and interoperation with other tools and databases.
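
The accomplishments above name Smith-Waterman among the comparison methods offered by the GenQuest server. As an illustration only, the following is a minimal, serial sketch of Smith-Waterman local alignment in Python. It is not the GenQuest implementation, which is described as parallel; the function name `smith_waterman`, the linear gap penalty, and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for readability.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not GenQuest settings.
    A linear gap penalty is used to keep the sketch short.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, x, y = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")
    print(score)  # 14
    print(x)      # ACGTACG
    print(y)      # ACGTACG
```

The table fill costs time proportional to the product of the two sequence lengths, which is one reason database-scale searches with this method are typically parallelized or replaced by heuristics such as FASTA and BLAST.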
More specifically we will focus on several development areas: (1) improved accuracy of feature recognition and greatly increased tolerance to sequencing errors, (2) development of technology to describe the structure and regulation of large, complex genomic regions containing multiple genes, (3) automated and interactive methods for the incorporation of experimental evidence such as ESTs, mRNAs, and protein sequence homologs in multi-gene domains (GRAIL-EXP), (4) more comprehensive feature recognition and increased biological sophistication in the areas of expression and regulation, (5) capabilities for direct comparison of genomes, (6) a comprehensive suite of microbial genome analysis systems, (7) infrastructure for use of high-performance computing systems and specialized hardware to facilitate analysis and annotation of large volumes of sequence data, (8) improved interoperation with other tools, databases and methods for integrating information from multiple sources, particularly within the Genome Annotation Consortium framework and Genome Channel, and (9) continued community and user support, technology transfer, and educational outreach.
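
Area (1) above, tolerance to sequencing errors, matters because a single inserted or deleted base shifts the reading frame of everything downstream. The Python sketch below shows only that effect, using the standard genetic code; it is not GRAIL's error-detection method, and the names `translate` and `CODON_TABLE` and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq, frame=0):
    """Translate a DNA string in the given forward frame (0, 1 or 2).

    Incomplete trailing codons are dropped; codons containing
    characters other than T, C, A, G are reported as 'X'.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    original = "ATGGCCAAATAA"
    # Simulate a sequencing error: drop the base at index 3 (a 'G').
    with_deletion = original[:3] + original[4:]

    print(translate(original))          # MAK*
    print(translate(with_deletion))     # MPN
    print(translate(with_deletion, 2))  # AK*
```

After the deletion, the downstream peptide no longer appears in frame 0 but reappears in frame 2, so a method that tolerates such errors has to follow the coding signal across frames rather than score each frame in isolation.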

These developments will enable GRAIL to become more comprehensive and biologically sophisticated, and yet remain a user-friendly analysis environment which can be used interactively or in fully automated modes. GRAIL and GenQuest related tools are available as a Motif graphical client (anonymous ftp from grail.lsd.ornl.gov (134.167.140.9)), through WWW interfaces (URL http://www.compbio.lsd.ornl.gov/), or by email server (GRAIL at grail@ornl.gov and GenQuest at Q@ornl.gov). Communications with the GRAIL staff should be addressed to GRAILMAIL@ornl.gov. (Supported by the Office of Health and Environmental Research, United States Department of Energy, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.)
