Bioinformatics & Protein Database Concepts. Learning Objective. Proteomics Bioinformatics and Protein Database Concepts

With the emergence of high-throughput techniques for generating protein sequences, computational tools are required for storing, sharing, analyzing and updating these data. Databases and their associated features provide the tools for storing biological data in a meaningful way.

Learning Objective: In this Learning Object, the learner will be able to recall the procedures involved in wet lab and Bioinformatics, and recall

From wet lab to Bioinformatics

The cells present in the tissue culture are lysed, releasing a crude extract. This extract is centrifuged to separate the protein mixture from the cell debris. The supernatant obtained is a mixture of proteins with a variety of properties. The protein of interest must then be isolated from this mixture.

The protein of interest is separated from the protein mixture present in the supernatant. This is carried out by suitable techniques such as chromatography or electrophoresis, which exploit properties of the proteins such as their charge and mass.

Edman degradation employs the reagent phenyl isothiocyanate, which reacts with the amino-terminal residue of the peptide to give the phenylthiocarbamoyl derivative of that amino acid residue. Under acidic conditions, the derivatized residue is cleaved from the peptide as a cyclic derivative and released in the form of a PTH-amino acid, which can then be identified by chromatographic techniques. The procedure is then repeated to identify each N-terminal amino acid sequentially.

The mass spectrometer is an instrument that produces charged molecular species in vacuum, separates them by means of electric and magnetic fields, and measures the mass-to-charge ratios and relative abundances of the ions thus produced. A tandem mass spectrometer uses a combination of two mass analyzers, separated by a collision cell, in order to provide improved resolution of the fragment ions. The first mass analyzer usually operates in a scanning mode in order to select only a particular peptide ion, which is then fragmented and resolved in the second analyzer. This can be used for protein sequencing studies.
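To make the mass-to-charge ratio concrete, here is a minimal sketch of how the m/z of a peptide ion could be computed. It assumes an unmodified peptide that is ionized by protonation only, and it uses the commonly tabulated monoisotopic residue masses; the function names are illustrative and are not taken from any mass spectrometry package.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids
# (commonly tabulated values; a residue is the amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # one water is added back for the free N- and C-termini
PROTON = 1.007276   # mass of the proton that carries each positive charge


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide (assumption: no PTMs)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge
```

For example, `mz("GA", 1)` and `mz("GA", 2)` give the singly and doubly protonated values for the same dipeptide, which shows why one peptide can appear at several positions in a spectrum.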

All data related to a protein can be divided into four broad categories, namely Sequence details, Source, Gene details and References. Sequence details contain the features of a protein's amino acid sequence such as the length, location, patterns and identifiers of the protein sequence. Source contains information about the biological source from which the protein was obtained. Gene contains details of the gene from which the protein is expressed. Reference contains the details of the research publication in which the study was reported.

Database design is done at various levels, namely Physical, Logical and View. At the physical level, we define the purpose of the database in accordance with its prospected usage. At the logical level, we define the tables, the attributes of the tables and the relationships between tables. The logical level is the most complex and important schema for a database and requires a thorough understanding of the data, its context and its relationships. At the View level, we define the views and appearance of the database.
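The logical and view levels described above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table, column and view names are hypothetical, the four tables simply mirror the four data categories (sequence details, source, gene, references), and the sequence stored is a short placeholder string rather than a full protein.

```python
import sqlite3

# Logical level: tables, their attributes and the relationships between them.
# View level: a read-only summary presented to the user.
# All names below are hypothetical and chosen only for illustration.
SCHEMA = """
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    organism  TEXT NOT NULL
);
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    gene_name TEXT NOT NULL
);
CREATE TABLE protein (
    protein_id TEXT PRIMARY KEY,
    name       TEXT NOT NULL,
    sequence   TEXT NOT NULL,
    source_id  INTEGER REFERENCES source(source_id),
    gene_id    INTEGER REFERENCES gene(gene_id)
);
CREATE TABLE publication (
    publication_id INTEGER PRIMARY KEY,
    protein_id     TEXT REFERENCES protein(protein_id),
    citation       TEXT NOT NULL
);
CREATE VIEW protein_summary AS
SELECT p.protein_id, p.name, LENGTH(p.sequence) AS length,
       s.organism, g.gene_name
FROM protein p
JOIN source s ON s.source_id = p.source_id
JOIN gene g ON g.gene_id = p.gene_id;
"""


def build_demo_db():
    """Create the schema in memory and load one illustrative record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO source VALUES (1, 'Homo sapiens')")
    conn.execute("INSERT INTO gene VALUES (1, 'ALB')")
    # 'DEMO_1' and the six-letter sequence are placeholders, not real entries.
    conn.execute(
        "INSERT INTO protein VALUES ('DEMO_1', 'Serum albumin', 'MKWVTF', 1, 1)"
    )
    conn.commit()
    return conn
```

Querying `protein_summary` instead of the underlying tables keeps the user-facing view independent of how the tables are organized, which is the point of separating the levels.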

A typical biological database can be characterized by its Type and its Tools. The Type defines the category of data that it includes, such as sequences, domains or structures. This implies that the database's most prominent feature is either sequences, domains or structures, and that it will primarily be used for their analysis. The Tools define the analysis platforms that the site provides for gaining insight into the protein data.

To extract protein information from a database, users can supply a variety of input terms. These can be a unique ID, molecular name, amino-acid sequence, keyword, literature, gene or taxonomy.

Once the user submits the query, the output can be in multiple formats. The generalized information that users can obtain from protein databases includes: a general description of the protein molecule; annotations of the protein; the name and description of the gene that encodes it; the ID of the same protein in other relevant databases; details of the experiments conducted to characterize the protein; details of the protein's secondary structure; details of the organism used as the source of the protein; citations of the research conducted on the protein; and patterns occurring within the sequence and their analysis.

This slide shows the different kinds of analysis that can be conducted on a given protein sequence. The query can be the protein name, sequence or any other identifier of the protein. In this example, we provide the protein sequence as input. Once the query protein sequence is entered into the analysis tool, it can give various kinds of results, such as: identification of the protein from its sequence; physico-chemical properties such as chemical formula, half-life, isoelectric point and molecular weight; aligned sequences and structures; variable and conserved residues; predicted secondary and tertiary structures; and synonyms and scientific terminology for the protein.

We explain the usage of protein databases using the example of the Human Serum Albumin protein. If you want to view a specific step in the case study, click on the relevant panel. Otherwise, click on View Full Animation.

Open a web browser and go to http://expasy.org/sprot/. In the top right corner of the page, there is a search box. Click on the drop-down arrow next to the search box to get a list of databases to search. Select UniProtKB. Type the name of the protein of your choice (e.g. Serum Albumin) in the text box after the word 'for'.

The results page for the search shows 179 hits for our query; the count is shown at the top of the page. The first 25 hits are listed on the first page and can be viewed by scrolling down. Click on the entry of your choice. Here we click on the human Albumin hit (ALBU_HUMAN).

The top of the result page looks like this. Scroll down the page to find the heading Sequences. Click on the FASTA tab next to the sequence of interest. The FASTA sequence opens in a new tab. Save this FASTA sequence on your computer.

Once the FASTA sequence is retrieved, we can subject it to a variety of protein analysis tools, which are broadly classified into Sequence Similarity Search tools, Primary Structural Analysis tools, Phylogenetic Analysis tools, Molecular Modeling and Visualisation tools, and Structure Prediction tools. Here we explore the web-based service called ProtParam, which belongs to the Primary Structural Analysis tools. To explore other such services, users can visit http://expasy.org/sprot/

The front-end of the tool asks you to input either the accession ID of the protein under study or the sequence of that protein. Delete the first line (the descriptive line) from your FASTA sequence, so that only the amino acid sequence remains. Click on Compute Parameters. On the results page, scroll down to find the various physico-chemical parameters of this protein.
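The step of deleting the descriptive line can also be done programmatically. The following is a minimal sketch, assuming a single-record FASTA file in which the descriptive line starts with `>`; the function name and the example header are hypothetical.

```python
def strip_fasta_header(fasta_text):
    """Return only the amino acid sequence from a single-record FASTA string.

    Lines starting with '>' are descriptive lines and are dropped; the
    remaining lines are joined into one continuous sequence.
    """
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    sequence_lines = [line for line in lines if line and not line.startswith(">")]
    return "".join(sequence_lines)


# Usage with a made-up record (the header and sequence are placeholders):
example = ">demo|placeholder header\nMKWV\nTFIS\n"
print(strip_fasta_header(example))
```

The resulting string is what would be pasted into the sequence box of the tool.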

This part of the results gives the percentage of each amino acid in the sequence. The highlighted region indicates the CSV file link. CSV stands for Comma Separated Values, a plain-text format that can be opened in text editors as well as spreadsheet programs such as Microsoft Excel. The file can be downloaded in its comma-separated format by clicking on the link.
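As an illustration of what such a composition table contains, here is a small sketch that computes the percentage of each amino acid and writes it out as CSV text. It is not ProtParam's own code; the function names and the column headers `amino_acid` and `percent` are assumptions made for this example.

```python
import csv
import io
from collections import Counter


def composition_percent(sequence):
    """Percentage of each amino acid present in the sequence, sorted by letter."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


def composition_csv(sequence):
    """Render the composition as comma-separated text, one amino acid per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["amino_acid", "percent"])  # hypothetical column names
    for aa, pct in composition_percent(sequence).items():
        writer.writerow([aa, f"{pct:.1f}"])
    return buffer.getvalue()
```

Saving the returned text to a file with a `.csv` extension gives a file that opens in a spreadsheet program with one column per field.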

Other information that can be obtained includes the chemical formula of the protein, the total number of atoms present in the protein, the total number of negatively and positively charged residues, the estimated half-life of the protein (i.e. the time it takes for half of the protein to be degraded), and the grand average of hydropathicity (GRAVY), which gives an insight into the solubility of the protein. Hydrophobic molecules exhibit a positive GRAVY value, while hydrophilic molecules show a negative GRAVY value.
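The GRAVY value and the charged-residue counts are simple to reproduce. The sketch below assumes the Kyte-Doolittle hydropathy scale, with GRAVY taken as the sum of the hydropathy values divided by the sequence length, and counts Asp and Glu as negatively charged and Arg and Lys as positively charged. The function names are illustrative only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(HYDROPATHY[aa] for aa in sequence) / len(sequence)


def charged_residue_counts(sequence):
    """Return (negative, positive) counts, assuming Asp+Glu and Arg+Lys."""
    negative = sum(sequence.count(aa) for aa in "DE")
    positive = sum(sequence.count(aa) for aa in "RK")
    return negative, positive
```

A stretch rich in isoleucine and valine gives a positive value, while one rich in lysine and arginine gives a negative value, matching the rule of thumb stated above.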

Go to http://expasy.org/prosite/. Enter the FASTA sequence obtained in the previous steps into the input box of the server. Click on Scan.

The results page lists the matching profiles, each with a score that reflects how probable the match is. Select the hit with the highest score.

The result displays the position of the Albumin domain, highlighted in the sequence from position 210 to 402. It also displays a graphical view in the form of a downloadable PNG image, in which the profile hits are represented as colored shapes labelled with their PROSITE name. It then displays the structure of the Albumin domain, marking the cysteine residues involved in disulphide bonds as C and its signature pattern as *.
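Besides profiles, PROSITE also describes some signatures as short patterns. As a rough sketch of how a pattern scan reports positions, the code below converts the basic PROSITE pattern syntax into a regular expression and lists 1-based match positions. The pattern and sequence used with it are made up for illustration and are not the albumin signature; profile matching itself relies on weight matrices and is not reproduced here.

```python
import re

# Character-by-character translation of basic PROSITE pattern syntax:
#   '-' separates elements, 'x' is any residue, '[..]' allowed residues,
#   '{..}' excluded residues, '(n)' or '(n,m)' repetition,
#   '<' and '>' anchor the pattern to the N- and C-terminus.
_TRANSLATION = {
    "-": "", "x": ".", "{": "[^", "}": "]",
    "(": "{", ")": "}", "<": "^", ">": "$",
}


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a Python regex string."""
    pattern = pattern.rstrip(".")  # patterns are often written with a final period
    return "".join(_TRANSLATION.get(ch, ch) for ch in pattern)


def find_hits(pattern, sequence):
    """Return (start, end) positions of matches, 1-based and inclusive."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence)]
```

For instance, `find_hits("C-x(2)-[ST]-{P}", "AACGGSAK")` scans a toy sequence for a cysteine, any two residues, a serine or threonine, and then any residue other than proline.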

Once the user enters Serum Albumin in the PDB search box, in the output page of the selected PDB entry, we find the following tabs. The horizontal tabs summarize the entire result page. The vertical tabs occur as the initial description in the first page. Each of these tabs can be explored in detail. The structural analysis of the protein can display a wide range of properties such as the description of the protein molecule including classification of the protein, the chains it contains, number of amino acids, etc.

The display also shows entries that are closely related to the user's query, such as the same protein characterized from a different organism.

Protein molecules are often structurally characterized in complex with a ligand, with the structure determined by experimental techniques. The description of these ligands is given in the result summary of the query protein.

The result summary displays derived data for Serum Albumin, such as the molecular and biological functions in which the protein is involved.

The biological aspects of Serum Albumin are also displayed as results. The unique feature of this tab is that it gives a complete list of Single Nucleotide Polymorphisms (SNPs) in the protein sequence. This shows the changes in amino acids as well as the locations of the SNPs and the SNP IDs.

The 3-D visualization of Serum Albumin is given as a part of the results which can be viewed from a tool called Jmol. Along with the image analysis from Jmol, users can also study and download the structural characteristics of the protein such as its Bond Length along with the place and frequency of its occurrence. Structural results also summarize the Bond Angle and the Dihedral Angles including the chain where they occur and the frequency of its occurrence.

From wet lab to bioinformatics
1. Protein: A protein is a biomolecule made of chains of amino acid residues. The chains are formed by joining amino acids through peptide bonds, each formed with the elimination of a water molecule. Proteins perform the structural, functional and regulatory functions of the cell.
2. Peptide: Small protein fragments formed by a stretch of up to around 50 amino acids are called peptides.
3. Amino acid sequence: The order of amino acids and their linear arrangement is known as the amino acid sequence. It is also known as the primary structure of the protein.
4. Edman degradation: This is a chemical method for sequencing amino acid residues in a protein or a peptide. The N-terminal residue is labelled using phenyl isothiocyanate and then cleaved from the remaining peptide chain without disrupting any of the other peptide bonds. This labelled amino acid is then detected, and the procedure is repeated to identify each N-terminal amino acid sequentially.
5. Mass spectrometry: A technique for the production and detection of charged molecular species in vacuum, after their separation by magnetic and electric fields based on their mass-to-charge (m/z) ratio.

1. Type of data: The data stored in biological databases can be of various types, such as pure sequences, sequences with structure, metadata about the source of the sequence, experimental details, etc.
2. Prospected Usage: The databases are primarily used to store all the information in a single web-based resource. They also provide analysis tools for various sequence analysis functions such as pair-wise sequence alignment, multiple sequence alignment, homology modelling, etc.
3. Database schema: The design of the database at various levels is called a database schema. It includes the attributes of all individual tables and the relationships between them. The schema is defined at three levels, namely Physical, Logical and View.
4. Primary Database: In biological database studies, primary databases store only the protein sequence information.