Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004
Transcription
1 Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004
2 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2
3 Genomic & Proteomic Data Sequence alignment was one of the first bioinformatics techniques (~1970) and pre-dates high-throughput sequencing techniques. [Figure: growth of proteomic data and GenBank over the years, in petabytes]
4 Motivation Structure Prediction Challenge 4
5 Guiding Principle The basic guiding principle is EVOLUTION. Regions of a DNA/protein sequence that are functional units are less tolerant of mutation, so they change less over time. Examples: Myoglobin vs Hemoglobin; Zinc Finger
6 Outline
- Pairwise Sequence Alignment: Dynamic Programming, Heuristic, Bayesian
- Multiple Sequence Alignment: Heuristic, Statistics (Gibbs Sampling)
- Conclusions
7 Pairwise Sequence Alignment One of the most commonly performed tasks in bioinformatics Method to compare two sequences and make inferences on the relationship between them Homologs = Two molecules that share a common ancestor 7
8 Searching for Homology
Query (unknown function/structure):
>d1npx_ ( ) NADH peroxidase
IPGKDLDNIYLMRGRQWAIKLKQKTVDPEVNNVVVIGSGYIGIEAAEAFAKAGKKVTVID
ILDRPlGVYLDKEFTDVLTEEMEANNITIATGETVERYEGDGRVQKVVTDKNAYDADLVV
VAV
Target (known function/structure):
>d3lada ( ) Dihydrolipoamide dehydrogenase
PAPVDQDVIVDSTGALDFQNVPGKLGVIGAGVIGLELGSVWARLGAEVTVLEAMDKFLPA
VDEQVAKEAQKILTKQGLKILLGARVTGTEVKNKQVTVKFVDAEGEKSQAFDKLIVAVG
9 Pairwise Sequence Alignment
One-to-one correspondence between the residues of two sequences:
$R^{(1)} = \{R^{(1)}_1, \ldots, R^{(1)}_I\} = \{\text{HEAGAWGHE}\}$
$R^{(2)} = \{R^{(2)}_1, \ldots, R^{(2)}_J\} = \{\text{GHEE}\}$
Global: HEAGAWGHE aligned with GHEE. Local: GHE aligned with GHE.
Almost all alignment is done at the local level.
10 Why Sequence Alignment Algorithms? There are a huge number of alignments length ,000 Alignments length billion Exponential < O( n 2 ) n is the length of the longest sequence 10
11 Evolution (mutation) Coded for at DNA level Captured in 2 parameters Scoring matrices Gap penalties 11
12 Scoring Matrices Characterize the probability that one residue was substituted for another (log odds-ratio). [Figure: BLOSUM substitution matrix; rows and columns are the 20 amino acids A R N D C Q E G H I L K M F P S T W Y V]
13 Gap Penalties
Characterize the probability that a residue was inserted or deleted.
Linear: $\gamma(g) = -g d$
Affine: $\gamma(g) = -d - (g - 1) e$
where $g$ is the gap length, $d$ = gap opening penalty, and $e$ = gap extension penalty.
14 Typical Output
Sequence alignment: one marker means that two residues are identical; ":" means that two residues have similar physicochemical properties.
Typical objective function, with gap penalties taken as positive magnitudes:
$\text{Score} = \sum_{\text{start}}^{\text{end}} \text{ScoreMatrix} - \sum_{\text{start}}^{\text{end}} \text{GapPenalties}$
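The two gap models can be sketched in a few lines of Python. This is only an illustration of the formulas on this slide, written with the usual sign convention that a gap contributes a negative score; the parameter values in the loop (d = 8, and d = 12 with e = 2) are assumptions chosen for the example, not values from the slides.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear gap score: gamma(g) = -g * d, for a gap of length g."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: gamma(g) = -d - (g - 1) * e.

    d is the gap opening penalty, e the gap extension penalty.
    """
    return -d - (g - 1) * e


# Illustrative (hypothetical) parameter values.
for g in (1, 2, 5):
    print(g, linear_gap(g, 8), affine_gap(g, 12, 2))
```

With an affine penalty, one long gap is cheaper than several short ones of the same total length, which is the usual reason for preferring it.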
15 Pairwise Sequence Alignment Algorithms
- Exhaustive (Dynamic Programming): Needleman-Wunsch (1970), Smith-Waterman (1981)
- Approximate (Heuristic Methods): BLAST (PSI-BLAST) (1990, 1997), FASTA (1998)
- Statistical: Bayes Block Aligner (1998), BALSA (2002)
16 Dynamic Programming Optimization method that uses sequential decisions to solve the problem Global Needleman-Wunsch Local Smith-Waterman Almost all alignment is done at the local level 16
17 Dynamic Programming Algorithms
The optimal alignment, $A^*$, is found by fixing the scoring matrix, $\Theta_0$, and the gap penalties, $\Lambda_0$, and maximizing the log-likelihood:
$\log P(R^{(1)}, R^{(2)}, A^* \mid \Theta_0, \Lambda_0) = \max_A \{ \log P(R^{(1)}, R^{(2)} \mid A, \Theta_0) + \log P(A \mid \Lambda_0) \}$
Scoring matrix: $s(R^{(1)}_i, R^{(2)}_j)$
Gap penalties: $d$ = gap opening penalty, $e$ = gap extension penalty
Alignment: $A_{i,j} = 1$ if $R^{(1)}_i$ is aligned with $R^{(2)}_j$, and $0$ otherwise.
18 Standard Dynamic Programming Choices
Match: $R^{(1)}_i$ aligned with $R^{(2)}_j$
Insertion into Sequence 1: $R^{(1)}_i$ aligned with a gap (-)
Deletion from Sequence 1: a gap (-) aligned with $R^{(2)}_j$
19 Algorithm
Smith-Waterman with a simple gap penalty $d$ and scoring matrix $s$:
$F(i, j) = \max \{ F(i-1, j-1) + s(R^{(1)}_i, R^{(2)}_j),\; F(i-1, j) - d,\; F(i, j-1) - d,\; 0 \}$
[Figure: worked score matrix for the sequences GCG and GAG, with cell contributions such as (+1), (-1) and (0)]
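The three choices above (match, gap in one sequence, gap in the other) are all that a global dynamic programming alignment needs. The sketch below computes a Needleman-Wunsch score with a linear gap penalty; the function name, the +1/-1 scoring function and d = 1 are assumptions made for illustration, where a real run would use a substitution matrix such as BLOSUM.

```python
def needleman_wunsch_score(r1, r2, s, d):
    """Return the optimal global alignment score of r1 and r2.

    s(a, b) scores a pair of residues; d is a linear gap penalty.
    """
    rows, cols = len(r1) + 1, len(r2) + 1
    F = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = -i * d
    for j in range(1, cols):
        F[0][j] = -j * d
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + s(r1[i - 1], r2[j - 1]),  # match/mismatch
                F[i - 1][j] - d,  # residue of r1 against a gap
                F[i][j - 1] - d,  # residue of r2 against a gap
            )
    return F[-1][-1]


def simple_score(a, b):
    """Toy scoring function (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch_score("GCG", "GAG", simple_score, 1))
```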
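As a sketch of the local recursion on this slide, the code below fills the Smith-Waterman matrix F with a simple (linear) gap penalty. The +1/-1 scoring function and d = 1 are assumptions for the example; the slide's own gap penalty value is not recoverable from the transcript.

```python
def smith_waterman_matrix(r1, r2, s, d):
    """Fill the Smith-Waterman matrix F for sequences r1 and r2.

    F[i][j] is the best score of a local alignment ending at
    r1[i-1] and r2[j-1]; the 0 option lets an alignment restart.
    """
    rows, cols = len(r1) + 1, len(r2) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + s(r1[i - 1], r2[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F


def simple_score(a, b):
    """Toy scoring function (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


for row in smith_waterman_matrix("GCG", "GAG", simple_score, 1):
    print(row)
```

The largest entry in F is the score of the best local alignment; for the sequences HEAGAWGHE and GHEE used earlier, it corresponds to the shared GHE segment.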
20 Reconstructing the Alignment 20
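Reconstructing the alignment means walking back from the highest-scoring cell until a zero is reached, at each step following the choice that produced the cell's value. The sketch below is one way to do this; it assumes integer scores (so equality tests are exact), a toy +1/-1 scoring function and d = 1, none of which come from the slides.

```python
def smith_waterman(r1, r2, s, d):
    """Return (score, aligned_r1, aligned_r2) for the best local alignment."""
    rows, cols = len(r1) + 1, len(r2) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + s(r1[i - 1], r2[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback: start at the best cell, stop at the first zero.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(r1[i - 1], r2[j - 1]):
            a1.append(r1[i - 1])
            a2.append(r2[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            a1.append(r1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(r2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


def simple_score(a, b):
    """Toy scoring function (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


print(smith_waterman("HEAGAWGHE", "GHEE", simple_score, 1))
```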
21 Smith-Waterman Availability affrc.go.jp/htdocs/swsrch/ ac.uk/mpsrch/ tware/seqaln/seqaln-query.html
22 Results from SWsrch with hemoglobin Title: >hemoglobin mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp Amhasldkflasvstvltskyr Perfect Score: 750 Sequence: 1 MVLSADDKTNIKNCWGKIGG...HASLDKFLASVSTVLTSKYR 142 SUMMARIES % Result Query No. Score Match Length DB ID Description HART1(I54239;I68531;A26903;A93047;A90284;A90285;A02268) hemoglobin alpha-1 chain - rat &RATHBAM_1(M17083 pid: (P01946) Hemoglobin alpha-1 and alpha-2 c RN2A1GL_1(X56325 pid:g ) R.norvegicus AK003077_1(AK pid:none) Mus musculus a HAMS(A90791;I49720;A45964;I49722;I49721;B43560;A92945;) hemoglobin alpha chains - mouse &MMAGL1_1(V00714 pid: AK011076_1(AK pid:none) Mus musculus (P01942) Hemoglobin alpha chain (P20854) Hemoglobin alpha chain. &HARTNG L75940_1(L75940 pid:none) Mus musculus alpha AK010422_1(AK pid:none) Mus musculus E HASL1W(S10481)hemoglobin alpha-i chain - Wed (P01938) Hemoglobin alpha chain. &HALRN( (P09420) Hemoglobin alpha chain. &A (P01930) Hemoglobin alpha chain. &HAMQB( (P18969) Hemoglobin alpha chain. &HAFQL( (P01974) Hemoglobin alpha chain. &HACMA( HAMN2F(S11533)hemoglobin alpha-ii chain - do (P01945) Hemoglobin alpha chain. &HAHY(A (P15163) Hemoglobin alpha-i and alpha-ii (P01928) Hemoglobin alpha chain. &HAMQA( 22
23 Results from SWsrch with hemoglobin cont.
RESULT 9
>L75940_1(L75940 pid:none) Mus musculus alpha-globin mrna, complete cds. &MUSALGL_1(L75940 pid:g )
Query Match 83.1%; Score 623; DB 1; Length 142;
Best Local Similarity 83.8%; Pred. No. 1.80e-63;
Matches 119; Conservative 7; Mismatches 16; Indels 0; Gaps 0;
Inserts 0; InsGaps 0; Deletes 0; DelGaps 0;
Db 1 mvlsgedksnikaawgkigghgaeyvaealermfasfpttktyfphfdvshgsaqvkghg
Qy 1 mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg 60
Db 61 kkvadalasaaghlddlpgalsalsdlhahklrvdpvnfkllshcllvtlashhpadftp
Qy 61 kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp 120
Db 121 avhasldkflasvstvltskyr
Qy 121 amhasldkflasvstvltskyr
24 BLAST (Basic Local Alignment Search Tool)
- Indexes the database
- Calculates the neighborhood of each word in the query using the scoring matrix and a probability threshold
- Looks up all words and neighbors from the query in the database index
- Extends High-scoring Segment Pairs (HSPs) left and right to maximal length
- Finds maximal segment pairs (MSPs) between query and database
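The steps above can be sketched as a toy seed-and-extend search. This is not the BLAST implementation: the DNA alphabet, the +1/-1 scoring function, the word length, the neighborhood threshold and the drop-off rule are all simplifying assumptions, and every function name is hypothetical.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"  # toy alphabet (assumption)


def score(a, b):
    """Toy scoring function (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def index_database(db, w):
    """Map every length-w word to the (sequence name, position) pairs where it occurs."""
    idx = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - w + 1):
            idx[seq[pos:pos + w]].append((name, pos))
    return idx


def neighborhood(word, threshold):
    """All words scoring at least threshold against word."""
    return ["".join(c) for c in product(ALPHABET, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, c)) >= threshold]


def find_seeds(query, db, w=3, threshold=1):
    """Look up each query word and its neighbors in the database index."""
    idx = index_database(db, w)
    seeds = set()
    for qpos in range(len(query) - w + 1):
        for nb in neighborhood(query[qpos:qpos + w], threshold):
            for name, dpos in idx.get(nb, ()):
                seeds.add((name, qpos, dpos))
    return sorted(seeds)


def _extend(a, b, drop):
    """Ungapped extension; stop once the score falls more than drop below the best."""
    best = run = best_len = 0
    for k, (x, y) in enumerate(zip(a, b), start=1):
        run += score(x, y)
        if run > best:
            best, best_len = run, k
        elif best - run > drop:
            break
    return best, best_len


def extend_seed(query, target, qpos, dpos, w, drop=2):
    """Extend a seed left and right; return (score, query_start, query_end)."""
    seed = sum(score(a, b) for a, b in zip(query[qpos:qpos + w], target[dpos:dpos + w]))
    right, rlen = _extend(query[qpos + w:], target[dpos + w:], drop)
    left, llen = _extend(query[:qpos][::-1], target[:dpos][::-1], drop)
    return seed + left + right, qpos - llen, qpos + w + rlen


db = {"t1": "TTACGTACGTT"}
query = "ACGTACG"
print(find_seeds(query, db, w=3, threshold=3))
print(extend_seed(query, db["t1"], 0, 2, w=3))
```

Enumerating every possible word to build the neighborhood is only practical for short words and small alphabets, which is acceptable for a sketch.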
25 BLAST database search 25
26 PSI-BLAST (Position specific iterative BLAST) A profile (position specific scoring matrix, PSSM) is constructed from a multiple alignment of the highest scoring hits in a BLAST search The PSSM is generated by calculating position-specific scores for each position in the alignment. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity. 26
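To make the profile idea concrete, the sketch below builds a position-specific scoring matrix from a small ungapped alignment as log-odds scores with pseudocounts. It is an illustration only: PSI-BLAST's actual weighting and pseudocount scheme is more elaborate, and the uniform background, the pseudocount value and the function names are assumptions.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    alignment is a list of equal-length, ungapped sequences.
    """
    n = len(alignment)
    bg = background or {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for a in alphabet:
            # Pseudocounts keep unseen residues from scoring minus infinity.
            freq = (counts[a] + pseudocount * bg[a]) / (n + pseudocount)
            scores[a] = math.log2(freq / bg[a])
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Score a sequence window of the same width as the profile."""
    return sum(col[c] for col, c in zip(pssm, window))


hits = ["ACGT", "ACGA", "ACGT", "TCGT"]  # toy alignment of "hits"
pssm = build_pssm(hits, "ACGT")
print(round(score_window(pssm, "ACGT"), 2), round(score_window(pssm, "TTTT"), 2))
```

In an iterative search, the windows that score well against this profile would be added to the alignment and the profile rebuilt for the next round.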
27 Types of BLAST
- BLASTP: search a protein sequence against a protein database.
- BLASTN: search a nucleotide sequence against a nucleotide database.
- TBLASTN: search a protein sequence against a nucleotide database, by translating each database nucleotide sequence in all 6 reading frames.
- BLASTX: search a nucleotide sequence against a protein database, by first translating the query nucleotide sequence in all 6 reading frames.
- PSI-BLAST: a profile is generated from identified homologs, used for searching, and iteratively updated (especially good for identifying remote homologies).
- PHI-BLAST: enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching.
28 Assessing Evidence for Homology
- Is the score higher than expected from 2 random sequences (non-homologs)?
- BLAST scores are not independent of the length of the sequences being aligned
- Fit an extreme value distribution to randomly shuffled sequences: BLAST returns the maximum score, and the maximum of a large number of i.i.d. random variables tends to an extreme value distribution
- Expected number of HSPs with score at least $S$ for 2 sequences of length $m$ and $n$: $E = K m n e^{-\lambda S}$
- Similar approach for Smith-Waterman
29 BLAST Availability h.gov/education/blaSTinfo/information3.html BLAST.html
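The six-frame translation used by TBLASTN and BLASTX can be sketched as follows: three frames on the forward strand and three on the reverse complement. The code uses the standard genetic code; the function names and the frame labels (+1..+3, -1..-3) are choices made for this example.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(frame, protein)
```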
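The expected-count formula on this slide is easy to evaluate directly. In the sketch below the values of K and lambda are illustrative assumptions (in practice they depend on the scoring matrix and gap penalties), and the conversion from E to a P-value assumes the number of chance HSPs is Poisson distributed.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S) for sequences of length m and n."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance HSP, assuming a Poisson count."""
    return 1.0 - math.exp(-e_value)


K, LAM = 0.13, 0.32  # hypothetical parameter values for illustration
for s in (30, 50, 80):
    e = expected_hsps(s, m=142, n=1_000_000, K=K, lam=LAM)
    print(s, e, p_value(e))
```

Because E scales with m * n, the same raw score is less significant against a larger database, which is why the slide notes that scores are not independent of sequence length.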
30 Results from BLAST with hemoglobin Distribution of 100 Blast Hits on the Query Sequence Mouse-over to show defline and scores. Click to show alignments 30
31 Results from BLAST cont. Sequences producing significant alignments: (Score bits) E-Value gi ref NP_ hemoglobin, alpha 1 [Rattus nor e-55 gi ref XP_ similar to hemoglobin alpha ch e-54 gi dbj BAB unnamed protein product [Mus mu e-46 gi dbj BAB unnamed protein product [Mus mu e-46 gi dbj BAC unnamed protein product [Mus mu e-46 gi dbj BAB unnamed protein product [Mus mu e-45 gi sp P11755 HBA1_TADBR Hemoglobin alpha-1 chain >gi e-45 gi ref NP_ hemoglobin alpha, adult chain e-45 gi gb AAB alpha-globin [Mus musculus] 182 2e-45 gi sp P14387 HBA_ANTPA Hemoglobin alpha chain >gi e-45 gi pir HASL1W hemoglobin alpha-i chain - Weddell seal 180 7e-45 gi sp P18969 HBA_AILFU Hemoglobin alpha chain >gi e-44 gi pir HASHR2 hemoglobin alpha-ii chain - aoudad (t e-44 gi sp P01950 HBA_SUNMU Hemoglobin alpha chain >gi e-44 gi sp P26915 HBA_NASNA Hemoglobin alpha chain >gi e-44 gi sp Q9XSN3 HBA1_EQUBU Hemoglobin alpha-1 chain > e-44 31
32 Results from BLAST cont.
>gi sp P14387 HBA_ANTPA Hemoglobin alpha chain
gi pir A29702 hemoglobin alpha chain - pallid bat
Length = 141
Score = 180 bits (457), Expect = 6e-45
Identities = 96/141 (68%), Positives = 104/141 (73%)
Query: 2 VLSADDKTNIKNCWXXXXXXXXXXXXXALQRMFAAFPTTKTYFSHIDVSPGSAQVKAHGX 61
VLS DKTN+K W AL+RMF +FPTTKTYF H D+SGSAQVK HG
Sbjct: 1 VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK 60
Query: 62 XXXXXXXXXXXHVEDLPGALSTLSDLHAHKLRVDPVNFKFLSHCLLVTLACHHPGDFTPA 121
H++DLPGALS LSDLHA+KLRVDPVNFK LSHCLLVTLACHHPGDFTPA
Sbjct: 61 KVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHPGDFTPA 120
Query: 122 MHASLDKFLASVSTVLTSKYR 142
+HASLDKFLASVSTVL SKYR
Sbjct: 121 VHASLDKFLASVSTVLVSKYR
33 BALSA (Bayesian Algorithm for Local Sequence Alignment) Smith-Waterman recursion with sums Formulate sequence alignment as a Bayesian inference problem Everything is a random variable Allows multiple scoring matrices Make inferences on the parameters 33
34 BALSA Methodology
Joint = likelihood * priors:
$P(R^{(1)}, R^{(2)}, A, \Theta, \Lambda) = P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda)\, P(\Theta, \Lambda)$
Priors (a priori):
$P(\Theta, \Lambda) = \frac{1}{N_{\Theta, \Lambda}}$
$P(A \mid \Lambda) = P(A \mid \lambda_o, \lambda_e) = \frac{\lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A'} \lambda_o^{k_g(A')} \lambda_e^{l_g(A') - k_g(A')}}$
Algorithm:
$P(R^{(1)}, R^{(2)} \mid \Theta, \Lambda) = \sum_A P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda) = \frac{\sum_A P(R^{(1)}, R^{(2)} \mid A, \Theta)\, \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_A \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}$
35 Assessing Evidence for Homology
The scores are independent of the length of the sequences being aligned.
$\text{Score} = \frac{P(R^{(1)}, R^{(2)} \mid H)}{P(R^{(1)})\, P(R^{(2)})}$
Directly calculate the probability of not being homologous from the score:
$P(\bar{H} \mid R^{(1)}, R^{(2)}) = \left[ 1 + \text{Score} \cdot \frac{P(H)}{P(\bar{H})} \right]^{-1}$
36 BALSA Availability /balsa/balsa.html
37 Conclusions on Pairwise Sequence Alignment
Choice of algorithm is based on need: there is a trade-off between sensitivity and speed.
SCOP40 sensitivity at 1% EPQ:
Method      Sensitivity
BLAST       14.8%
FASTA       16.7%
SSEARCH     18.4%
BALSA w/1   19.2%
BALSA w/4   19.8%
[Figure axis label: Speed]
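Going from a score to a probability is an application of Bayes' rule. The sketch below assumes the score is the likelihood ratio of the homology model H against independent sequences, and that a prior P(H) is supplied; the function names and the prior used in the example are assumptions, not values from the slides.

```python
def posterior_homolog(score, prior_h):
    """P(H | R1, R2) from a likelihood-ratio score and a prior P(H).

    score = P(R1, R2 | H) / (P(R1) * P(R2)), so by Bayes' rule
    P(H | R1, R2) = score * P(H) / (score * P(H) + 1 - P(H)).
    """
    return score * prior_h / (score * prior_h + (1.0 - prior_h))


def posterior_not_homolog(score, prior_h):
    """P(not H | R1, R2), the complement of the posterior above."""
    return 1.0 - posterior_homolog(score, prior_h)


# Hypothetical prior: 1 in 1000 database sequences is a true homolog.
for s in (1.0, 1e3, 1e6):
    print(s, posterior_homolog(s, 0.001), posterior_not_homolog(s, 0.001))
```

A score of 1 leaves the prior unchanged, and a score equal to the prior odds against homology gives a posterior of one half.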
38 Outline
- Pairwise Sequence Alignment: Dynamic Programming, Heuristic, Bayesian
- Multiple Sequence Alignment: Heuristic, Statistics (Gibbs Sampling)
- Conclusions
39 Multiple Sequence Alignment Multiple sequence alignment is generally concerned with finding structural or functional patterns between sequences Develop relationships and phylogenies Determine consensus sequence Build gene families Model protein structure for threading and fold prediction 39
40 Motivation: Example Motif Discovery 40
41 Motif -> Scoring Matrix
42 Motif Finding Programs (search against a database of motifs)
- GCG SEQWEB programs: STRINGSEARCH, FINDPATTERNS, MOTIFS
- PROSITE web programs: PROSITE - PROSITE SCAN
- emotif web programs: emotif, emotif-search, emotif-scan, 3MOTIF
43 Multiple Sequence Alignment Challenges
- Computational complexity: O(n^k) for k sequences of length n
- Space requirements: O(n^k) for k sequences of length n
- Sequence clusters require a weighting function
- Weighted alignments tend to overweight erroneous sequences
- Approximations must be used for real-world data:
  - Linked lists are used to find exact words shared between k sequences
  - BLAST can find inexact shared words between k sequences
  - FASTA can be used to do progressive pairwise alignments
  - Gibbs sampling finds the best overall alignment stochastically
- The final alignment often depends on the order in which the data are presented
- Gaps make alignments unnaturally long
44 Multiple Alignment
- Multiple Sequence Aligner (MSA): builds a linked list of words
- GenAlign: iteratively adds sequences
- ClustalW: progressively adds sequences using clusters (a dendrogram)
- Gibbs Sampling: generates a random alignment of size k and iteratively samples and updates the alignment until convergence
45 ClustalW Step 1 Generate all pairwise alignments Generate dendrogram from alignment scores
46 ClustalW Step 2 Align most similar pair Align next most similar pair Combine 2 alignments 46
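The order in which sequences are combined follows from the dendrogram. The sketch below builds a guide tree from pairwise distances using average-linkage clustering; this is a simpler stand-in chosen for brevity (ClustalW itself builds its guide tree by neighbor-joining), and the distance values and function names are hypothetical.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))


def guide_tree(names, dist):
    """Return a nested-tuple tree; the closest clusters are merged first."""
    clusters = [((n,), n) for n in names]  # (members, subtree)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]][0], clusters[p[1]][0], dist),
        )
        (m1, t1), (m2, t2) = clusters[i], clusters[j]
        merged = (m1 + m2, (t1, t2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]


# Hypothetical distances, e.g. 1 - fractional identity from pairwise alignments.
pairs = {("A", "B"): 0.10, ("C", "D"): 0.20, ("A", "C"): 0.80,
         ("A", "D"): 0.90, ("B", "C"): 0.70, ("B", "D"): 0.85}
dist = {frozenset(k): v for k, v in pairs.items()}
print(guide_tree(["A", "B", "C", "D"], dist))
```

Reading the tree from the leaves upward gives the alignment order: A with B, then C with D, then the two resulting alignments are combined.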
47 ClustalW General Approach 47
48 ClustalW Availability clustalw/ r.fr/seqanal/interfaces/clustalw.html et.org/software/clustalw.html
49 Gibbs Sampling
Traditional Gibbs sampling:
1. Sample an alignment given parameters: $P(A \mid \Theta, R)$
2. Sample parameters given an alignment: $P(\Theta \mid A, R)$
50 Phylogenetic Footprinting Find DNA functional elements and signals in the non-coding region surrounding a gene 50
51 Transcription Regulation
Gene transcription and regulation:
- Transcription is initiated by RNA polymerase binding
- Enhancers and repressors: binding of transcription factors inhibits or enhances expression
[Figure: RNA polymerase at the promoter region, upstream of the start codon (AUG), on a strand drawn 5' to 3']
52 Example: Corepressor Transcription in process Transcription inhibited 52
53 Motif Alignment Model
[Figure: sequences 1 to k with motif start positions $a_1, a_2, \ldots, a_k$; motif width = $w$; sequence $k$ has length $n_k$]
The missing data is the alignment variable $A = \{a_1, a_2, \ldots, a_k\}$.
Alignment: starting positions of binding sites in each sequence.
A priori, all positions are equally likely.
The final alignment depends on the DNA sequence.
54 Gibbs Sampler Algorithm
Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_k^{(0)}$.
Iterate the following steps many times:
- Randomly or systematically choose a sequence, say sequence $k$, to exclude.
- Carry out the predictive-updating step to update $a_k$.
Stop when changes are infrequent or some criterion is met.
55 Gibbs Sampler Availability adsworth.org/gibbs/gibbs.html matics.ubc.ca/resources/tools/index.php?name=gibbs
56 Conclusions
- Sequence analysis is the most commonly performed task in bioinformatics
- The choice of algorithm depends on needs
  - Pairwise: homology detection
  - Multiple: motif detection, building gene families, phylogenetic footprinting
- The future is in whole genome comparisons
57 Other Sources of Information
Extensive tutorials
Bioinformatics books:
- Biological Sequence Analysis (Durbin et al.)
- Bioinformatics: The Machine Learning Approach (Baldi & Brunak)
- Computational Molecular Biology (Pevzner)
Journal articles:
- Altschul et al. Journal of Molecular Biology, ; p (Original BLAST paper)
- Altschul et al. Nucleic Acids Research, ; p (PSI-BLAST paper)
- McCue et al. Nucleic Acids Research, ; p (Phylogenetic Footprinting)
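A compact version of this loop is sketched below for DNA sequences. It is a simplified illustration: it runs a fixed number of iterations instead of testing for convergence, estimates the background from all sequences, and uses a flat pseudocount; the function name, parameters and toy sequences are assumptions, and at least two sequences are required.

```python
import random
from collections import Counter


def gibbs_motif_sampler(seqs, w, iterations=200, pseudocount=0.5, seed=0,
                        alphabet="ACGT"):
    """Return one motif start position per sequence for a motif of width w."""
    rng = random.Random(seed)
    # Initialize with random starting positions a_1, ..., a_k.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]

    # Background residue frequencies (simplification: from all sequences).
    bg_counts = Counter("".join(seqs))
    bg_total = sum(bg_counts[c] for c in alphabet) + pseudocount * len(alphabet)
    bg = {c: (bg_counts[c] + pseudocount) / bg_total for c in alphabet}

    for _ in range(iterations):
        k = rng.randrange(len(seqs))  # sequence to exclude
        others = [s[a:a + w] for idx, (s, a) in enumerate(zip(seqs, starts))
                  if idx != k]
        # Motif profile estimated from the remaining sequences.
        total = len(others) + pseudocount * len(alphabet)
        profile = []
        for column in zip(*others):
            counts = Counter(column)
            profile.append({c: (counts[c] + pseudocount) / total for c in alphabet})
        # Predictive update: sample a new start for sequence k with
        # probability proportional to the motif/background likelihood ratio.
        s = seqs[k]
        weights = []
        for a in range(len(s) - w + 1):
            ratio = 1.0
            for pos, c in enumerate(s[a:a + w]):
                ratio *= profile[pos][c] / bg[c]
            weights.append(ratio)
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["ACGTTGACCTAG", "TTGACCGGA", "CCATTGACCA"]  # toy data sharing TTGACC
starts = gibbs_motif_sampler(seqs, w=6, iterations=100, seed=1)
print(starts, [s[a:a + 6] for s, a in zip(seqs, starts)])
```

Because the sampler is stochastic, different seeds can settle on different positions; running several chains and comparing them is a common safeguard.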
Biology & Big Data. Debasis Mitra Professor, Computer Science, FIT
Biology & Big Data Debasis Mitra Professor, Computer Science, FIT Cloud? Debasis Mitra, Florida Tech Data as Service Transparent to user Multiple locations Robustness Software as Service Software location
Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations
Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations AlCoB 2014 First International Conference on Algorithms for Computational Biology Thiago da Silva Arruda Institute
Specific problems. The genetic code. The genetic code. Adaptor molecules match amino acids to mrna codons
Tutorial II Gene expression: mrna translation and protein synthesis Piergiorgio Percipalle, PhD Program Control of gene transcription and RNA processing mrna translation and protein synthesis KAROLINSKA
Comparing Methods for Identifying Transcription Factor Target Genes
Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Folie 1 Transcriptional Regulation TF
Structure and Function of DNA
Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four
The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:
Module 3F Protein Synthesis So far in this unit, we have examined: How genes are transmitted from one generation to the next Where genes are located What genes are made of How genes are replicated How
Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks
Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks Semester II Paper II: Mathematics I 85 marks B.Sc. II Year
Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals
Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals Xiaohui Xie 1, Jun Lu 1, E. J. Kulbokas 1, Todd R. Golub 1, Vamsi Mootha 1, Kerstin Lindblad-Toh
Molecular Genetics. RNA, Transcription, & Protein Synthesis
Molecular Genetics RNA, Transcription, & Protein Synthesis Section 1 RNA AND TRANSCRIPTION Objectives Describe the primary functions of RNA Identify how RNA differs from DNA Describe the structure and
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions
BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative
Translation Study Guide
Translation Study Guide This study guide is a written version of the material you have seen presented in the replication unit. In translation, the cell uses the genetic information contained in mrna to
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
2.3 Identify rrna sequences in DNA
2.3 Identify rrna sequences in DNA For identifying rrna sequences in DNA we will use rnammer, a program that implements an algorithm designed to find rrna sequences in DNA [5]. The program was made by
Global and Discovery Proteomics Lecture Agenda
Global and Discovery Proteomics Christine A. Jelinek, Ph.D. Johns Hopkins University School of Medicine Department of Pharmacology and Molecular Sciences Middle Atlantic Mass Spectrometry Laboratory Global
