Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

1 Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004

2 Sequence Analysis: Anything connected to identifying higher biological meaning in raw sequence data.

3 Genomic & Proteomic Data: Sequence alignment was one of the first bioinformatics techniques (~1970) and predates high-throughput sequencing techniques. [Figure: data volume in petabytes versus years, for proteomic data and GenBank]

4 Motivation: Structure Prediction Challenge

5 Guiding Principle: The basic guiding principle is EVOLUTION. Regions of the DNA/protein sequence that are functional units are less tolerant of mutation. Examples: myoglobin vs. hemoglobin; zinc finger.

6 Outline: Pairwise Sequence Alignment (Dynamic Programming; Heuristic; Bayesian). Multiple Sequence Alignment (Heuristic; Statistics (Gibbs Sampling)). Conclusions.

7 Pairwise Sequence Alignment: One of the most commonly performed tasks in bioinformatics. A method to compare two sequences and make inferences about the relationship between them. Homologs = two molecules that share a common ancestor.

8 Searching for Homology
Query (unknown function/structure):
>d1npx_ ( ) NADH peroxidase
IPGKDLDNIYLMRGRQWAIKLKQKTVDPEVNNVVVIGSGYIGIEAAEAFAKAGKKVTVID
ILDRPlGVYLDKEFTDVLTEEMEANNITIATGETVERYEGDGRVQKVVTDKNAYDADLVV
VAV
Target (known function/structure):
>d3lada ( ) Dihydrolipoamide dehydrogenase
PAPVDQDVIVDSTGALDFQNVPGKLGVIGAGVIGLELGSVWARLGAEVTVLEAMDKFLPA
VDEQVAKEAQKILTKQGLKILLGARVTGTEVKNKQVTVKFVDAEGEKSQAFDKLIVAVG

9 Pairwise Sequence Alignment: A one-to-one correspondence between the residues of two sequences,
$R^{(1)} = \{R^{(1)}_1, \ldots, R^{(1)}_I\} = \{\text{HEAGAWGHE}\}$ and $R^{(2)} = \{R^{(2)}_1, \ldots, R^{(2)}_J\} = \{\text{GHEE}\}$.
Global: HEAGAWGHE aligned with GHEE over their full lengths. Local: GHE aligned with GHE. Almost all alignment is done at the local level.

10 Why Sequence Alignment Algorithms? There are a huge number of possible alignments (length …: …,000 alignments; length …: … billion alignments). Alignment algorithms reduce this exponential search to O(n^2), where n is the length of the longest sequence.

11 Evolution (mutation): Coded for at the DNA level. Captured in 2 parameters: scoring matrices and gap penalties.

12 Scoring Matrices: Characterize the probability that one residue was substituted for another (log odds-ratio). [Table: BLOSUM substitution matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V]
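The log odds-ratio idea behind a scoring matrix entry can be sketched in a few lines. This is a minimal illustration, not the BLOSUM construction itself: the function name `log_odds`, the example frequencies, and the scale factor of 2 (half-bit units) are all assumptions made for the example.

```python
import math


def log_odds(p_ab: float, q_a: float, q_b: float, scale: float = 2.0) -> float:
    """Log-odds substitution score for a residue pair (a, b).

    p_ab  : assumed frequency of a and b aligned in related sequences
    q_a, q_b : assumed background frequencies of a and b
    scale : hypothetical scaling factor (2.0 gives half-bit units)
    """
    return scale * math.log2(p_ab / (q_a * q_b))


# A pair seen twice as often as chance scores positive,
# a pair seen half as often as chance scores negative.
print(log_odds(0.02, 0.1, 0.1))
print(log_odds(0.005, 0.1, 0.1))
```

Positive entries mark substitutions that occur more often in related sequences than expected by chance; negative entries mark substitutions that are selected against.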

13 Gap Penalties: Characterize the probability that a residue was inserted or deleted, for a gap of length g.
Linear: $\gamma(g) = -gd$, where d = gap opening penalty.
Affine: $\gamma(g) = -d - (g-1)e$, where e = gap extension penalty.
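The two gap models can be compared numerically. The sketch below assumes the standard forms, linear gamma(g) = -g*d and affine gamma(g) = -d - (g-1)*e; the penalty values used in the example calls are arbitrary illustrations.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear gap score: every gap position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: opening costs d, each further position costs e."""
    return -d - (g - 1) * e


# Illustrative penalties only (d=12 and e=2 are assumptions).
for g in (1, 3, 10):
    print(g, linear_gap(g, 12), affine_gap(g, 12, 2))
```

With e smaller than d, the affine model penalizes one long gap less than several short ones, which is usually the reason for choosing it.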

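14 Typical Output (sequence alignment): a match marker means two residues are identical; ":" means that two residues have similar physicochemical properties. Typical objective function, with gap penalties taken as positive costs:
$\text{Score} = \sum_{\text{start}}^{\text{end}} \text{ScoreMatrix} - \sum_{\text{start}}^{\text{end}} \text{GapPenalties}$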

15 Pairwise Sequence Alignment Algorithms
Exhaustive (Dynamic Programming): Needleman-Wunsch (1970); Smith-Waterman (1981)
Approximate (Heuristic Methods): BLAST and PSI-BLAST (1990, 1997); FASTA (1988)
Statistical: Bayes Block Aligner (1998); BALSA (2002)

16 Dynamic Programming: An optimization method that uses sequential decisions to solve the problem. Global: Needleman-Wunsch. Local: Smith-Waterman. Almost all alignment is done at the local level.

17 Dynamic Programming Algorithms: The optimal alignment, A*, is found by fixing the scoring matrix, $\Theta_0$, and the gap penalties, $\Lambda_0$, and maximizing the log-likelihood:
$\log P(R^{(1)}, R^{(2)}, A^* \mid \Theta_0, \Lambda_0) = \max_A \{ \log P(R^{(1)}, R^{(2)} \mid A, \Theta_0) + \log P(A \mid \Lambda_0) \}$
The first term comes from the scoring matrix, $s(R^{(1)}_i, R^{(2)}_j)$; the second comes from the gap penalties (d = gap opening penalty, e = gap extension penalty). The alignment is encoded as $A_{i,j} = 1$ if $R^{(1)}_i$ is aligned with $R^{(2)}_j$, and $A_{i,j} = 0$ otherwise.

18 Standard Dynamic Programming Choices: Match: $R^{(1)}_i$ aligned with $R^{(2)}_j$. Insertion into Sequence 1: $R^{(1)}_i$ aligned with a gap (-). Deletion from Sequence 1: a gap (-) aligned with $R^{(2)}_j$.

19 Algorithm (Smith-Waterman, simple gap penalty):
$F(i,j) = \max\{ F(i-1,j-1) + s(R^{(1)}_i, R^{(2)}_j),\; F(i-1,j) - d,\; F(i,j-1) - d,\; 0 \}$
where s is the scoring matrix and d is the gap penalty. [Figure: worked dynamic programming matrix for a short example]
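A minimal sketch of the local alignment recursion with traceback follows. It assumes a toy scoring scheme (match = +1, mismatch = -1, linear gap penalty d = 1) instead of a real substitution matrix, and the function name `smith_waterman` is chosen for the example only.

```python
def smith_waterman(seq1: str, seq2: str, match: int = 1,
                   mismatch: int = -1, d: int = 1):
    """Local alignment with a linear gap penalty d (toy scoring scheme)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix: each cell is the best of match, two gap moves, or 0.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d,
                          F[i][j - 1] - d, 0)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            a1.append(seq1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("HEAGAWGHE", "GHEE"))  # (3, 'GHE', 'GHE')
```

Under this toy scoring the two example sequences from the earlier slide align locally as GHE with GHE. A real search would replace the match/mismatch constants with a substitution matrix lookup.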

20 Reconstructing the Alignment

21 Smith-Waterman Availability: …affrc.go.jp/htdocs/swsrch/; …ac.uk/mpsrch/; …tware/seqaln/seqaln-query.html

22 Results from SWsrch with hemoglobin
Title: >hemoglobin
mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg
kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp
amhasldkflasvstvltskyr
Perfect Score: 750
Sequence: 1 MVLSADDKTNIKNCWGKIGG...HASLDKFLASVSTVLTSKYR 142
SUMMARIES (columns: Result No.; Score; % Query Match; Length; DB; ID; Description)
HART1(I54239;I68531;A26903;A93047;A90284;A90285;A02268) hemoglobin alpha-1 chain - rat &RATHBAM_1(M17083 pid:
(P01946) Hemoglobin alpha-1 and alpha-2 c
RN2A1GL_1(X56325 pid:g ) R.norvegicus
AK003077_1(AK pid:none) Mus musculus a
HAMS(A90791;I49720;A45964;I49722;I49721;B43560;A92945;) hemoglobin alpha chains - mouse &MMAGL1_1(V00714 pid:
AK011076_1(AK pid:none) Mus musculus
(P01942) Hemoglobin alpha chain
(P20854) Hemoglobin alpha chain. &HARTNG
L75940_1(L75940 pid:none) Mus musculus alpha
AK010422_1(AK pid:none) Mus musculus E
HASL1W(S10481)hemoglobin alpha-i chain - Wed
(P01938) Hemoglobin alpha chain. &HALRN(
(P09420) Hemoglobin alpha chain. &A
(P01930) Hemoglobin alpha chain. &HAMQB(
(P18969) Hemoglobin alpha chain. &HAFQL(
(P01974) Hemoglobin alpha chain. &HACMA(
HAMN2F(S11533)hemoglobin alpha-ii chain - do
(P01945) Hemoglobin alpha chain. &HAHY(A
(P15163) Hemoglobin alpha-i and alpha-ii
(P01928) Hemoglobin alpha chain. &HAMQA(

23 Results from SWsrch with hemoglobin cont.
RESULT 9
>L75940_1(L75940 pid:none) Mus musculus alpha-globin mrna, complete cds. &MUSALGL_1(L75940 pid:g )
Query Match 83.1%; Score 623; DB 1; Length 142;
Best Local Similarity 83.8%; Pred. No. 1.80e-63;
Matches 119; Conservative 7; Mismatches 16; Indels 0; Gaps 0;
Inserts 0; InsGaps 0; Deletes 0; DelGaps 0;
Db 1 mvlsgedksnikaawgkigghgaeyvaealermfasfpttktyfphfdvshgsaqvkghg
Qy 1 mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg 60
Db 61 kkvadalasaaghlddlpgalsalsdlhahklrvdpvnfkllshcllvtlashhpadftp
Qy 61 kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp 120
Db 121 avhasldkflasvstvltskyr
Qy 121 amhasldkflasvstvltskyr

24 BLAST (Basic Local Alignment Search Tool): (1) Indexes the database. (2) Calculates the neighborhood of each word in the query using the scoring matrix and a probability threshold. (3) Looks up all words and neighbors from the query in the database index. (4) Extends high-scoring segment pairs (HSPs) left and right to maximal length. (5) Finds maximal segment pairs (MSPs) between query and database.
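The indexing and word lookup steps can be sketched as follows. This is a simplified illustration only: it finds exact word matches, and omits both the scored neighborhood words and the HSP extension step. The names `build_index` and `find_seeds`, the word length w = 3, and the toy database are assumptions for the example, not BLAST's actual data structures.

```python
from collections import defaultdict


def build_index(db: dict, w: int = 3) -> dict:
    """Map every length-w word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_seeds(query: str, index: dict, w: int = 3) -> list:
    """Return (query position, sequence name, database position) for exact word hits."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for name, dpos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, name, dpos))
    return hits


# Hypothetical two-sequence database.
db = {"s1": "AAGHEAA", "s2": "WWWWW"}
print(find_seeds("GHE", build_index(db)))  # [(0, 's1', 2)]
```

Each seed would then be extended in both directions to form a high-scoring segment pair; the index is built once and reused for every query, which is where the speed advantage over full dynamic programming comes from.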

25 BLAST database search

26 PSI-BLAST (Position-Specific Iterated BLAST): A profile (position-specific scoring matrix, PSSM) is constructed from a multiple alignment of the highest-scoring hits in a BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. The profile is used to perform a second (etc.) BLAST search, and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.
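A position-specific scoring matrix can be sketched as per-column log-odds scores computed from an ungapped multiple alignment. The sketch below is an assumption-laden illustration, not PSI-BLAST's actual weighting scheme: it uses a flat pseudocount, a uniform background by default, and hypothetical function names `build_pssm` and `score_segment`.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Per-column log2 odds scores from equal-length, ungapped aligned sequences."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[ch] for column, ch in zip(pssm, segment))


profile = build_pssm(["ACG", "ACG", "ACT"], "ACGT")
print(score_segment(profile, "ACG") > score_segment(profile, "TTT"))  # True
```

In an iterative search, the hits found with one profile would be aligned and used to rebuild the profile before the next round.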

27 Types of BLAST
BLASTP: searches a protein sequence against a protein database.
BLASTN: searches a nucleotide sequence against a nucleotide database.
TBLASTN: searches a protein sequence against a nucleotide database, by translating each database nucleotide sequence in all 6 reading frames.
BLASTX: searches a nucleotide sequence against a protein database, by first translating the query nucleotide sequence in all 6 reading frames.
PSI-BLAST: a profile is generated from identified homologs, used, and iteratively updated (especially good for identifying remote homologies).
PHI-BLAST: enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching.
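The "all 6 reading frames" used by TBLASTN and BLASTX are three offsets on the forward strand and three on the reverse complement. The sketch below only extracts the six codon-aligned nucleotide frames; the translation to amino acids (which needs a codon table) is left out, and the names `reverse_complement` and `six_frames` are chosen for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return dna.translate(_COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict:
    """Return the six reading frames, each trimmed to a whole number of codons."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            usable = len(seq) - offset
            usable -= usable % 3  # drop the trailing partial codon
            frames[f"{strand}{offset + 1}"] = seq[offset:offset + usable]
    return frames


for name, frame in six_frames("ATGGCCTAA").items():
    print(name, frame)
```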

28 Assessing Evidence for Homology: Is the score higher than expected from 2 random sequences (non-homologs)? BLAST scores are not independent of the length of the sequences being aligned. Fit an extreme value distribution to scores from randomly shuffled sequences: BLAST returns the maximum score, and the maximum of a large number of i.i.d. random variables tends to an extreme value distribution. Expected number of HSPs with score at least S for 2 sequences of lengths m and n:
$E = K m n e^{-\lambda S}$
A similar approach applies to Smith-Waterman.
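The expected-count formula E = K m n exp(-lambda S) is easy to evaluate directly. In the sketch below, the values of K and lambda are placeholders chosen for illustration; in practice they depend on the scoring matrix and gap penalties and are estimated, not assumed.

```python
import math


def expected_hsps(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of HSPs scoring at least `score` for lengths m and n."""
    return K * m * n * math.exp(-lam * score)


# Placeholder parameters (K=0.1, lam=0.3 are assumptions, not fitted values).
for s in (20, 40, 60):
    print(s, expected_hsps(s, m=142, n=1_000_000, K=0.1, lam=0.3))
```

The expected count falls exponentially with the score and grows linearly with each sequence length, which is why the same raw score is less significant against a larger database.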

29 BLAST Availability: …h.gov/education/BLASTinfo/information3.html; …BLAST.html

30 Results from BLAST with hemoglobin [Figure: distribution of 100 BLAST hits on the query sequence]

31 Results from BLAST cont.
Sequences producing significant alignments (columns: description; score in bits; E-value):
gi ref NP_ hemoglobin, alpha 1 [Rattus nor e-55
gi ref XP_ similar to hemoglobin alpha ch e-54
gi dbj BAB unnamed protein product [Mus mu e-46
gi dbj BAB unnamed protein product [Mus mu e-46
gi dbj BAC unnamed protein product [Mus mu e-46
gi dbj BAB unnamed protein product [Mus mu e-45
gi sp P11755 HBA1_TADBR Hemoglobin alpha-1 chain >gi e-45
gi ref NP_ hemoglobin alpha, adult chain e-45
gi gb AAB alpha-globin [Mus musculus] 182 2e-45
gi sp P14387 HBA_ANTPA Hemoglobin alpha chain >gi e-45
gi pir HASL1W hemoglobin alpha-i chain - Weddell seal 180 7e-45
gi sp P18969 HBA_AILFU Hemoglobin alpha chain >gi e-44
gi pir HASHR2 hemoglobin alpha-ii chain - aoudad (t e-44
gi sp P01950 HBA_SUNMU Hemoglobin alpha chain >gi e-44
gi sp P26915 HBA_NASNA Hemoglobin alpha chain >gi e-44
gi sp Q9XSN3 HBA1_EQUBU Hemoglobin alpha-1 chain > e-44

32 Results from BLAST cont.
>gi sp P14387 HBA_ANTPA Hemoglobin alpha chain gi pir A29702 hemoglobin alpha chain - pallid bat
Length = 141
Score = 180 bits (457), Expect = 6e-45
Identities = 96/141 (68%), Positives = 104/141 (73%)
Query: 2 VLSADDKTNIKNCWXXXXXXXXXXXXXALQRMFAAFPTTKTYFSHIDVSPGSAQVKAHGX 61
VLS DKTN+K W AL+RMF +FPTTKTYF H D+SGSAQVK HG
Sbjct: 1 VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK 60
Query: 62 XXXXXXXXXXXHVEDLPGALSTLSDLHAHKLRVDPVNFKFLSHCLLVTLACHHPGDFTPA 121
H++DLPGALS LSDLHA+KLRVDPVNFK LSHCLLVTLACHHPGDFTPA
Sbjct: 61 KVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHPGDFTPA 120
Query: 122 MHASLDKFLASVSTVLTSKYR 142
+HASLDKFLASVSTVL SKYR
Sbjct: 121 VHASLDKFLASVSTVLVSKYR

33 BALSA (Bayesian Algorithm for Local Sequence Alignment): Smith-Waterman recursion with sums. Formulates sequence alignment as a Bayesian inference problem. Everything is a random variable. Allows multiple scoring matrices. Makes inferences on the parameters.

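34 BALSA Methodology
Joint = likelihood × priors:
$P(R^{(1)}, R^{(2)}, A, \Theta, \Lambda) = P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda)\, P(\Theta, \Lambda)$
Priors:
$P(\Theta, \Lambda) = \frac{1}{N_{\Theta\Lambda}}$
A priori (alignment), with $k_g(A)$ the number of gaps and $l_g(A)$ the total gap length:
$P(A \mid \Lambda) = P(A \mid \lambda_o, \lambda_e) = \frac{\lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A} \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}$
Algorithm:
$P(R^{(1)}, R^{(2)} \mid \Theta, \Lambda) = \sum_{A} P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda) = \frac{\sum_{A} P(R^{(1)}, R^{(2)} \mid A, \Theta)\, \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A} \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}$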

35 Assessing Evidence for Homology: The scores are independent of the length of the sequences being aligned.
$\text{Score} = \frac{P(R^{(1)}, R^{(2)} \mid H)}{P(R^{(1)})\, P(R^{(2)})}$
The probability that the sequences are not homologous is calculated directly from the score:
$P(\bar{H} \mid R^{(1)}, R^{(2)}) = \left[ 1 + \text{Score} \cdot \frac{P(H)}{P(\bar{H})} \right]^{-1}$
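Turning a likelihood-ratio score into a posterior probability is a direct application of Bayes' rule. The sketch below assumes the score is the ratio of the probability of the two sequences under homology to the product of their probabilities under independence, and that the prior probability of homology is supplied by the user; the function name `posterior_homolog` and the example numbers are illustrative.

```python
def posterior_homolog(score: float, prior_h: float) -> float:
    """Posterior probability of homology from a likelihood-ratio score.

    score   : assumed ratio P(R1, R2 | H) / (P(R1) * P(R2))
    prior_h : assumed prior probability P(H), strictly between 0 and 1
    """
    weighted = score * prior_h
    return weighted / (weighted + (1.0 - prior_h))


# A score of 1 carries no evidence, so the posterior equals the prior.
print(posterior_homolog(1.0, 0.3))
# Strong evidence overcomes a small (illustrative) prior.
print(posterior_homolog(1e6, 1e-4))
```

The probability of non-homology is one minus this value, so a single threshold on the posterior can be applied regardless of sequence length.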

36 BALSA Availability: …/balsa/balsa.html

37 Conclusions on Pairwise Sequence Alignment: The choice of algorithm is based on need. There is a trade-off between sensitivity and speed.
SCOP40 sensitivity at 1% EPQ:
BLAST      14.8%
FASTA      16.7%
SSEARCH    18.4%
BALSA w/1  19.2%
BALSA w/4  19.8%
[Figure: sensitivity shown against speed]

38 Outline: Pairwise Sequence Alignment (Dynamic Programming; Heuristic; Bayesian). Multiple Sequence Alignment (Heuristic; Statistics (Gibbs Sampling)). Conclusions.

39 Multiple Sequence Alignment: Multiple sequence alignment is generally concerned with finding structural or functional patterns between sequences. Uses: develop relationships and phylogenies; determine a consensus sequence; build gene families; model protein structure for threading and fold prediction.

40 Motivation: Example Motif Discovery

41 Motif -> Scoring Matrix

42 Motif Finding Programs (search against a database of motifs)
GCG SEQWEB programs: STRINGSEARCH; FINDPATTERNS; MOTIFS
PROSITE web programs: PROSITE; PROSITE SCAN
emotif web programs: emotif; emotif-search; emotif-scan
3MOTIF

43 Multiple Sequence Alignment Challenges
Computational complexity: O(n^k) for k sequences of length n.
Space requirements: O(n^k) for k sequences of length n.
Sequence clusters require a weighting function.
Weighted alignments tend to overweight erroneous sequences.
Approximations must be used for real-world data:
Linked lists are used to find exact words shared between k sequences.
BLAST can find inexact shared words between k sequences.
FASTA can be used to do progressive pairwise alignments.
Gibbs sampling finds the best overall alignment stochastically.
The final alignment often depends on the order in which the data are presented.
Gaps make alignments unnaturally long.

44 Multiple Alignment
Multiple Sequence Aligner (MSA): builds a linked list of words.
GenAlign: iteratively adds sequences.
ClustalW: progressively adds sequences using clusters (a dendrogram).
Gibbs Sampling: generates a random alignment of size k and iteratively samples and updates the alignment until convergence.

45 ClustalW Step 1: Generate all pairwise alignments. Generate a dendrogram from the alignment scores.

46 ClustalW Step 2: Align the most similar pair. Align the next most similar pair. Combine the 2 alignments.

47 ClustalW General Approach

48 ClustalW Availability: …clustalw/; …r.fr/seqanal/interfaces/clustalw.html; …et.org/software/clustalw.html

49 Gibbs Sampling (traditional): 1. Sample an alignment given the parameters, $P(A \mid \Theta, R)$. 2. Sample the parameters given an alignment, $P(\Theta \mid A, R)$.

50 Phylogenetic Footprinting: Find DNA functional elements and signals in the non-coding region surrounding a gene.

51 Transcription Regulation (Gene Transcription and Regulation): Transcription is initiated by RNA polymerase binding. Enhancers and repressors: binding of transcription factors inhibits or enhances expression. [Figure: gene drawn 5' to 3', showing RNA polymerase, the promoter region, and the AUG start codon]

52 Example: Corepressor [Figure: transcription in process vs. transcription inhibited]

53 Motif Alignment Model: [Figure: k sequences with motif start positions a_1, a_2, …, a_k; motif width = w; length n_k] The missing data is the alignment variable A = {a_1, a_2, …, a_k}, where the alignment is the set of starting positions of the binding sites in each sequence. A priori, all positions are equally likely. The final alignment depends on the DNA sequence.

54 Gibbs Sampler Algorithm: Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_k^{(0)}$. Iterate the following steps many times: (1) Randomly or systematically choose a sequence, say sequence k, to exclude. (2) Carry out the predictive-updating step to update $a_k$. Stop when changes are infrequent or some criterion is met.
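The initialize, exclude, and update loop can be sketched as a small site sampler. This is a simplified illustration under stated assumptions: a uniform background model, a flat pseudocount, systematic (round-robin) exclusion, and a fixed number of sweeps in place of a convergence test. The function name `gibbs_motif_sampler` and all parameter defaults are hypothetical.

```python
import random
from collections import Counter


def gibbs_motif_sampler(seqs, w, sweeps=100, pseudo=1.0,
                        alphabet="ACGT", seed=0):
    """Sample one motif start position per sequence (motif width w)."""
    rng = random.Random(seed)
    # Initialization: random starting positions a_1, ..., a_k.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    background = 1.0 / len(alphabet)  # assumption: uniform background

    for _ in range(sweeps):
        for k in range(len(seqs)):  # systematically exclude sequence k
            # Build a profile from the current sites of all other sequences.
            sites = [seqs[i][starts[i]:starts[i] + w]
                     for i in range(len(seqs)) if i != k]
            total = len(sites) + pseudo * len(alphabet)
            profile = []
            for col in range(w):
                counts = Counter(site[col] for site in sites)
                profile.append({a: (counts[a] + pseudo) / total
                                for a in alphabet})
            # Predictive update: weight each candidate start in sequence k
            # by its motif-to-background likelihood ratio, then sample.
            weights = []
            for pos in range(len(seqs[k]) - w + 1):
                ratio = 1.0
                for col in range(w):
                    ratio *= profile[col][seqs[k][pos + col]] / background
                weights.append(ratio)
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["ACGTACGTTT", "TTACGTAAAA", "GGGGACGTAC"]
print(gibbs_motif_sampler(seqs, w=4, sweeps=50, seed=1))
```

Because each update samples a position instead of always taking the best one, the sampler can escape poor starting alignments; the price is that the output is stochastic and depends on the seed.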

55 Gibbs Sampler Availability: …adsworth.org/gibbs/gibbs.html; …matics.ubc.ca/resources/tools/index.php?name=gibbs

56 Conclusions: Sequence analysis is the most commonly performed task in bioinformatics. The choice of algorithm depends on needs. Pairwise: homology detection. Multiple: motif detection; building gene families; phylogenetic footprinting. The future is in whole-genome comparisons.

57 Other Sources of Information
Extensive tutorials
Bioinformatics books:
Biological Sequence Analysis (Durbin et al.)
Bioinformatics: The Machine Learning Approach (Baldi & Brunak)
Computational Molecular Biology (Pevzner)
Journal articles:
Altschul et al. Journal of Molecular Biology (original BLAST paper)
Altschul et al. Nucleic Acids Research (PSI-BLAST paper)
McCue et al. Nucleic Acids Research (phylogenetic footprinting)

Pairwise Sequence Alignment

Pairwise Sequence Alignment Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What

More information

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

BLAST. Anders Gorm Pedersen & Rasmus Wernersson BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

Bio-Informatics Lectures. A Short Introduction

Bio-Informatics Lectures. A Short Introduction Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively

More information

Biological Databases and Protein Sequence Analysis

Biological Databases and Protein Sequence Analysis Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to

More information

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:

More information

BIOINFORMATICS TUTORIAL

BIOINFORMATICS TUTORIAL Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.

More information

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Grid - Enabled Tools For Biologists. Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis

More information

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading: Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47 5 BLAST and FASTA This lecture is based on the following, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid and Sensitive Protein

More information

GenBank, Entrez, & FASTA

GenBank, Entrez, & FASTA GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,

More information

Module 1. Sequence Formats and Retrieval. Charles Steward

Module 1. Sequence Formats and Retrieval. Charles Steward The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.

More information

CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/

CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/ CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu [email protected] 1. Introduction

More information

Clone Manager. Getting Started

Clone Manager. Getting Started Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software

More information

Linear Sequence Analysis. 3-D Structure Analysis

Linear Sequence Analysis. 3-D Structure Analysis Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical properties Molecular weight (MW), isoelectric point (pi), amino acid content, hydropathy (hydrophilic

More information

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some

More information

Bioinformática BLAST. Blast information guide. Buscas de sequências semelhantes. Search for Homologies BLAST

Bioinformática BLAST. Blast information guide. Buscas de sequências semelhantes. Search for Homologies BLAST BLAST Bioinformática Search for Homologies BLAST BLAST - Basic Local Alignment Search Tool http://blastncbinlmnihgov/blastcgi 1 2 Blast information guide Buscas de sequências semelhantes http://blastncbinlmnihgov/blastcgi?cmd=web&page_type=blastdocs

More information

Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1

Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1 Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]

More information

Genome Explorer For Comparative Genome Analysis

Genome Explorer For Comparative Genome Analysis Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence

More information

Design Style of BLAST and FASTA and Their Importance in Human Genome.

Design Style of BLAST and FASTA and Their Importance in Human Genome. Design Style of BLAST and FASTA and Their Importance in Human Genome. Saba Khalid 1 and Najam-ul-haq 2 SZABIST Karachi, Pakistan Abstract: This subjected study will discuss the concept of BLAST and FASTA.BLAST

More information

A Tutorial in Genetic Sequence Classification Tools and Techniques

A Tutorial in Genetic Sequence Classification Tools and Techniques A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide

More information

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's dilemma To understand... Multiple sequence alignment Substitution matrices Phylogenetic trees You first need

More information

T cell Epitope Prediction

T cell Epitope Prediction Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments

More information

Searching Nucleotide Databases

Searching Nucleotide Databases Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames

More information

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper

More information

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources 1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools

More information

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6

Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 In the last lab, you learned how to perform basic multiple sequence alignments. While useful in themselves for determining conserved residues

More information

Human-Mouse Synteny in Functional Genomics Experiment

Human-Mouse Synteny in Functional Genomics Experiment Human-Mouse Synteny in Functional Genomics Experiment Ksenia Krasheninnikova University of the Russian Academy of Sciences, JetBrains [email protected] September 18, 2012 Ksenia Krasheninnikova

More information

Integration of data management and analysis for genome research

Integration of data management and analysis for genome research Integration of data management and analysis for genome research Volker Brendel Deparment of Zoology & Genetics and Department of Statistics Iowa State University 2112 Molecular Biology Building Ames, Iowa

More information

Biological Sequence Data Formats

Biological Sequence Data Formats Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA

More information

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006 Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm

More information

Sequence homology search tools on the world wide web

Sequence homology search tools on the world wide web 44 Sequence Homology Search Tools Sequence homology search tools on the world wide web Ian Holmes Berkeley Drosophila Genome Project, Berkeley, CA email: [email protected] Introduction Sequence homology

More information

Phylogenetic Trees Made Easy

Phylogenetic Trees Made Easy Phylogenetic Trees Made Easy A How-To Manual Fourth Edition Barry G. Hall University of Rochester, Emeritus and Bellingham Research Institute Sinauer Associates, Inc. Publishers Sunderland, Massachusetts

More information

Current Motif Discovery Tools and their Limitations

Current Motif Discovery Tools and their Limitations Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.

More information

α α λ α = = λ λ α ψ = = α α α λ λ ψ α = + β = > θ θ β > β β θ θ θ β θ β γ θ β = γ θ > β > γ θ β γ = θ β = θ β = θ β = β θ = β β θ = = = β β θ = + α α α α α = = λ λ λ λ λ λ λ = λ λ α α α α λ ψ + α =

More information

Guide for Bioinformatics Project Module 3

Guide for Bioinformatics Project Module 3 Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first

More information

Molecular Databases and Tools

Molecular Databases and Tools NWeHealth, The University of Manchester Molecular Databases and Tools Afternoon Session: NCBI/EBI resources, pairwise alignment, BLAST, multiple sequence alignment and primer finding. Dr. Georgina Moulton

More information

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov Data Integration Lectures 16 & 17 Lectures Outline Goals for Data Integration Homogeneous data integration time series data (Filkov et al. 2002) Heterogeneous data integration microarray + sequence microarray

More information

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0

More information
