Bio-Informatics Lectures A Short Introduction
The History of Bioinformatics
Sanger Sequencing: PCR in the presence of fluorescent, chain-terminating dideoxynucleotides
Massively Parallel Sequencing
Massively Parallel Sequencing Illumina/Solexa
Roche/454: Emulsion PCR (Metzker, Nature Reviews Genetics 11:31-46)
Illumina/Solexa: Solid-Phase Amplification
http://www.genome.gov/sequencingcosts/
Growth of GenBank and WGS 1000 billion bases ~200 million sequences http://www.ncbi.nlm.nih.gov/genbank/statistics
Growth of UniProtKB/TrEMBL http://www.ebi.ac.uk/uniprot/tremblstats
What Does the Sequence Information Tell Us? Bioinformatics
Scope of this lab
1. Be familiar with sequence databases and some online bioinformatics tools
DATABASES:
GenBank: http://www.ncbi.nlm.nih.gov
EMBL: http://www.ebi.ac.uk
DDBJ: http://www.ddbj.nig.ac.jp
Sequence Search and Retrieval: BLAST
Sequence Alignment: ClustalW2, MAFFT
Sequence Analysis and Domain Search: Pfam and SMART
Protein Structure and Prediction: PyMOL
Molecular Evolution: MEGA
More Tools to Discover on Your Own:
http://www.ebi.ac.uk/services/all
http://www.expasy.org
Online Tools
Scope of this lab
2. Try Some Simple Programming (Stand-alone)
Basic UNIX Commands: cd, mkdir, mv, cp, rm, cat, ls, pwd, gunzip, unzip, tar
Perl: String, Array, Hash
R: Read a file, column, row, plot, hist, heat map
Beginning with a DNA Sequence
Proteins
N-terminus MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGG C-terminus
The primary sequence, structure, and function of a protein are interrelated.
Database Sequence Similarity Searching
Definition: Applies computation, mathematical algorithms, and statistical inference to rapidly find sequences (hits) in a database that are similar to a target (query) sequence.
All similarity searching methods rely on the concept of alignment between sequences.
A similarity score is calculated from a distance: the number of DNA bases or amino acids that differ between two sequences.
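The simplest distance of this kind counts mismatched positions between two sequences of equal length (the Hamming distance). The sketch below is only an illustration; the function names and the example sequences are made up for this note and do not come from the slides.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that are identical (no gaps considered)."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


if __name__ == "__main__":
    print(hamming_distance("GATTACA", "GACTATA"))
    print(percent_identity("GATTACA", "GACTATA"))
```

This distance ignores insertions and deletions, which is why alignment-based measures are needed for real sequences.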
Edit Distance
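As an illustration, the edit (Levenshtein) distance between two sequences can be computed with a small dynamic-programming table. This is a minimal sketch assuming unit cost for each substitution, insertion, and deletion; the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

Only two rows of the table are kept in memory, which is enough when just the distance (not the alignment itself) is required.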
Sequence Alignment and Dynamic Programming
Sequence Alignment Comparison and Substitution Matrix
Some popular scoring matrices are:
PAM (Point Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.
BLOSUM (BLOcks SUbstitution Matrix): for finding common motifs. For example, BLOSUM62 is derived from alignments of sequences sharing no more than 62% identity.
Experimentation has shown that the BLOSUM62 matrix is among the best for detecting most weak protein similarities.
Sequence Alignment Comparison and Substitution Matrix
Sequence Alignment Comparison and Substitution Matrix Log-odds matrices
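A log-odds score compares how often two residues are observed aligned with how often they would pair by chance: s(i,j) = scale * log2(q_ij / (p_i * p_j)). The sketch below assumes a scale of 2 (half-bit units) with rounding to an integer; the frequencies in the example are invented for illustration and are not taken from any real matrix.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> int:
    """Scaled, rounded log-odds score for aligning residues i and j.

    q_ij : observed frequency of the aligned pair (i, j)
    p_i, p_j : background frequencies of residues i and j
    scale : multiplier applied before rounding (2.0 is an assumption)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


if __name__ == "__main__":
    # Hypothetical frequencies: the pair is seen 4x more often than chance.
    print(log_odds_score(0.125, 0.25, 0.125))
    # Hypothetical frequencies: the pair is seen 4x less often than chance.
    print(log_odds_score(0.0078125, 0.25, 0.125))
```

Positive scores mark substitutions seen more often than expected by chance, negative scores mark substitutions seen less often.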
Local and Global Alignments
Global: Needleman-Wunsch
Local: Smith-Waterman
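The two algorithms share the same dynamic-programming recurrence and differ mainly in initialization and in whether scores may drop below zero. The following sketch computes only the optimal scores, assuming a simple match/mismatch scheme and a linear gap penalty; these parameter values are illustrative choices, not ones prescribed by the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (both sequences aligned end to end)."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (best-scoring pair of substrings)."""
    best = 0
    prev = [0] * (len(b) + 1)  # a local alignment may start anywhere
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

A full implementation would also keep traceback pointers to recover the alignment itself, and would normally score amino acid pairs with a substitution matrix instead of fixed match/mismatch values.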
BLAST/FASTA Search and k-tuple Method
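The k-tuple idea can be sketched as follows: index every word of length k in the database sequence, then look up each word of the query to find exact seed matches that a heuristic search would later try to extend. This is a simplified illustration with hypothetical function names; real BLAST and FASTA add steps (for example scoring neighboring words and extending seeds) that are not shown here.

```python
from collections import defaultdict


def kmer_index(seq: str, k: int) -> dict:
    """Map every k-letter word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_hits(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = kmer_index(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


if __name__ == "__main__":
    # Hits lying on the same diagonal (subject_pos - query_pos) suggest one
    # longer matching region.
    print(seed_hits("ACGT", "TTACGTT", k=3))
```

Because only word lookups are needed to find candidate regions, this approach is much faster than running full dynamic programming against every database sequence.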
Use proteins for database similarity searches when possible
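To search with a protein, a coding DNA sequence first has to be translated. The sketch below builds the standard genetic code table and translates a sequence in reading frame 1; the helper name and the `to_stop` parameter are assumptions made for this illustration. The example input is the first 21 bases of the AT4G05320 sequence used in Lab 1.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, to_stop: bool = True) -> str:
    """Translate a DNA coding sequence (frame 1) into one-letter amino acids."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*" and to_stop:
            break  # stop codon reached
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGCAGATCTTTGTTAAGACT"))
```

The output matches the start of the protein sequence shown on the Proteins slide (MQIFVKT...).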
Lab 1
Sequence Search and Retrieval: BLAST
Sequence Alignment: ClustalW2, MAFFT
Sequence Analysis and Domain Search: Pfam and SMART
Protein Structure and Prediction: PyMOL
Molecular Evolution: MEGA
Sequence Format - FASTA
>AT4G05320
ATGCAGATCTTTGTTAAGACTCTCACCGGAAAGACAATCACCCTCGAGGTGGAAAGCTCCGACACCATCGACAACGTTAAGGC
CAAGATCCAGGATAAGGAGGGCATTCCTCCGGATCAGCAGAGGCTTATTTTCGCCGGCAAGCAGCTAGAGGATGGCCGTACG
TTGGCTGATTACAATATCCAGAAGGAATCCACCCTCCACTTGGTCCTCAGGCTCCGTGGTGGTATGCAGATTTTCGTTAAAACC
CTAACGGGAAAGACGATTACTCTTGAGGTGGAGAGTTCTGACACCATCGACAACGTCAAGGCCAAGATCCAAGACAAAGAGG
GTATTCCTCCGGACCAGCAGAGGCTGATCTTCGCCGGAAAGCAGTTGGAGGATGGCAGAACTCTTGCTGACTACAATATCCA
GAAGGAGTCCACCCTTCATCTTGTTCTCAGGCTCCGTGGTGGTATGCAGATTTTCGTTAAGACGTTGACTGGGAAAACTATCAC
TTTGGAGGTGGAGAGTTCTGACACCATTGATAACGTGAAAGCCAAGATCCAAGACAAAGAGGGTATTCCTCCGGACCAGCAG
AGATTGATCTTCGCCGGAAAACAACTTGAAGATGGCAGAACTTTGGCCGACTACAACATTCAGAAGGAGTCCACACTCCACTT
GGTCTTGCGTCTGCGTGGAGGTATGCAGATCTTCGTGAAGACTCTCACCGGAAAGACCATCACTTTGGAGGTGGAGAGTTCT
GACACCATTGATAACGTGAAAGCCAAGATCCAGGACAAAGAGGGTATCCCACCGGACCAGCAGAGATTGATCTTCGCCGGAA
AGCAACTTGAAGATGGAAGAACTTTGGCTGACTACAACATTCAGAAGGAGTCCACACTTCACTTGGTCTTGCGTCTGCGTGGA
GGTATGCAGATCTTCGTGAAGACTCTCACCGGAAAGACTATCACTTTGGAGGTAGAGAGCTCTGACACCATTGACAACGTGAA
GGCCAAGATCCAGGATAAGGAAGGAATCCCTCCGGACCAGCAGAGGTTGATCTTTGCCGGAAAACAATTGGAGGATGGTCGT
ACTTTGGCGGATTACAACATCCAGAAGGAGTCGACCCTTCACTTGGTGTTGCGTCTGCGTGGAGGTATGCAGATCTTCGTCAA
GACTTTGACCGGAAAGACCATCACCCTTGAAGTGGAAAGCTCCGACACCATTGACAACGTCAAGGCCAAGATCCAGGACAA
GGAAGGTATTCCTCCGGACCAGCAGCGTCTCATCTTCGCTGGAAAGCAGCTTGAGGATGGACGTACTTTGGCCGACTACAAC
ATCCAGAAGGAGTCTACTCTTCACTTGGTCCTGCGTCTTCGTGGTGGTTTCTAA
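A FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal reader could look like the sketch below; the function name is hypothetical, and it assumes the identifier is the first whitespace-separated word of the header.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {identifier: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # identifier = first word of header
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    example = ">AT4G05320\nATGCAGATCTTT\nGTTAAGACT\n"
    for identifier, sequence in parse_fasta(example).items():
        print(identifier, len(sequence))
```

For real work, an established library parser is usually a safer choice than a hand-written one, since FASTA files in the wild contain many small variations.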
Lab 1 - BLAST
Lab 1 - BLAST
E value: the expectation value, that is, the number of hits with a score at least as good as the observed one that are expected to occur by chance in a database of this size. The lower the E value, the more significant the hit.
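The E value is commonly modeled with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m the query length, and n the database length. The sketch below illustrates that relation; the values used for lambda and K are placeholders chosen for this example, not the parameters of any particular BLAST run.

```python
import math


def expect_value(score: float, m: int, n: int, lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least `score`.

    m, n : query length and total database length
    lam, k : statistical parameters that depend on the scoring system
    """
    return k * m * n * math.exp(-lam * score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected count."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), stable for small E


if __name__ == "__main__":
    # Illustrative parameter values only.
    for s in (30, 50, 70):
        e = expect_value(s, m=300, n=10**9, lam=0.3, k=0.1)
        print(s, e, p_value(e))
```

For small values the E value and the probability are nearly equal, which is why the two are often used interchangeably, but the E value can exceed 1 while a probability cannot.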
Lab 1 - Domain Search
Lab 1 - Structure Visualization PyMOL
Lab 1 - Phylogenetics http://www.megasoftware.net
Lab 1 - Phylogenetics
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Maximum likelihood
Maximum parsimony
Neighbor joining
MrBayes: Bayesian Inference of Phylogeny
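Of the methods listed, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and recompute distances to the merged cluster as a size-weighted average. The code below is a minimal illustration that returns only the tree topology as nested tuples (branch lengths are omitted); the function name and the distance values in the example are invented for this note.

```python
def upgma(names, matrix):
    """Build a UPGMA tree topology from a symmetric distance matrix.

    names  : list of taxon labels
    matrix : list of lists, matrix[i][j] = distance between names[i] and names[j]
    Returns nested tuples, e.g. (("A", "B"), ("C", "D")).
    """
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    d = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters (i < j).
        i, j = min(((x, y) for x in range(n) for y in range(x + 1, n)),
                   key=lambda p: d[p[0]][p[1]])
        (ti, si), (tj, sj) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to the others.
        new_row = [(d[i][k] * si + d[j][k] * sj) / (si + sj) for k in keep]
        # Rebuild the matrix without i and j, then append the merged cluster.
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((ti, tj), si + sj)]
    return clusters[0][0]


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D"]
    distances = [[0, 2, 10, 10],
                 [2, 0, 10, 10],
                 [10, 10, 0, 4],
                 [10, 10, 4, 0]]
    print(upgma(taxa, distances))
```

UPGMA assumes a constant rate of evolution across lineages, which is one reason the other listed methods are often preferred for real data.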