Bioinformatics Laboratory, Lecture #2. Dr. Marco Fondi. Contact: marco.fondi@unifi.it, www.unifi.it/dblemm/, tel. 0552288308. Dip.to di Biologia Evoluzionistica, Laboratorio di Evoluzione Microbica e Molecolare, Università di Firenze
Lecture #2: a) Web resources for bioinformatics; b) BLAST (Basic Local Alignment Search Tool)
Wet-lab experiments produce DATA, which are stored in WEB databases: bibliographic, taxonomic, nucleotide, genomic, protein, and microarray databases.
Knowledge bases = biological databases. They are the starting point of any analysis, bioinformatic or otherwise.
Database overview: sequence data / genome data (atgctggactgagtaatcct, MQYYLERRSQMPGYTRYMML); gene prediction (ORF finding); protein structure; taxonomy; metabolic pathway information; expression profiles (microarray data).
EMBL-EBI
GenBank
PDB (Protein DataBank) database
JGI Database
sequence in FASTA Format
FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC
FASTA definition line: >gi|193425|gb|M60978.1|MUSGAPDS
193425 = gi number; gb = database identifier; M60978.1 = accession number; MUSGAPDS = locus name
Database identifiers: gb = GenBank; emb = EMBL; dbj = DDBJ; sp = SWISS-PROT; pdb = Protein Data Bank; pir = PIR; ref = RefSeq
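The fields of the definition line can be pulled apart mechanically. What follows is a minimal sketch, assuming the identifiers are separated by `|` characters as in NCBI-style headers (the separators do not survive in the slide text). The names `parse_defline` and `DB_TAGS` are hypothetical and chosen only for this illustration.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Data Bank", "pir": "PIR", "ref": "RefSeq",
}

def parse_defline(defline):
    """Split a '>gi|number|db|accession|locus description' header into fields."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Everything up to the first space is the identifier block.
    identifiers, _, description = defline[1:].partition(" ")
    parts = identifiers.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError("unexpected identifier layout: " + identifiers)
    _, gi_number, db_tag, accession, locus = parts
    return {
        "gi": gi_number,
        "database": DB_TAGS.get(db_tag, db_tag),  # fall back to the raw tag
        "accession": accession,
        "locus": locus,
        "description": description,
    }

record = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald"
)
print(record["gi"], record["database"], record["accession"], record["locus"])
```

Headers that do not follow this five-field layout are rejected rather than guessed at, since other databases arrange their identifiers differently.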
Two ways to query a DB: text search, or sequence similarity search (BLAST) with a sequence in FASTA format:
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
DNA molecule. Sequence in FASTA format:
>Chromosome (TITLE)
ATCATTATTGATCCTGATCGGTTAGCAT
CGTATTTCCTTACCGGGACCCCATGATC
GATACAGTAAACCTTAGGATGATTATTG
ATGCTGATCGGTTAGCATCGTATTTCCT
TACCGGGACCCCATGATCGATACAGTA
AACCTTAGGTGATTATTGATCCTGATCG
GTTAGCATCGTATTTCCTTACCGGGACC
CCATGATCGATACAGTAATAATTAGGAT
GATTATTGATCCTGATCGGTTAGCATCG
TATTTCCTTACCGGGACCCCATGATCGA
TACAGTAAACCTTAGGATGATTATTGAT
CCTGATCGGTTAGCATCGTATTTCCTTA
CCGGGACCCCATGATCGATACAGTAAA
CCTTAGATGATTATTGATCCTGATCGGT
ATGCATCGTATTTCCTTACCGGGACCCC
ATGATCGATACAGTAAACCTTAGGTTGA
ATCGTATTTCCTTACCGGGACCCCATGA
TCGATACAGTAAACCTTAGGTAGCATCG
TATTTCCTTACCGGGACCCCATGATCGA
ATGAGTAAACCTTAGGTAGCATTGAATT
TCCTTACCGGGACCCCATGATCGATACA
GTAAACCTTAGG..
ORF Finder @ NCBI:
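The idea behind ORF finding can be sketched in a few lines. This is not the algorithm used by NCBI ORF Finder. It assumes that an ORF is an ATG followed in frame by the first stop codon (TAA, TAG or TGA), and it scans the three frames of each strand. The names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=10):
    """Return (strand, frame, start, end, orf) for each ATG...stop stretch.

    start and end are 0-based offsets on the strand that was scanned;
    the stop codon is included and counted in the ORF length.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None  # position of the currently open ATG, if any
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None
    return orfs

for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=5):
    print(orf)
```

The `min_codons` threshold matters in practice: short ATG...stop stretches occur by chance in any sequence, so a length cut-off is the simplest way to discard them.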
I have a gene (a sequence): which metabolic process is it involved in? Given a metabolic process, which genes are involved?
Metabolic pathways information @ KEGG
Apoptosis in Homo sapiens
Apoptosis in Monodelphis domestica
Every protein has its own 3D structure. Amino acid sequence: NLKTEWPELVGKSVEE AKKVILQDKPEAQIIVL PVGTIVTMEYRIDRVR LFVDKLDNIAEVPRVG. Folding!
Protein structure on the web: known structures; structure predictions. If prediction = true
Protein structure prediction
Protein structure @ NCBI
Drug design; protein-protein docking; evolution; proteomics; functional assignment.
Expression profiles (microarray data): array analysis, hierarchical clustering.
Gene Expression @ NCBI
Expression profile: protein-protein interactions, functional assignment, proteomics.
NCBI (http://www.ncbi.nlm.nih.gov/): Entrez interface to databases (Medline/OMIM, GenBank/GenPept/Structures); BLAST server(s), with five-plus flavors of BLAST; draft human genome; much, much more.
INTEGRATION!!!
Things to know and remember about using web server-based tools: You are using someone else's computer. You are (probably) using a restricted subset of the available options. They are very useful for quick, preliminary analyses. For more accurate and complex analyses it is preferable to run databases and software locally. Practice and (intelligent!) mistakes are the best way to learn.
Sequence comparison: BLAST (Basic Local Alignment Search Tool)
Why compare sequences? To identify which other organisms possess the gene under study (the query), e.g. antibiotic production, drug targets. To obtain a preliminary functional assignment (hypothetical protein, putative function).
Functional assignment. Protein X (AACGT TTGCC TATAG) has unknown function. Sequence comparison (BLAST) against a sequence database returns similar sequences: proteins 1 to 8, all with function A. The functional information is transferred to the query: protein X, function A.
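The transfer step can be written down as a simple rule. The sketch below is one possible rule, not a standard method: it assigns the query the most common function among its similar sequences, and only if more than half of them agree. The name `transfer_function` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter

def transfer_function(hit_functions, min_fraction=0.5):
    """Return the most common function among the hits, or None when no
    function is shared by more than min_fraction of them."""
    if not hit_functions:
        return None
    function, count = Counter(hit_functions).most_common(1)[0]
    return function if count / len(hit_functions) > min_fraction else None

# Proteins 1 to 8 on the slide all carry function A.
hits = ["function A"] * 8
print(transfer_function(hits))  # function A
```

When the hits disagree, the function returns `None`: the assignment stays "hypothetical protein" rather than picking an arbitrary annotation.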
Sequence in FASTA format (QUERY):
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
BLAST compares the query against the DB and returns a list of sequences similar to the query.
BLAST on the web @ NCBI
Using basic BLAST methods. Example: MASH-1 protein sequence from mouse. Can I find similar proteins in human?
Input Query Choose Database
Submitting your query. Input the query sequence as FASTA, raw sequence, or accession/ID. Choose a database: many are available, and the choice varies with the program. For the complete list follow the link to:
Finds Conserved Domains Limit results with entrez query E-Value cut off
Submitting your query. CD Search: finds conserved domains in the query sequence by comparing it to patterns and profiles of CDs. Limit by Entrez query: restricts results to a single organism, etc. E-value cut-off: restricts results to those falling below the defined E-value (default = 10). We will revisit the concept of E-value.
Filtering Matrix Gap Penalties
Submitting your query: low-complexity filtering. Low-complexity sequence can lead to spurious alignments; filtering hides these regions. On by default: SEG (proteins) or DUST (nucleic acids). It should be turned off in some cases: what if your entire sequence gets filtered?
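To see why filtering can swallow a whole query, here is a minimal sketch of composition-based masking. It is not the SEG or DUST algorithm: it simply masks every window whose Shannon entropy falls below a threshold. The window length and threshold are illustrative values, and `mask_low_complexity` is a hypothetical name.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

print(mask_low_complexity("PPPPPPPPPPPPPPPP"))      # a proline run is fully masked
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))  # a diverse sequence is untouched
```

A query made only of repeats comes back entirely masked, which leaves nothing to search with. That is the case in which the filter should be switched off.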
Submitting your query: choice of scoring matrix. Different ones are available. BLOSUM matrices are based on observed frequencies of amino acid substitutions, each tailored to a different level of sequence divergence and length. BLOSUM62 = default: shown to be best at detecting most protein similarities, so you don't usually need to change it. Follow the link for detailed information.
Submitting your query: gap penalties. They account for insertions and deletions between sequences. Scores are penalized for gaps to prevent aberrant alignments. The opening penalty is high; the extension penalty is lower. Defaults may change depending on the matrix choice. You rarely need to change the default values.
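The effect of a high opening penalty and a lower extension penalty is easy to check numerically. The sketch below assumes the affine convention "existence + length x extension" and the values 11 and 1, which I take to be the usual blastp defaults with BLOSUM62; treat both as assumptions and check the settings on the submission form. `gap_cost` is a hypothetical name.

```python
def gap_cost(length, existence=11, extension=1):
    """Affine cost of a single gap spanning `length` residues."""
    if length < 1:
        raise ValueError("a gap must span at least one residue")
    return existence + length * extension

one_long_gap = gap_cost(3)          # one gap of three residues
three_short_gaps = 3 * gap_cost(1)  # three separate single-residue gaps
print(one_long_gap, three_short_gaps)  # 14 36
```

Under this scheme one long insertion or deletion is much cheaper than several scattered ones, which matches how real indels tend to occur and discourages alignments riddled with small gaps.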
Protein words. Query: GTQITVEDLFYNIATRRKALKN. Word size = 3 (default); word size can only be 2 or 3. Make a lookup table of words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
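The word list and lookup table can be reproduced directly from the query on the slide. This is a minimal sketch that keeps only the exact words; real BLAST also adds similar "neighborhood" words that score above a threshold, which is omitted here. `make_words` and `build_lookup` are hypothetical names.

```python
def make_words(query, word_size=3):
    """All overlapping words of length word_size, in query order."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

def build_lookup(query, word_size=3):
    """Map each word to the list of query positions where it starts."""
    table = {}
    for position, word in enumerate(make_words(query, word_size)):
        table.setdefault(word, []).append(position)
    return table

query = "GTQITVEDLFYNIATRRKALKN"
words = make_words(query)
print(words[:8])   # ['GTQ', 'TQI', 'QIT', 'ITV', 'TVE', 'VED', 'EDL', 'DLF']
print(len(words))  # 20
```

A dictionary is used so that, while scanning the database, each database word can be checked against the query in constant time.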
Query: GTQITVEDLFYNIATRRKALKN. Words: TQI, QIT, ITV, TVE, VED, EDL, DLF, ... Match! The word GTQ is found in the DB; extend the alignment from the match in both directions. DB: TVEDLFRRLKIAGTQEDLRRT; GGHPYTTFWWYQLMERGTQ; GRTHPYTTTWWEWHHRGTQ; GRTHPYTTTWWEWHHRGTQ; GRTHPYTTTWWEWHHRGTQ; GRTHPYTTTWWEWHHRGTQ
Query: GTQITVEDLFYNIATRRKALKN. Each DB sequence receives a score: TVEDLFRRLKIAGTQEDLRRT: Score; GGHPYTTFWWYQLMERGTQ: Score; GRTHPYTTTWWEWHHRGTQ: Score; GRTHPYTTTWWEWHHRGTQ: Score; GRTHPYTTTWWEWHHRGTQ: Score; ...; GRTHPYTTTWWEWHHRGTQ: Score; ...
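The match, extend, and score steps can be sketched on the slide's own sequences. This is a deliberate simplification, not BLAST itself: it scores +1 for an identical residue and -1 for a mismatch instead of using BLOSUM62, allows no gaps, and stops extending once the running score drops 2 below the best seen. All function names and parameters are hypothetical.

```python
def word_hits(query, subject, word_size=3):
    """Yield (query_pos, subject_pos) for every exact word shared by both."""
    lookup = {}
    for q in range(len(query) - word_size + 1):
        lookup.setdefault(query[q:q + word_size], []).append(q)
    for s in range(len(subject) - word_size + 1):
        for q in lookup.get(subject[s:s + word_size], []):
            yield q, s

def extend_hit(query, subject, q, s, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed; returns (score, query_start, query_end)."""
    def walk(qi, si, step):
        # Best score gain, and residues added, walking in one direction.
        gain = best_gain = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            gain += match if query[qi] == subject[si] else mismatch
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain >= x_drop:
                break
            qi += step
            si += step
        return best_gain, best_len

    right_gain, right_len = walk(q + word_size, s + word_size, +1)
    left_gain, left_len = walk(q - 1, s - 1, -1)
    score = word_size * match + left_gain + right_gain
    return score, q - left_len, q + word_size + right_len

def best_ungapped_score(query, subject, word_size=3):
    """Highest extension score over all seeds, or 0 when there is no seed."""
    return max((extend_hit(query, subject, q, s, word_size)[0]
                for q, s in word_hits(query, subject, word_size)), default=0)

query = "GTQITVEDLFYNIATRRKALKN"
database = ["TVEDLFRRLKIAGTQEDLRRT", "GGHPYTTFWWYQLMERGTQ", "GRTHPYTTTWWEWHHRGTQ"]
for subject in database:
    print(subject, best_ungapped_score(query, subject))
```

With this toy scoring the first database sequence scores 6, because it shares the stretch TVEDLF with the query, while the other two share only the word GTQ and score 3.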
E-values Bit Scores
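The two quantities are linked by the Karlin-Altschul formulas: the bit score is S' = (lambda * S - ln K) / ln 2, and the E-value is E = m * n * 2^(-S'), where S is the raw score and m and n are the query and database lengths. The sketch below uses plain lengths, whereas BLAST applies effective (edge-corrected) lengths. The values lambda = 0.267 and K = 0.041 are ones I assume for BLOSUM62 with gap costs 11/1; the statistics footer of a BLAST report gives the values actually used.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2 ** (-bits)

# Assumed parameters and sizes, for illustration only.
bits = bit_score(raw_score=100, lam=0.267, k=0.041)
print(round(bits, 1), e_value(bits, query_len=250, db_len=10_000_000))
```

Each additional bit halves the E-value, and doubling the database size doubles it. The same alignment therefore looks less significant in a larger database.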
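Basic BLAST programs and databases
Program   Query                                         Database
blastn    nucleotide sequence                           nucleotide DB
blastp    protein sequence                              protein DB
blastx    nucleotide sequence, translated in 6 frames   protein DB
tblastn   protein sequence                              nucleotide DB, translated in 6 frames
tblastx   nucleotide sequence, translated in 6 frames   nucleotide DB, translated in 6 frames
Translated queries and translated DBs contain amino acid sequences.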
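"In 6 frames" means three reading frames on each strand. What follows is a minimal sketch of six-frame translation, assuming the standard genetic code and an upper- or lower-case ACGT input; `six_frames` and `translate` are hypothetical names, and any codon containing another character is shown as X.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate codon by codon from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the translations of frames +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return {f"{strand}{frame + 1}": translate(seq[frame:])
            for strand, seq in (("+", forward), ("-", reverse))
            for frame in range(3)}

frames = six_frames("atgctggactgagtaatcct")
for name, protein in frames.items():
    print(name, protein)
```

blastx and tblastx apply this kind of translation to the query, and tblastn and tblastx apply it to the database, so that the comparison itself is always made between amino acid sequences.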