BLAST
Bioinformatics: Search for Homologies

BLAST - Basic Local Alignment Search Tool
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST information guide
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs

Searching for similar sequences
- Widely used in bioinformatics.
- The goal is to learn more about DNA, RNA and protein sequences by searching for similar sequences with known functions.
- A search involves:
  - search software
  - databases of annotated sequences
- The final aim is to obtain good-quality alignments between our sequence and the sequence(s) in the database.
Alignments and search terms
- Alignment: the pairing of two sequences.
- Global alignment: aligns the sequences over their entire length.
- Local alignment: finds and aligns the most similar regions between the sequences.
- A similarity search in a database aligns a single query sequence against each sequence in the database (the target sequence).
- If good similarities are found, the search produces a list of HSPs (High-scoring Segment Pairs), which are local alignments between the query and the target.

Alignment terms
- Match: two identical letters at the same position in the alignment.
- Mismatch: two different letters at the same position in the alignment.
- Gaps
- Positive: a conservative substitution at a position in an alignment.
- Percent identity: 100 * (number of matches / length of the alignment)
- Percent positives: 100 * (number of positives / length of the alignment)

BLAST - Basic Local Alignment Search Tool
- Altschul et al., 1990
- The most heavily used program.
- Very fast because it uses a heuristic to speed up the search; for that reason it is not guaranteed to find the highest possible score for a local alignment.
- Includes programs that search for local alignments (HSPs) between the query sequence and the target database (DNA or protein).

BLAST Statistics
BLAST uses statistical theory to produce a bit score and an expect value (E-value) for each alignment pair (query to hit).

Bit score
- The value S' is derived from the raw alignment score S, taking into account the statistical properties of the scoring system used.
- Normalizing a raw score with the formula

  $$S' = \frac{\lambda S - \ln K}{\ln 2}$$

  gives a bit score S', which has a standard set of units. K and lambda are the statistical parameters of the scoring system.
- Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
- The higher the score, the better the alignment.
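As a rough illustration of the bit-score normalization, the sketch below applies the standard relation S' = (lambda * S - ln K) / ln 2 and then the usual Karlin-Altschul estimate E = m * n * 2^(-S') for the expect value. The E-value relation is not spelled out in these slides, and the values of lambda, K, the query length and the database sizes are assumed purely for illustration; real values are reported by BLAST for each search.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments with a bit score >= bits."""
    return query_len * db_len * 2.0 ** (-bits)


# Assumed, illustrative parameters (not taken from the slides).
LAMBDA = 0.267
K = 0.041

bits = bit_score(100, LAMBDA, K)
# Same alignment, two hypothetical database sizes (in residues).
e_small_db = expect_value(bits, query_len=250, db_len=1_000_000)
e_large_db = expect_value(bits, query_len=250, db_len=100_000_000)

print(f"bit score: {bits:.1f}")
print(f"E-value, small database: {e_small_db:.2e}")
print(f"E-value, large database: {e_large_db:.2e}")
```

The same bit score gives a larger E-value against the larger database, which is why hits from searches against different databases should be compared by bit score rather than by E-value.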
BLAST Statistics: E-value
- The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used.
- The lower the E-value, the more significant the hit.
- A sequence alignment with an E-value of 0.05 means that this similarity has a 5 in 100 (1 in 20) chance of occurring by chance alone (strictly, the E-value is the expected number of chance hits; for small values it approximates this probability).
- Although a statistician might consider this to be significant, it still may not represent a biologically meaningful result, and analysis of the alignments is required to determine biological significance.

Heuristic alignment algorithms
Parameters for assessing the quality of an alignment:
- How likely is this similarity?
- Did it occur by chance?

BLOSUM62 Substitution Matrix
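The percent identity and percent positives definitions can be tried out on a small gapped alignment. The sketch below is only an illustration: POSITIVE_PAIRS is a hand-picked excerpt of BLOSUM62 containing a few substitutions with positive scores, not the full matrix, so any pair missing from it is treated as a non-positive mismatch. It also assumes, as in BLAST reports, that identical residues are counted among the positives. The two aligned sequences are invented.

```python
# Partial excerpt of BLOSUM62 (assumption: illustration only, not the full matrix).
POSITIVE_PAIRS = {
    frozenset("IV"): 3,
    frozenset("IL"): 2,
    frozenset("KR"): 2,
    frozenset("DE"): 2,
    frozenset("ST"): 1,
    frozenset("FY"): 3,
}


def alignment_stats(query: str, target: str) -> dict:
    """Count matches, positives and gaps in two aligned, equal-length strings."""
    if len(query) != len(target):
        raise ValueError("aligned sequences must have the same length")
    matches = positives = gaps = 0
    for a, b in zip(query, target):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
            positives += 1  # identities are counted among the positives
        elif frozenset((a, b)) in POSITIVE_PAIRS:
            positives += 1
    length = len(query)
    return {
        "matches": matches,
        "positives": positives,
        "gaps": gaps,
        "percent_identity": 100 * matches / length,
        "percent_positives": 100 * positives / length,
    }


# Invented toy alignment: 3 identities, 4 conservative substitutions, 1 gap.
stats = alignment_stats("MKVLDE-T", "MRILDDGS")
print(stats)
```

With a full substitution matrix the same loop applies unchanged; only the lookup table grows.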
The BLAST family
Nucleotide sequences - which algorithm to use?

Program Selection for Nucleotide Queries

Length ¹ | Database | Purpose | Program
20 bp or longer (28 bp or above for megablast) | Nucleotide | Identify the query sequence | discontiguous megablast, megablast, or blastn
20 bp or longer (28 bp or above for megablast) | Nucleotide | Find sequences similar to query sequence | discontiguous megablast or blastn
20 bp or longer (28 bp or above for megablast) | Nucleotide | Find similar sequence from the Trace archive | Trace megablast, or Trace discontiguous megablast
20 bp or longer (28 bp or above for megablast) | Nucleotide | Find similar proteins to translated query in a translated database | Translated BLAST (tblastx)
20 bp or longer (28 bp or above for megablast) | Peptide | Find similar proteins to translated query in a protein database | Translated BLAST (blastx)
7-20 bp | Nucleotide | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches

Translated BLAST: these programs first translate the DNA sequence into a potential protein and only then compare the sequences.

NOTE: ¹ The cut-off is only a recommendation. For short queries, one is more likely to get matches if the "Search for short, nearly exact matches" page is used. Detailed discussion is in Section 4 below. With default settings, the shortest unambiguous query one can use is 11 for blastn and 28 for MEGABLAST.

MegaBlast
- MEGABLAST is a BLAST service that accepts multiple queries.
- Discontiguous MEGABLAST is better at finding nucleotide sequences that are similar, but not identical, to your query sequence.

Search for short nearly exact matches
- "Search for short nearly exact matches" should be used to search for primers or short sequences.
- Sequences shorter than 20 bp normally give no significant results with a standard blastn, because the constraints used in the E-value calculations are too tight.

Parameter settings for standard blastn and "Search for short and nearly exact matches"

Program | Word Size | DUST Filter Setting | Expect Value
Standard blastn | 11 | On | 10
Search for short nearly exact matches | 7 | Off | 1000
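The effect of the word size setting can be seen with a very simplified sketch of the seeding step: collect every exact word of a given length shared by the query and the target. This is only an illustration under stated assumptions. Real BLAST goes on to extend seeds, score them and apply filters, and the function name find_seeds and the toy sequences are made up for this example.

```python
def find_seeds(query: str, target: str, word_size: int) -> list:
    """Return (query_pos, target_pos) for every exact word shared by both."""
    # Index every word of length word_size in the target.
    index = {}
    for j in range(len(target) - word_size + 1):
        index.setdefault(target[j:j + word_size], []).append(j)
    # Look up every query word in that index.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


# Invented toy data: a 9-base query that occurs once in the target.
query = "ACGTTGCAT"
target = "GGGACGTTGCATCCC"

print(find_seeds(query, target, word_size=11))  # query shorter than the word
print(find_seeds(query, target, word_size=7))   # three overlapping 7-base words
```

With a word size of 11 the 9-base query contains no word at all, so no seed can be found even though the query is present in the target; lowering the word size to 7 recovers it. This matches the motivation for the smaller word size on the short-query page.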
Amino acid sequences - which program to use?

Program Selection for Protein Queries

Length ¹ | Database | Purpose | Program
15 residues or longer | Peptide | Identify the query sequence or find protein sequences similar to the query | Standard Protein BLAST (blastp)
15 residues or longer | Peptide | Find members of a protein family or build a custom position-specific score matrix | PSI-BLAST
15 residues or longer | Peptide | Find proteins similar to the query around a given pattern | PHI-BLAST
15 residues or longer | Peptide | Find conserved domains in the query | CD-search (RPS-BLAST)
15 residues or longer | Peptide | Find conserved domains in the query and identify other proteins with similar domain architectures | Conserved Domain Architecture Retrieval Tool (CDART)
15 residues or longer | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5-15 residues | Peptide | Search for peptide motifs | Search for short, nearly exact matches

"Search for short nearly exact matches" is optimized for finding short peptides. Queries longer than 5 aa are recommended.

Parameter settings for standard blastp and "Search for short and nearly exact matches"

Program | Word Size | SEG Filter | Expect Value | Score Matrix
Standard Protein Blast | 3 | On | 10 | BLOSUM62
Search for short and nearly exact matches | 2 | Off | 20000 | PAM30

Note: ¹ The cut-off is only a recommendation. For short queries, one is more likely to get matches if the "Search for short, nearly exact matches" page is used. Detailed discussion is in Section 4 below.

BLASTP
Attention to the differences between Identities and Positives.

Exercises: Blastn & Blastx

>1
GTTGCAGCAATGGTAGACTCAACGGTAGCAATAACTGCAGGACCTAGAGGAAAAACAGTAGGGATTAAT
AAGCCCTATGGAGCACCAGAAATTACAAAAGATGGTTATAAGGTGATGAAGGGTATCAAGCCTGAAAAA
CCATTAAACGCTGCGATAGCAAGCATCTTTGCACAGAGTTGTTCTCAATGTAACGATAAAGTTGGTGATGG
TACAACAACGTGCTCAATACTAACTAGCAACATGATAATGGAAGCTTCAAAATCAATTGCTGCTGGAAACG
ATCGTGTTGGTATTAAAAACGGAATACAGAAGGCAAAAGATGTAATATTAAAGGAAATTGCGTCAATGTC
TCGTACAATTTCTCTAGAGAAAATAGACGAAGTGGCACAAGTTGCAATAATCTCTGCAAATGGTGATAAG
GATATAGGTAACAGTATCGCTGATTCCGTGAAAAAAGTTGGAAAAGAGGGTGTAATAACTGTTGAAGAG
AGTAAAGGTTCAAAAGAGTTAGAAGTTGAGCTGACTACTGGCATGCAATTTGATCGCGGTTATCTCTCTCC
GTATTTTATTACAAATAATGAAAAAATGATCGTGGAGCTTGATAATCCTTATCTATTAATTACAGAGAAAA
AATTAAATATTATTCAACCTTTACTTCCTATTCTTGAAGCTATTGTTAAATCTGGTAAACCTTTGGTTATTATT
GCAGAGGATATCGAAGGTGAAGCATTAAGCACTTTAGTTATCAATAAATTGCGTGGTGGTTTAAAAGTTG
CTGCAGTAAAAGCTCCAGGTTTTGGTGACAGAAGAAAGGAGATGCTCGAAGACATAGCAACTTTAACTGG
TGCTAAGTACGTCATAAAAGATGAACTT

>2
GTTGCAGCAATGGTAGACTCAACGGTAGCAATAACTGCAGGACCTAGAGGAAAAACAGTAGGGATTAAT
AAGCCCTATGGAGCACCAGAAATTACAAAAGATGGTTATAAGGTGATGAAGGGTATCAAGCCTGAA
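Translated searches such as blastx first turn the nucleotide query into candidate proteins and only then compare them. A minimal sketch of that translation step, assuming the standard genetic code and a plain six-frame translation (three frames on each strand), is given below. The function names are illustrative, not part of any BLAST tool, and the test fragment is the first 27 bases of exercise sequence 1.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate from offset `frame` (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> dict:
    """Translate all three frames of both strands."""
    rc = reverse_complement(dna.upper())
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


# First 27 bases of exercise sequence 1.
fragment = "GTTGCAGCAATGGTAGACTCAACGGTA"
for name, protein in six_frames(fragment).items():
    print(name, protein)
```

Stop codons appear as `*` in the output; frames with many stops are unlikely to encode the protein, which is one reason a translated search reports the frame of each hit.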
Exercises: Blast 2seqs

>gi|121490207|emb|AM283098.1| Quercus ilex partial mRNA for alpha-tubulin 6 (atub6 gene)
ACCCCAGGATTCATTTCATGCTTTCTTCGTATGCCCCAGTTATCTCAG
CTGAAAAGGCATATCATGAGCAGCTTTCAATTCCTGAAATCACAAATG
CAGTGTTTGAGCCCTCAAGCATGATGGCTAAGTGTGATCCAAGGCAT
GGGAAATACATGGCCTGCTGCTTAATGTACCGGGGAGATGTTGTTCC
CAAGGATGTTAATGCTGCCGTTGGCACCATCAAAACCAAAAGAACTGT
TCAGTTTGTTGACTGGTGCCCAACTGGCTTCAAATGTGGCATCAACTA
TCAGCCTCCAACAGTTGTACCCGGTGGTGATCTTGCCAAGGTGCAGC
GAGCTGTCTGCATGATCAGCAACAACACAGCAGTAGCTGAGGTTTTCT
CACGTATTGACCACAAATTTGATCTCATGTATTCCAAAAGAGCATTTGT
TCACTGGTATGTTGGTGAGGGCATGGAGGAAG

>F
TTGTTGACTGGTGCCCAACT

>R
CTCCATGCCCTCACCAACAT
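Before running the primers through Blast 2seqs, it can help to see what a primer hit means at the sequence level: the forward primer matches the template directly, while the reverse primer matches as its reverse complement. The sketch below assumes exact matches only (BLAST itself tolerates mismatches), uses a hypothetical helper named find_amplicon, and runs on a short invented template built around the two exercise primers rather than on the full Quercus ilex sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_amplicon(template: str, forward: str, reverse: str):
    """Return (start, end, length) of the region spanned by both primers.

    Exact matching only; returns None if either primer site is missing.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search for
    # its reverse complement downstream of the forward primer.
    site = reverse_complement(reverse.upper())
    site_pos = template.find(site, start + 1)
    if site_pos == -1:
        return None
    end = site_pos + len(site)
    return start, end, end - start


FORWARD = "TTGTTGACTGGTGCCCAACT"  # primer F from the exercise
REVERSE = "CTCCATGCCCTCACCAACAT"  # primer R from the exercise

# Invented toy template: flank + F site + spacer + R site + flank.
toy_template = (
    "AAAA" + FORWARD + "GGCTTCAAATGTGG" + reverse_complement(REVERSE) + "GAAG"
)
print(find_amplicon(toy_template, FORWARD, REVERSE))
```

Pasting the full exercise sequence into `toy_template` would report where the two primers bind and the size of the expected product, which can then be compared with the Blast 2seqs alignment coordinates.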