Sequence information - lectures
- Magnus Lang
1 Sequence information - lectures

- Pairwise alignment
- Alignments in database searches
- Multiple alignments
- Profiles
- Patterns
- RNA secondary structure / Transformational grammars
- Genome organisation / Gene prediction / HMMs
- Phylogeny
2 Dates in the history of sequence information

Experimental
1953   Amino acid sequence of insulin
       Base sequence of a tRNA molecule, 75 bases
1976   Sequence of MS2 RNA, 3569 bases
1970s  Birth of cloning
1977   DNA sequencing (Sanger, F.)
1978   Bacteriophage genome fd (6408 bases)
1995   Genomes of H. influenzae & M. genitalium (1.8 / 0.6 million bases)
2000   >95% of human genome (3000 million bases)

Sequence analysis
1970   Global alignment (Needleman-Wunsch)
1975   PAM matrix (Dayhoff)
1981   Local alignment (Smith-Waterman)
1982   EMBL database
1985   FASTA
1985   Multiple sequence alignment, 1st attempts
1988   Neural networks (protein secondary structure)
1989   Clustal
1989   Profilesearch
1991   BLAST
1994   Profile HMMs
1994   Context-free grammars (RNA secondary structure)
1997   PSI-BLAST
3 Sequences - The analysis of raw data from the laboratory

Base-calling (sequence from raw data):
- Phred

Sequence assembly:
- Phrap
- CAP3 / CAP4

Cleaning up sequences:
- Quality filtering
- Vector filtering
4 Sequence formats / ASCII-Binary formats

In ASCII text format (= human readable) each character is stored as a byte, for instance the ASCII code of 'A' = 65 as a decimal number = 01000001 as a binary number.

However, sequence data is often stored in a binary format. For instance, in a binary system the four bases may be stored as:

A = 00
T = 01
C = 10
G = 11

In this way there is room in one byte (= 8 bits) for 4 bases.

Typically databases are downloaded by the bioinformatician in ASCII format but then reformatted in a binary format for use with different sequence analysis tools. One example is databases for BLAST searches, which may be formatted by the NCBI utility 'formatdb'.
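The 2-bit encoding described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, using the code table from the slide (A = 00, T = 01, C = 10, G = 11). The function names `pack` and `unpack` are made up for this example, and real tools such as formatdb use their own file layouts and also have to handle ambiguity codes such as N.

```python
# Code table taken from the slide; the index of each base in BASES is its 2-bit code.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASES = "ATCG"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # A short final chunk is left-aligned; the unused low bits stay 0.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Reverse of pack(); 'length' is needed because the last byte may be padded."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


if __name__ == "__main__":
    packed = pack("ATCGA")
    print(len("ATCGA"), "bases stored in", len(packed), "bytes")
    print(unpack(packed, 5))
```

Note that the sequence length has to be stored separately, since a padded final byte is otherwise indistinguishable from trailing A's.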
5 How do you know if a file contains ASCII text or has a binary format?

Using cat, more or less on a binary file, for instance

% cat /usr/local/bin/less

produces unreadable material, like:

^?ELF^A^B^A^@^@^@^@^@^@^@^@^@^@^B^^@^@^@^A^@@:\260^@^@^@4^@^Az0^@^@^@^D^@4^@ ^@ ^G^@(^@^Q^@^P^@^@^@^F^@^@^@4^@@^@4^@@^@4^@^@^@\340^@^@^@^@^@^@^@^D^@^@^@^D^@^@^@
^C^@^@^A ^@@^A ^@@^A ^@^@^@^S^@^@^@^S^@^@^@^D^@^@^@^Dp^@^@^@^@^@^A@^@@^A@^@@^A@
^@^@^@^X^@^@^@^X^@^@^@^D^@^@^@^D^@^@^@^B^@^@^A\200^@@^A\200^@@^A\200^@^@00^@^@00
^@^@^@^D^@^@^@^Pp^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D
^@^@^@^A^@^@^@^@^@@^@^@^@@^@^@^@^A@^@^@^A@^@^@^@^@^E^@^A^@^@^@^@^@^A^@^A@^@^P^@
^@^@^P^@^@^@^@^@ ^@^@^@;\200^@^@^@^F^@^A^

The unix utility file will attempt to determine the file type, for instance

% file README.txt
ascii text

% file /usr/local/bin/less
ELF 32-bit MSB dynamic executable MIPS - version 1

% file data.z
compressed data
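A much cruder check than the file utility can be written in Python: read the first bytes of a file and see whether they are all printable ASCII or common whitespace. This is only a heuristic sketch; the function names and the 1024-byte sample size are assumptions made for the example, and it does not recognise specific file types the way file does.

```python
# Bytes regarded as "text": printable ASCII plus tab, newline and carriage return.
TEXT_BYTES = set(range(32, 127)) | {9, 10, 13}


def looks_like_ascii_text(data: bytes) -> bool:
    """Return True if every byte in the sample is printable ASCII or whitespace."""
    return all(b in TEXT_BYTES for b in data)


def file_looks_like_ascii_text(path: str, sample_size: int = 1024) -> bool:
    """Apply the check to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return looks_like_ascii_text(handle.read(sample_size))


if __name__ == "__main__":
    print(looks_like_ascii_text(b">LISOD\ncgttatttaa\n"))   # FASTA-like text
    print(looks_like_ascii_text(b"\x7fELF\x01\x02\x01\x00"))  # start of an ELF binary
```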
6 Sequence formats

Examples

1. EMBL

ID   LISOD standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
7

2. Fasta

>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg

3. GCG

lisod.seq  Length: 756  October 27, :17  Type: N  Check:

       1  cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc
      51  gccttacaat gtaatttctt ttcacataaa taataaacaa tccgaggagg
     101  aatttttaat gacttacgaa ttaccaaaat taccttatac ttatgatgct
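Because a FASTA record is just a ">" header line followed by sequence lines, a minimal reader is short. The sketch below is an illustration only (the function name `parse_fasta` is made up for this example); it strips the spaces used to group bases in blocks of ten, as in the example above, and does no validation of the sequence characters.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Remove internal whitespace (e.g. blocks of ten separated by spaces).
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = """>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
"""
    for name, seq in parse_fasta(example):
        print(name.split()[0], len(seq))
```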
8 Multiple sequence format of GCG

1.msf  MSF: 44  Type: P  October 24, :41  Check:

 Name: ftsy_bucai  Len: 44  Check: 4221  Weight: 1.00
 Name: ftsy_ecoli  Len: 44  Check: 2326  Weight: 1.00
 Name: ftsy_aquae  Len: 44  Check: 2339  Weight: 1.00
 Name: ftsy_bacsu  Len: 44  Check: 6177  Weight: 1.00
 Name: sr54_aciam  Len: 44  Check: 6296  Weight: 1.00
 Name: sr54_aerpe  Len: 44  Check: 7291  Weight: 1.00
 Name: sr54_arcfu  Len: 44  Check:  122  Weight: 1.00
 Name: sr54_aquae  Len: 44  Check:  345  Weight: 1.00

//

            1                                               44
ftsy_bucai  KNS.EKLYFL LKRKMFNILK KVEIP...LE ISSHSPFVIL VVGV
ftsy_ecoli  RDA.EALYGL LKEEMGEILA KVDEP...LN VEGKAPFVIL MVGV
ftsy_aquae  KEG.EKIKEL LKKELKELLK NCQ...GELK IPEKVGAVLL FVGV
ftsy_bacsu  QDP.KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam  PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe  PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu  LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK....IM LVGL
sr54_aquae  PKNLSPAEFV IKTVYEELVD ILGGEK.... .ADLKKGTVL FVGL
9 FASTA multiple sequence format

>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
>ftsy_aquae, 44 bases, 27571BEE checksum.
KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
>ftsy_bacsu, 44 bases, FA023A4F checksum.
QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
>sr54_aciam, 44 bases, 2FC13632 checksum.
PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
>sr54_aerpe, 44 bases, 37AFB895 checksum.
PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
>sr54_arcfu, 44 bases, checksum.
LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
>sr54_aquae, 44 bases, A2 checksum.
PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
10 CLUSTAL multiple sequence format

CLUSTAL W (1.81) multiple sequence alignment

ftsy_bucai      KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
ftsy_ecoli      RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
ftsy_aquae      KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
ftsy_bacsu      QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
sr54_aciam      PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
sr54_aerpe      PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
sr54_arcfu      LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
sr54_aquae      PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
                .:. :: ::.**:
11 Phylip multiple sequence format

 8 44
ftsy_bucai KNS-EKLYFL LKRKMFNILK KVEIP---LE ISSHSPFVIL VVGV
ftsy_ecoli RDA-EALYGL LKEEMGEILA KVDEP---LN VEGKAPFVIL MVGV
ftsy_aquae KEG-EKIKEL LKKELKELLK NCQ---GELK IPEKVGAVLL FVGV
ftsy_bacsu QDP-KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK----IM LVGL
sr54_aquae PKNLSPAEFV IKTVYEELVD ILGGEK---- -ADLKKGTVL FVGL
12 Sequence editors SeqLab / GCG
13 Genedoc
14 Conversion of sequence formats - Readseq

(One useful version of readseq is part of the SAM package.)

Readseq can convert between the following formats:

 1. IG/Stanford          10. Olsen (in-only)
 2. GenBank/GB           11. Phylip3.2
 3. NBRF                 12. Phylip
 4. EMBL                 13. Plain/Raw
 5. GCG                  14. PIR/CODATA
 6. DNAStrider           15. MSF
 7. Fitch                16. ASN.1
 8. Pearson/Fasta        17. PAUP/NEXUS
 9. Zuker (in-only)      18. Pretty (out-only)
15 GCG package utilities:

                  Convert from    to
* Reformat        text            GCG
* FromEMBL        EMBL            GCG
* FromGenBank     Genbank         GCG
* FromFasta       Fasta           GCG
* ToFastA         GCG             Fasta

Clustal:

clustalw my_alignment -convert -output=gcg
clustalw my_alignment -convert -output=phylip
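To give a feel for what such a conversion involves, here is a rough Python sketch that writes already-aligned sequences in a Phylip-like layout: a header line with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters and the residues in blocks of 10. The function name `to_phylip` is invented for this example, and the exact whitespace rules differ between Phylip variants, so for real work a tool such as readseq or clustalw is the safer choice.

```python
def to_phylip(records, block=10):
    """Format (name, aligned_sequence) pairs as Phylip-style text.

    Assumes all sequences are already aligned to the same length.
    Names are truncated or padded to 10 characters.
    """
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must all have the same aligned length")

    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        blocks = " ".join(seq[i:i + block] for i in range(0, length, block))
        lines.append(f"{name[:10]:<10} {blocks}")
    return "\n".join(lines)


if __name__ == "__main__":
    alignment = [
        ("ftsy_bucai", "KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV"),
        ("ftsy_ecoli", "RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV"),
    ]
    print(to_phylip(alignment))
```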
16 Downloading bioinformatics data and software for local use
17
18 Downloading bioinformatics data and software for local use

Data and software downloaded from the net are typically compressed in one way or another. The most common formats are *.Z and *.gz. These are binary files that cannot be read with normal editors.

To uncompress a *.Z file:

% uncompress [file]

To uncompress a *.gz file:

% gunzip [file]

A set of files is often downloaded as a compressed tar archive. For instance, if you download a file called dna.tar.gz you can do

% gunzip dna.tar.gz

This will yield a file called dna.tar. Then do

% tar -xvf dna.tar

This will unpack all files contained in dna.tar.
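The same two steps can also be scripted with the Python standard library, which is handy when many archives have to be unpacked. The sketch below mirrors gunzip followed by tar -xvf; the helper names `gunzip` and `untar` are just labels chosen for this example. Where the installed Python provides it, the "data" extraction filter is used as a precaution against unsafe paths inside an archive.

```python
import gzip
import shutil
import tarfile


def gunzip(src_path, dst_path):
    """Decompress a .gz file, like 'gunzip' (but keeping the original file)."""
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def untar(tar_path, dest_dir):
    """Unpack an uncompressed tar archive, like 'tar -xvf'; returns member names."""
    with tarfile.open(tar_path, "r:") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse absolute paths, links outside dest_dir, etc.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


if __name__ == "__main__":
    import os
    if os.path.exists("dna.tar.gz"):
        gunzip("dna.tar.gz", "dna.tar")
        for name in untar("dna.tar", "."):
            print(name)
```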
19 Alignments and database searches

Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?

* Pattern search - PROSITE
* Profile search - Pfam
* Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
* Sequence homology - BLAST, FASTA, SSEARCH

Simple example: an unknown human protein is highly homologous to a protein with known function from another organism
=> The human protein has the same function (it's an ortholog or a paralog)
20 BLAST

1. Heuristic step where short word matches are identified
2. Extending matches using a dynamic programming method, as for local pairwise alignments (substitution matrix, gap penalties)

- The nature of protein sequence and structure evolution - what is the sensitivity of BLAST searches?
- General principles of database searches
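The first, heuristic step can be illustrated with a toy word-match finder. This is a deliberately simplified sketch, not BLAST's actual procedure: it only finds exact words shared by query and subject, whereas protein BLAST also seeds on similar words that score above a threshold with a substitution matrix. The function name `word_hits` and the default word length of 3 are choices made for this example.

```python
from collections import defaultdict


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the two sequences share an exact k-mer."""
    # Index every word of length k in the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    # Look up each query word in the index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    for i, j in word_hits("MAKLEKLNQAGLMVAG", "MVRLEKINQAGLLVAG"):
        # Hits with the same value of (j - i) lie on the same diagonal and are
        # natural starting points for the extension step.
        print(i, j, "diagonal", j - i)
```

Hits that fall on the same diagonal are what the second step would then try to extend into a local alignment.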
21 Evolution of protein genes

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G

ATGGCTAGGTTGGAGAAGAUAAACCAAGCTGGGATAATAGTTGCAGGA    60%
M V R L E K I N Q A G L L V A G    69%
M V R I Q K I N E K G A L L A G    38%
Q V R I Q K I Y E K G A L L A A    19% (twilight zone)
Q V R I Q K I Y E K T A L L F A     6% (midnight zone)
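Percent identity for two aligned sequences of equal length is simply the fraction of positions that carry the same residue. The sketch below applies this to the first protein on the slide and the three most diverged sequences; the function name `percent_identity` is made up for the example, and conventions for handling gaps (ignored here, since these sequences have none) vary between programs.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    reference = "MAKLEKLNQAGLMVAG"
    diverged = [
        "MVRIQKINEKGALLAG",
        "QVRIQKIYEKGALLAA",
        "QVRIQKIYEKTALLFA",
    ]
    for seq in diverged:
        print(seq, round(percent_identity(reference, seq)))
```

For these three sequences the rounded values come out as 38, 19 and 6, the last two falling in what the slide calls the twilight and midnight zones.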
22 Blast report

Sequences producing significant alignments: (bits) Value

pir F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdc) e-129
gb AAD (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase e-060
sp P39383 YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I e-046
emb CAA (X98916) orf6 [Methanopyrus kandleri] 170 1e-041
gb AAF AF156260_1 (AF156260) unknown [Methanosarcina bark e-033
pir A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact e-030
pir A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela e-029
gb AAC (U75363) benzoyl-coa reductase subunit [Rhodopseu e-025
pir S04476 hypothetical protein (hdga 5' region) - Acidaminococ e-021
sp P27542 DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
gb AAC (AF016711) heat shock protein 70 [Burkholderia ps
pir F75029 o-sialoglycoprotein endopeptidase (gcp) PAB Py
pir F72514 probable glucokinase APE Aeropyrum pernix (str
sp P42373 DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
emb CAA (AJ012470) mitochondrial-type hsp70 [Encephalito
sp P56836 DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
gb AAF (AE002336) dnak protein [Chlamydia muridarum]
pir B70189 rod shape-determining protein (mreb-1) homolog - Lym
sp O57716 GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (
sp O33522 DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
ref NP_ Ykl050cp >gi sp P35736 YKF0_YEAST HYPOTH
emb CAA (X75781) D513 [Saccharomyces cerevisiae] >gi
sp P30722 DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi
pir A40158 dnak-type molecular chaperone - Chlamydia trachomati
gb AAF AE001584_39 (AE001584) hypothetical protein [Borre
gb AAF AE001577_35 (AE001577) hypothetical protein [Borre
gb AAF (AE002276) cell shape-determining protein MreB [C
gb AAG AE004889_10 (AE004889) DnaK protein [Pseudomonas a
dbj BAB (AB017035) dnak [Bacillus thermoglucosidasius]
sp P43736 DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
sp P45554 DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
sp Q58303 FLA3_METJA FLAGELLIN B3 PRECURSOR
gb AAG AE004898_10 (AE004898) phosphoribosylaminoimidazol
23 Conclusions / comments

- If two proteins have significant sequence homology it is highly likely that the two proteins have the same 3D structure (and same function).
- If two proteins have the same 3D structure it does not necessarily mean that the sequences are related.
- Very remote evolutionary relationships are difficult or impossible to detect with normal BLAST / pairwise alignment.
- Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level.
24 More rules of database searches

- Compare sequences as proteins and not as DNA*
- Use the smallest possible database (not too small though)
- Sequence statistics should be used rather than percent identity/similarity as criterion for homology
- Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S

2) Amino acid substitution matrices may be taken into account.
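The example on this slide is easy to check in Python: translate both DNA sequences with the standard genetic code and count the identical bases. The code below is a small sketch written for this purpose (the function names are made up, and only the standard code with reading frame 1 is considered).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1; any trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


if __name__ == "__main__":
    dna1 = "TTTCGATTCTCAACAAGAAGC"
    dna2 = "TTCAGGTTTAGCACGCGGTCC"
    print(translate(dna1), translate(dna2))
    print(identical_positions(dna1, dna2), "of", len(dna1), "bases identical")
```

Both sequences translate to the same seven residues although fewer than half of the bases are identical, which is the point the slide makes about searching at the protein level.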
25 PAM matrices

The starting point of PAM matrices is closely related orthologs; full sequences are taken into account. In the calculation of the PAM matrix, extrapolation is done to more distant relationships.

Neurospora_crassa          GSVDGYAYTD ANKQKGITWD ENTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Stellaria_longipes         GSVEGFSYTD ANKAKGIEWN KDTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Thermomyces_lanuginosus    GSVEGYSYTD ANKQAGITWN EDTLFEYLEN PKKFIPGTKM AFGGLKKNKD
Arabidopsis_thaliana       GSVAGYSYTD ANKQKGIEWK DDTLFEYLEN PKKYIPGTKM AFGGLKKPKD
Aspergillus_niger          GQSEGYAYTD ANKQAGVTWD ENTLFSYLEN PKKFIPGTKM AFGGLKKGKE
Debaryomyces_occidentalis  GQAAGYSYTD ANKKKGVEWT EQTMSDYLEN PKKYIPGTKM AFGGLKKPKD
Schizosaccharomyces_pombe  GQAEGFSYTE ANRDKGITWD EETLFAYLEN PKKYIPGTKM AFAGFKKPAD
Fagopyrum_esculentum       GTTAGYSYSA ANKNKAVTWG EDTLYEYLLN PKKYIPGTKM VFPGLKKPQE
Sesamum_indicum            GTTPGYSYSA ANKNMAVIWG ENTLYDYLLN PKKYIPGTKM VFPGLKKPQE
Haematobia_irritans        GQAAGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Lucilia_cuprina            GQAPGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Ceratitis_capitata         GQAAGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Sarcophaga_peregrina       GQAPGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Manduca_sexta              GQAPGFSYSD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Samia_cynthia              GQAPGFSYSN ANKAKGITWG DDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Schistocerca_gregaria      GQAPGFSYTD ANKSKGITWD ENTLFIYLEN PKKYIPGTKM VFAGLKKPEE
Apis_mellifera             GQAPGYSYTD ANKGKGITWN KETLFEYLEN PKKYIPGTKM VFAGLKKPQE
Macaca_mulatta             GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Pan_troglodytes            GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Anas_platyrhynchos         GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
Aptenodytes_patagonicus    GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
26 BLOSUM matrices

The starting point is conserved elements in protein families. The protein sequences are more distantly related than the sequences used for PAM matrices.

Example of entry from BLOCKS DATABASE:

Block BL00134A
ID   TRYPSIN_HIS; BLOCK
AC   BL00134A; distance from previous block=(23,4353)
DE   Serine proteases, trypsin family, histidine proteins.
BL   CHC; width=17; seqs=259; 99.5%=867; strength=1439
CTR2_VESCR  P00769 (  24) FCGGSISKRYVLTAAHC  63
CTRL_HALRU  P35003 (  53) CGCVLYTTSKALTAAHC  80
EAST_DROME  P13582 ( 158) CGGSLISTRYVITASHC  18
GILX_HELHO  P43685 (  26) WSGVLLNRDWILTAAHC  45
NDL_DROME   P98159 (1170) CGGTIYSDRWIISAAHC  10
PCE_TACTR   P21902 ( 157) CGGALVTNRHVITASHC  25
TRYP_ASTFL  P00765 (  30) CGASIYNENYAITAGHC  36
TRYU_DROME  P42279 (  59) CGGCILDAVTIATAAHC  31
ACRO_HUMAN  P10323 (  73) CGGSLLNSRWVLTAAHC   4
ACRO_MOUSE  P23578 (  74) CGGSLLNSHWVLTAAHC   5
ACRO_PIG    P08001 (  71) CGGILLNSHWVLTAAHC   7
ACRO_RABIT  P48038 (  71) CGGVLLNAHWVLTAAHC   6
ACRO_RAT    P29293 (  74) CGGSLLNSHWVLTAAHC   5
ANC1_AGKRH  P26324 (  28) CGGVLIHPEWVITAEHC  25
ANC2_AGKRH  P47797 (  52) CGGVLIHPEWVITAKHC  16
ANCR_AGKCO  P09872 (  25) CGGTLINQEWVLTARHC  25
BATX_BOTAT  P04971 (  50) CGMTLINQEWVLTAAHC  49
27
28 Variants of BLAST / FASTA

            DNA/DNA     P/P         DNA/P            P/DNA                     DNA/DNA (translated)
blastall    -p blastn   -p blastp   -p blastx        -p tblastn                -p tblastx
fasta       fasta       fasta       fastx, fasty*    tfasta, tfastx, tfasty

* Compare a DNA sequence to a protein sequence database, by comparing the translated DNA sequence in three frames and allowing gaps and frameshifts. fastx3 uses a simpler, faster algorithm for alignments that allows frameshifts only between codons; fasty3 is slower but produces better alignments with poor quality sequences because frameshifts are allowed within codons.
Global and Discovery Proteomics Lecture Agenda
Global and Discovery Proteomics Christine A. Jelinek, Ph.D. Johns Hopkins University School of Medicine Department of Pharmacology and Molecular Sciences Middle Atlantic Mass Spectrometry Laboratory Global
Translation Study Guide
Translation Study Guide This study guide is a written version of the material you have seen presented in the replication unit. In translation, the cell uses the genetic information contained in mrna to
Unipro UGENE User Manual Version 1.12.3
Unipro UGENE User Manual Version 1.12.3 April 01, 2014 Contents 1 About Unipro................................... 10 1.1 Contacts.......................................... 10 2 About UGENE..................................
Bioinformática BLAST. Blast information guide. Buscas de sequências semelhantes. Search for Homologies BLAST
BLAST Bioinformática Search for Homologies BLAST BLAST - Basic Local Alignment Search Tool http://blastncbinlmnihgov/blastcgi 1 2 Blast information guide Buscas de sequências semelhantes http://blastncbinlmnihgov/blastcgi?cmd=web&page_type=blastdocs
Core Bioinformatics. Degree Type Year Semester
Core Bioinformatics 2015/2016 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected] Teachers Use of
Design Style of BLAST and FASTA and Their Importance in Human Genome.
Design Style of BLAST and FASTA and Their Importance in Human Genome. Saba Khalid 1 and Najam-ul-haq 2 SZABIST Karachi, Pakistan Abstract: This subjected study will discuss the concept of BLAST and FASTA.BLAST
BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS
BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title: Bioinformatics
BUDAPEST: Bioinformatics Utility for Data Analysis of Proteomics using ESTs
BUDAPEST: Bioinformatics Utility for Data Analysis of Proteomics using ESTs Richard J. Edwards 2008. Contents 1. Introduction... 2 1.1. Version...2 1.2. Using this Manual...2 1.3. Why use BUDAPEST?...2
Structure and Function of DNA
Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four
BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16
Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
BioEdit version 7.0.0
BioEdit version 7.0.0 This is the current help file for BioEdit version 5.0.6. Copyright 1997-2004 Tom Hall Ibis Therapeutics, a division of Isis Pharmaceuticals, Inc. This is likely to be the final release
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
Databases indexation
Databases indexation Laurent Falquet, Basel October, 2006 Swiss Institute of Bioinformatics Swiss EMBnet node Overview Data access concept sequential direct Indexing EMBOSS Fetch Other BLAST Why indexing?
Sequencing the Human Genome
Revised and Updated Edvo-Kit #339 Sequencing the Human Genome 339 Experiment Objective: In this experiment, students will read DNA sequences obtained from automated DNA sequencing techniques. The data
T cell Epitope Prediction
Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments
Next generation sequencing (NGS)
Next generation sequencing (NGS) Vijayachitra Modhukur BIIT [email protected] 1 Bioinformatics course 11/13/12 Sequencing 2 Bioinformatics course 11/13/12 Microarrays vs NGS Sequences do not need to be known
