Databases and mapping BWA Samtools
File formats
FASTQ, SFF, bax.h5, ACE, FASTG, FASTA, BAM/SAM, GFF, BED, GenBank/EMBL/DDBJ, and many more
FASTQ Output format from Illumina and IonTorrent sequencers. Quality scores:
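As a minimal sketch of how FASTQ quality scores are read, the snippet below assumes Sanger (Phred+33) encoding, where each quality character stores the score Q as the ASCII code Q + 33 and Q = -10 * log10(P) for error probability P. The function names and the example quality string are made up for illustration.

```python
def phred33_to_scores(quality_string):
    """Convert a Sanger-encoded (Phred+33) quality string to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
scores = phred33_to_scores("II5!")
print(scores)  # [40, 40, 20, 0]
print([error_probability(q) for q in scores])
```

Older Illumina pipelines used a different offset (Phred+64), so the offset should be checked before decoding real data.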
FASTQ headers (Casava 1.8, qualities Sanger encoded)
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139 - the unique instrument name
136 - the run id
FC706VJ - the flowcell id
2 - flowcell lane
2104 - tile number within the flowcell lane
15343 - 'x'-coordinate of the cluster within the tile
197393 - 'y'-coordinate of the cluster within the tile
1 - the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y - Y if the read fails filter (read is bad), N otherwise
18 - 0 when none of the control bits are on, otherwise it is an even number
ATCACG - index sequence
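The header layout above can be split into named fields with a few lines of Python. This is only a sketch: the `CasavaHeader` tuple and its field names are invented here, and the parser assumes exactly the Casava 1.8 layout shown on this slide.

```python
from collections import namedtuple

# Field names are illustrative labels, not part of any official API.
CasavaHeader = namedtuple(
    "CasavaHeader",
    "instrument run_id flowcell_id lane tile x y pair_member filtered control_bits index",
)


def parse_casava_header(line):
    """Split a Casava 1.8 FASTQ header into its named fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    # The read identifier and the description are separated by one space.
    read_id, description = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = read_id.split(":")
    pair_member, filtered, control_bits, index = description.split(":")
    return CasavaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile), int(x), int(y),
        int(pair_member), filtered == "Y", int(control_bits), index,
    )


header = parse_casava_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(header.flowcell_id, header.lane, header.filtered)  # FC706VJ 2 True
```

Here `filtered` is True when the read failed the filter (Y), so such reads can be dropped before mapping.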
SFF Standard Flowgram Format - binary format used to encode results from 454 sequencers - can be converted to fasta/fastq (sff2fastq tool)
PacBio files
.bax.h5 - The .bax.h5 files contain the sequence data.
.bas.h5 - The .bas.h5 file now contains only the information necessary to dereference the ZMW-level data by hole number.
There are currently several different combinations of polymerases (P1-5) and chemistries (C1-3) used by PacBio. They differ in output files and error profile, so it is good to know which combination generated your data.
ACE Stores complete data about genomic contigs. All assemblers can be run with this or similar file output. Recommended for your final assembly! You can have a look at broken pairs of reads, browse differences in sequencing coverage,...
FASTG
A format for faithfully representing genome assemblies in the face of allelic polymorphism and assembly uncertainty. The G stands for graph.
http://fastg.sourceforge.net
#FASTG:begin;
#FASTG:version=1.0:assembly_name="tiny example";
>chr1:chr1;
ACGANNNNN[5:gap:size=(5,4..6)]CAGGC[1:alt:allele C,G]TATACG
>chr2;4
ACATACGCATATATATATATATATATAT[20:tandem:size=(10,8..12) AT]TCA
GGCA[1:alt A,T,TT]GGAC
#FASTG:end;
FASTA
Be consistent when naming your FASTA files! Avoid special characters and spaces in headers.
Common extensions: .fa, .fas, .fasta, .fna, .faa
>sequence_name
GGAGGGGACGACGTCAAGTCATCATGGCCTTTATGGGTGGGGCTTCACACGTCATACAATGGTTGGAGCA
AAGGGTCGCCAACTCGAGAGAGGGAGCTAATCCCACAAACCCAGCCCCAGTTCGGATTGGAGTCTGCAAC
TCGACTCCATGAAGTAGGAATCGCTAGTAATCGTGGATCAGCATGCCACGGTGAATACGTTCCCGGGTCT
TGTACACACCGCCCGTCACACCATGGAAGTAGGCCGCATCCGAAGCAGCCTCCCTAACCCTATTGCTGGG
AAGGAGGCTGCGAAGGTGGGGTCTATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTACCGGAAGGTGCG
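A small Python sketch of reading FASTA records and checking the header advice above. The set of "safe" characters (letters, digits, underscore, hyphen, dot) is an assumption chosen for illustration, not an official rule, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def header_is_safe(header):
    """True if the header uses only letters, digits, '_', '-' and '.'."""
    return bool(header) and all(ch.isalnum() or ch in "_-." for ch in header)


example = [">seq_1", "ACGT", "GGCC", ">bad name", "TT"]
for name, sequence in read_fasta(example):
    print(name, len(sequence), header_is_safe(name))
```

To check a real file, pass an open file handle instead of the example list.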
BAM/SAM
The SAM format is a text format for storing sequence data in a series of tab-delimited ASCII columns. Most often it is generated as a human-readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. Both are output by aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. A SAM file contains a header section and an alignment section.
.bam.bai files are indexes of sorted BAM files that give quick access to the data. (.sai files are intermediate outputs of bwa aln, not indexes of SAM files.)
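To make the tab-delimited layout concrete, here is a sketch that splits one SAM line into the 11 mandatory columns defined by the SAM specification. The function name and the example read are hypothetical.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Return a dict of the 11 mandatory SAM fields, or None for header lines."""
    if line.startswith("@"):
        return None  # header section lines start with '@'
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Made-up alignment line for illustration.
line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
print(parse_sam_line(line)["pos"])  # 100
```

For real work a dedicated library or samtools itself is the safer choice; this only shows what the columns are.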
GFF/GTF
General Feature Format, currently GFF3, is used for describing genes and other features of DNA, RNA, and protein sequences. The GTF (General Transfer Format) is a refinement of GFF version 2 and is sometimes referred to as GFF2.5.
http://www.sequenceontology.org/gff3.shtml
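A GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below parses one such line; the function name and the example gene are invented for illustration, and attribute values are not URL-decoded.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("GFF3 feature lines have exactly 9 tab-separated columns")
    feature = dict(zip(GFF3_COLUMNS, columns))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 is a ';'-separated list of tag=value pairs.
    attributes = {}
    for pair in feature["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    feature["attributes"] = attributes
    return feature


# Hypothetical feature line.
gene = parse_gff3_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo")
print(gene["type"], gene["attributes"]["ID"])  # gene gene0001
```

GFF coordinates are 1-based and inclusive, so the length of this feature is end - start + 1.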
BED BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used. http://genome.ucsc.edu/faq/faqformat.html#format1
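The field rules above can be checked mechanically. The sketch below uses the twelve BED field names from the UCSC format description and enforces a consistent field count across lines; the function name and example features are hypothetical, and track or browser lines are not handled.

```python
BED_FIELDS = ("chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb",
              "blockCount", "blockSizes", "blockStarts")


def parse_bed_lines(lines):
    """Parse BED data lines, checking that every line has the same field count."""
    records, expected = [], None
    for line in lines:
        columns = line.split()
        if not columns:
            continue  # skip blank lines
        if not 3 <= len(columns) <= 12:
            raise ValueError("BED lines have 3 to 12 fields")
        if expected is None:
            expected = len(columns)
        elif len(columns) != expected:
            raise ValueError("inconsistent number of fields")
        record = dict(zip(BED_FIELDS, columns))
        record["chromStart"] = int(record["chromStart"])
        record["chromEnd"] = int(record["chromEnd"])
        records.append(record)
    return records


# Made-up features; chromStart is 0-based and chromEnd is exclusive,
# so the feature length is simply chromEnd - chromStart.
for bed in parse_bed_lines(["chr1 100 200 featA", "chr1 300 450 featB"]):
    print(bed["name"], bed["chromEnd"] - bed["chromStart"])
```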
GenBank/EMBL/DDBJ
http://www.ncbi.nlm.nih.gov/nuccore/11466244?report=genbank
The GenBank sequence format is a rich format for storing sequences and associated annotations. It shares a feature table vocabulary and format with the EMBL and DDBJ formats.
FEATURES: source, gene, CDS
Software needed for this lecture
A directory with all programs will be distributed on a USB flash drive. Add the folder with all binaries (executable files) to your PATH or run the individual programs locally. We will use Blast+, Hmmer, Bowtie/BWA and Samtools today. These programs are used on a daily basis by almost every bioinformatician dealing with genomic data, and they can easily be run on a laptop.
BWA and Samtools need to be compiled from source. If the compilation fails (e.g. missing zlib for samtools), please let me know. Type:
cd bwa-0.7.8; make; cd ..
cd samtools-0.1.19; make
Sequence databases
NR - non-redundant proteins from GenBank CDS translations, RefSeq Proteins, PDB, SwissProt, PIR and PRF; produced by NCBI.
RefSeq - NCBI reference sequence collection, a set of taxonomically diverse, non-redundant and richly annotated sequences.
UniProtKB - comprehensive resource for protein sequence and annotation data, produced by the Universal Protein Resource consortium.
Pfamseq - the underlying sequence database that Pfam is built upon. As there should be no overlaps between Pfam domains, this provides a stable sequence database for investigating domains and domain architectures.
Swiss-Prot - manually reviewed, high-quality protein sequences and functional annotation; produced by UniProt.
PDB - sequences with an experimentally determined structure.
Databases of metabolic pathways & enzyme nomenclature
KEGG http://www.genome.jp/kegg/
BioCyc http://biocyc.org/
ExPASy Enzyme http://enzyme.expasy.org/
BRENDA http://www.brenda-enzymes.info/
Keep in mind that an identical enzymatic reaction can be carried out by enzymes coded by completely different genes.
Enzyme Commission number (EC number) is a numerical classification for enzymes, based on the chemical reactions they catalyze.
How to make custom blast databases
makeblastdb -in fastafile -dbtype nucl/prot
(-dbtype is either nucl or prot)
update_blastdb.pl - Perl script bundled with the blast programs; allows downloading and updating blast databases such as nr/nt/refseq/swissprot from NCBI
ftp://ftp.ncbi.nlm.nih.gov/blast/db/
BLAST+ programs
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
blastn - Search a nucleotide database using a nucleotide query
blastx - Search protein database using a translated nucleotide query
blastp/psiblast - Search protein database using a protein query
tblastn - Search translated nucleotide database using a protein query
tblastx - Search translated nucleotide database using a translated nucleotide query
blastx -help - prints help for the particular blast program
Using blast
blastp -db databasename -query yourfastafile > seq_vs_database.blastp
Useful blast arguments:
-h (prints help for a particular blast program)
-outfmt
-num_descriptions
-num_alignments
-evalue
-num_threads
Blast exercise
Try to find immune proteins in the recently published tsetse fly genome by searching it with Drosophila melanogaster immunity proteins as queries. Use several different e-value cut-offs (1, 1e-3, 1e-6, 1e-8, ...) and output formats.
makeblastdb -in Glossina_morsitans.faa -dbtype prot
blastp -db Glossina_morsitans.faa -query Drosophila_melanogaster_imunity.faa -num_threads 4 > Dmel_immunity_vs_Gmors.blastp
Blast output formats
*** Formatting options
-outfmt <String> alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers.
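Tabular output (-outfmt 6 or 7) is easy to post-process. The sketch below assumes the default twelve columns of the tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and filters hits by e-value; the function name and the two example hits are made up.

```python
OUTFMT6_COLUMNS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                   "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def filter_hits(lines, max_evalue):
    """Yield hits (as dicts) from -outfmt 6/7 output with evalue <= max_evalue."""
    for line in lines:
        # Lines starting with '#' are the comment lines written by -outfmt 7.
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, line.rstrip("\n").split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Hypothetical hits, not real BLAST results.
example = [
    "# BLASTP comment line",
    "queryA\ttargetX\t85.0\t200\t30\t0\t1\t200\t5\t204\t2e-50\t180",
    "queryA\ttargetY\t30.0\t50\t35\t1\t10\t60\t1\t50\t0.5\t25.4",
]
for hit in filter_hits(example, max_evalue=1e-6):
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

If a custom column list is given to -outfmt, the column tuple has to be changed to match.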
HMMER
HMMER is mainly used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs). Compared to BLAST and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST.
Webserver: http://hmmer.janelia.org/search/phmmer
User's guide (PDF, 116 pages): ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/userguide.pdf
Standalone HMMER3
http://hmmer.janelia.org/software
1) Build models and align sequences (DNA or protein)
hmmbuild - Build a profile HMM from an input multiple alignment.
hmmalign - Make a multiple alignment of many sequences to a common profile HMM.
Individual hmmer programs
Search protein queries against protein database
phmmer - Search a single protein sequence against a protein sequence database. (BLASTP-like)
jackhmmer - Iteratively search a protein sequence against a protein sequence database. (PSIBLAST-like)
hmmsearch - Search a protein profile HMM against a protein sequence database.
hmmscan - Search a protein sequence against a protein profile HMM database.
Search DNA queries against DNA database
nhmmer - Search a DNA sequence, alignment, or profile HMM against a DNA sequence database. (BLASTN-like)
nhmmscan - Search a DNA sequence against a DNA profile HMM database.
Searching a protein sequence database with a single protein profile HMM
The subdirectory /tutorial in the HMMER distribution contains the files used in the tutorial, as well as a number of examples of various file formats that HMMER reads.
hmmbuild globins4.hmm tutorial/globins4.sto
hmmsearch globins4.hmm tutorial/globins45.fa > globins4.out
phmmer tutorial/hbb_human tutorial/globins45.fa
jackhmmer tutorial/hbb_human tutorial/globins45.fa
Searching a profile HMM database with a query sequence
hmmbuild globins4.hmm tutorial/globins4.sto
hmmbuild fn3.hmm tutorial/fn3.sto
hmmbuild Pkinase.hmm tutorial/pkinase.sto
cat globins4.hmm fn3.hmm Pkinase.hmm > minifam
hmmpress minifam
hmmscan minifam tutorial/7less_drome
Target profile HMM databases
Gene3D - a collection of models that are based on CATH structural protein domains.
Pfam - a large comprehensive collection of protein families.
Superfamily - a collection of models, which represent structural protein domains at the SCOP superfamily level.
TIGRFAMS - models that are designed for automated sequence annotation and that are aimed at matching the full length (or near) of the sequence.
Mapping high-throughput sequencing data
- Blat, Bowtie, BWA, MAQ, TopHat, Mummer, ...
- there are over a hundred tools available
- Illumina, 454, IonT, Sanger, GridION/MinION, PacBio, ...
- some of them are extremely fast and some of them are accurate
- different mappers give different results!
- the two most cited short-read aligners are Bowtie and BWA
http://en.wikipedia.org/wiki/list_of_sequence_alignment_software#short-Read_Sequence_Alignment
http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Burrows-Wheeler algorithm (transform)
http://www.homolog.us/animation/bwt-b.html
Compression techniques work by finding repeated patterns in the data and encoding the duplications more compactly. The Burrows-Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data.
Working with short-read aligners
- create an index for a set of FASTA files obtained from any source
- align your reads
- analyze SAM and BAM alignment files (SAMtools)
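The Burrows-Wheeler transform and its inverse can be sketched in a few lines of Python. This is the naive textbook version built from sorted rotations, assuming a `$` end marker that does not occur in the text; it is meant only to show the idea, and it is not how BWA or Bowtie build their indexes, which rely on far more memory-efficient constructions.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler transform via sorted rotations (simple but slow)."""
    if end_marker in text:
        raise ValueError("text must not contain the end marker")
    s = text + end_marker
    # Sort all rotations of the string and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, end_marker="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row that ends with the marker is the original string.
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("banana")))  # banana
```

Note how the transformed string groups identical characters ("nn", "aa"), which is what makes it compress well, and how the original is recovered from the transformed string alone.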
[Figure: short-read aligners shown: Maq, Eland, Soap, Bowtie, BWA, Soap2]
BWA (Burrows-Wheeler Aligner)
http://sourceforge.net/projects/bio-bwa/files/
BWA-MEM: for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences
BWA-backtrack: for short sequences
BWA-SW: may have better sensitivity when alignment gaps are frequent
For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). Alignment algorithms are invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for the BWA-MEM algorithm.
bwa index ref.fa
bwa mem ref.fa reads_f.fastq reads_r.fastq > aln-pe.sam
Using BWA and Bowtie2
bwa index lambda_virus.fa
bwa mem lambda_virus.fa reads_f.fastq reads_r.fastq > bwa-mem_pe.sam
bowtie2-build lambda_virus.fa lambda_virus
bowtie2 -x lambda_virus -1 reads_f.fastq -2 reads_r.fastq > bowtie2_pe.sam
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SAMTOOLS
http://sourceforge.net/projects/samtools/files/
A BAM file is just a SAM file stored in binary.
import: SAM-to-BAM conversion
view: BAM-to-SAM conversion and subalignment retrieval
sort: sorting alignment
merge: merging multiple sorted alignments
index: indexing sorted alignment
faidx: FASTA indexing and subsequence retrieval
tview: text alignment viewer
pileup: generating position-based output and consensus/indel calling
Using SAMTOOLS
Convert and sort (-S tells samtools 0.1.19 that the input is SAM; sort takes an output prefix and appends .bam):
samtools view -bS bowtie2_pe.sam > bowtie2-pe.bam
samtools sort bowtie2-pe.bam bowtie2-pe.sorted
Create a BAM index file:
samtools index bowtie2-pe.sorted.bam bowtie2-pe.sorted.bam.bai
Try it with both bowtie2 and bwa-mem SAM files. Aligned reads (.sorted.bam file) can be viewed in genome browsers (e.g. Artemis).
Filter out unmapped reads:
samtools view -h -F 4 -b test.bam > test_only_mapped.bam
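The -F 4 filter works on the bitwise FLAG column of each alignment: bit 0x4 means the read is unmapped, and -F drops every read that has any of the given bits set. A small Python sketch of that logic follows; the function name and the example flag values are chosen for illustration.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def passes_exclude_filter(flag, exclude_bits):
    """Mimic 'samtools view -F': keep a read only if none of the bits are set."""
    return (flag & exclude_bits) == 0


# Example flags: 0 and 16 are mapped single reads (forward / reverse strand),
# 99 is a properly paired mapped read, 77 and 4 have the unmapped bit set.
flags = [0, 16, 99, 77, 4]
kept = [flag for flag in flags if passes_exclude_filter(flag, UNMAPPED)]
print(kept)  # [0, 16, 99]
```

The complementary option -f keeps only reads that have all of the given bits set, so -f 4 would select the unmapped reads instead.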