Data formats and file conversions

Transcription

1 Building Excellence in Genomics and Computational Bioscience. Richard Leggett (TGAC), John Walshaw (IFR)

2 Common file formats
Raw sequence: FASTQ, FASTA
Alignments: BAM, SAM, MSF
Databases: EMBL, UniProt, GenBank
Annotation: BED, WIG, VCF, GFF

3 FASTQ files
e.g. Illumina read files.
Four lines per read: read ID (starting with @), sequence, a separator line (+), and quality string.
Stores sequence and quality information.
Read ID, Sequence: 1:N:0:GCCAA ACNATTAACAACCTTGGTGTTCAGCATGAGAACTTATCTGCAGCTGAGTCTCGTATCCGTGACG + 1:N:0:GCCAA CTNGAATGCAGGTAGAATACATCTCCCGGATAAGCCTCGCGGCCCCCGGGGCGGGGGGGGAGAG + 1:N:0:GCCAA GGNAAATACGAAAGATAAGCTACGCAAGAAACGAAGGATTACTGCGAAAGGCTGCGATGCGGCA

4 FASTQ files
Sanger format: quality scores 0-93, encoded as ASCII characters 33-126 (score + 33).
Older versions of Illumina software use slightly different encodings (e.g. an ASCII offset of 64).
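To make the encoding concrete, here is a minimal Python sketch of decoding a quality string. The function name `decode_qualities` and the example string are illustrative assumptions, not part of any toolkit mentioned in these slides; the offset of 33 is the Sanger convention, and 64 is the offset used by some older Illumina pipelines.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer quality scores.

    offset=33 is the Sanger convention; pass offset=64 for data written
    by older Illumina software (an assumption you must check for your data).
    """
    return [ord(char) - offset for char in quality_string]


# Illustrative quality string, not taken from a real read.
print(decode_qualities("!5I~"))  # [0, 20, 40, 93]
```

If decoding with offset 33 gives scores that never drop below about 31, the file may well be in one of the older offset-64 encodings.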

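5 FASTQ files
The Q score relates to the probability, p, that a base is incorrect: $Q = -10 \log_{10} p$
What this means: Q10 is a 1 in 10 chance that the base is wrong, Q20 is 1 in 100, Q30 is 1 in 1000.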
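The relationship between Q and p is the standard Phred definition, Q = -10 log10(p). The short Python sketch below converts in both directions; the function names are hypothetical and chosen only for this illustration.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Quality score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print a small lookup table for commonly quoted thresholds.
for q in (10, 20, 30, 40):
    print(f"Q{q}: p = {phred_to_error_probability(q):g}")
```

Each step of 10 in Q is a tenfold drop in the expected error rate, which is why thresholds such as Q20 and Q30 are commonly quoted.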

6 FASTA files
e.g. assembler contigs.
Stores ID and sequence data only.
Each record has a sequence ID line (starting with >) followed by the sequence.
Sequence data can cover multiple lines.
>contig1
ACNATTAACAACCTTGGTGTTCAGCATGAGAACTTATCTGCAGCTGAGTCTCGTATCCGTGACG
CTGAGTCTCGTATCCGTGACGGTTAGGGCGATTAGCATAGA
>contig2
TGACTAGCGGATTCGATTCGGAGGCTTATGGGCATTCCAGATGCAGCTAGCAGATGACATAGAT
GGGCATT
>contig3
CCCCCCTGACTAGCGGATTCGGTTCAGCATGAGTACGAATTCGGAGGCTTATGGGCATTCCAGA
AGCGTGCAGCTAGCAGATGAAGCGCATAGATGGGCTATTGTTCAGCATGAGCTGATCAACTACG
TACGGGACTGAGATGCCATGCAGTTGG
>contig4
TGACTAGCTAGTGGATTGACGAC
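Because a FASTA sequence may be wrapped over several lines, a reader has to collect lines until the next ID line. The following Python sketch shows one way to do that; `read_fasta` is a hypothetical helper written for this illustration, and in practice an established library parser is usually the safer choice.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # The ID is the first word after '>'; the rest is description.
            seq_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


example = [
    ">contig2",
    "TGACTAGCGGATTCGATTCGGAGGCTTATGGGCATTCCAGATGCAGCTAGCAGATGACATAGAT",
    "GGGCATT",
    ">contig4",
    "TGACTAGCTAGTGGATTGACGAC",
]
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

The same function works on an open file, since a file object also yields one line at a time.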

7 Manipulating FASTA and FASTQ files
Numerous options:
FASTX toolkit: conversion, quality statistics, clipping, renaming, trimming, reverse complement, formatting and more.
NGSUtils: suite of utilities for working with NGS datasets.
EMBOSS sequence analysis package: a mature package which can do a lot.
Many other programs, scripts or collections of scripts are available for common tasks. Google can help find them!
Simple manipulations are possible even with one-line commands in UNIX/Linux shells: see the Introduction to Linux session!

8 FASTQ to FASTA conversion
Using FASTX Toolkit:
$ fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
[-h] = This helpful help screen.
[-r] = Rename sequence identifiers to numbers.
[-n] = keep sequences with unknown (N) nucleotides. Default is to discard such sequences.
[-v] = Verbose - report number of sequences. If [-o] is specified, report will be printed to STDOUT. If [-o] is not specified (and output goes to STDOUT), report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA output file. default is STDOUT.
$ fastq_to_fasta -Q 33 -i file.fastq -o file.fasta
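For comparison, the same conversion can be sketched in a few lines of Python. This is not how FASTX Toolkit is implemented; `read_fastq` and `fastq_to_fasta` are hypothetical helpers for this illustration, and the sketch assumes four-line (not multi-line) FASTQ records. The `keep_n` argument imitates the documented `-n` behaviour: reads containing N are discarded unless it is set.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-read FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # the '+' separator line, not needed here
        quality = handle.readline().rstrip("\r\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def fastq_to_fasta(in_handle, out_handle, keep_n=False):
    """Write FASTA records for each FASTQ read; return the number written."""
    written = 0
    for read_id, sequence, _quality in read_fastq(in_handle):
        if not keep_n and "N" in sequence.upper():
            continue  # mirror the default of discarding reads with unknown bases
        out_handle.write(f">{read_id}\n{sequence}\n")
        written += 1
    return written


# Tiny made-up input: the second read contains an N and is dropped by default.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACNT\n+\nIIII\n")
result = io.StringIO()
print(fastq_to_fasta(demo, result), "sequence(s) written")
print(result.getvalue(), end="")
```

With real files, the two `StringIO` objects would be replaced by `open("file.fastq")` and `open("file.fasta", "w")`.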

9 Interleaving FASTQ files
No one killer app:
shuffleSequences_fastq.pl comes with Velvet in the contrib directory.
Interleave_fastq.py
Example with shuffleSequences:
shuffleSequences_fastq.pl file_r1.fastq file_r2.fastq file_r1r2.fastq
You don't often need to go back, but popgentools has a script called split-interleaved-fastq.pl.
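Interleaving simply alternates one record from the first file with the matching record from the second. The Python sketch below illustrates the idea under two assumptions: records are exactly four lines each, and both files list the reads in the same order. The helper names are hypothetical, and this is not the code of any of the scripts named above.

```python
import io
from itertools import islice


def read_records(handle):
    """Yield each FASTQ record as a list of its four lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately; return the number of pairs."""
    pairs = 0
    r2_records = read_records(r2_handle)
    for rec1 in read_records(r1_handle):
        rec2 = next(r2_records, None)
        if rec2 is None:
            raise ValueError("R2 has fewer reads than R1")
        out_handle.writelines(rec1)
        out_handle.writelines(rec2)
        pairs += 1
    if next(r2_records, None) is not None:
        raise ValueError("R2 has more reads than R1")
    return pairs


# Made-up two-read example.
r1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGT\n+\nII\n")
r2 = io.StringIO("@a/2\nTT\n+\nII\n@b/2\nCC\n+\nII\n")
merged = io.StringIO()
print(interleave(r1, r2, merged), "pairs written")
```

Raising an error when the two files differ in length is a deliberate choice here: silently dropping unpaired reads would hide a problem upstream.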

10 Splitting FASTA/Q files into chunks
For example, to spread alignment load.
For FASTA files, using fastasplit (Exonerate):
fastasplit -f in.fasta -o outdir -c 100
For FASTQ files, as long as not multi-line FASTQ, can use the Linux split command (the line count must be a multiple of 4):
split -l 1000 in.fastq outprefix_
Using NGSUtils:
fastqutils split in.fastq outprefix_ 100
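The reason `split -l` works on four-line FASTQ is that any multiple of four lines contains only whole reads. A minimal Python sketch of the same idea follows; `chunk_fastq` and `split_fastq` are hypothetical names, and the output file naming scheme is an assumption made for this example only.

```python
import io
from itertools import islice


def chunk_fastq(handle, reads_per_chunk):
    """Yield lists of four-line FASTQ records, reads_per_chunk at a time."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    while True:
        lines = list(islice(handle, 4 * reads_per_chunk))
        if not lines:
            return
        if len(lines) % 4:
            raise ValueError("truncated FASTQ record")
        yield [lines[i:i + 4] for i in range(0, len(lines), 4)]


def split_fastq(in_handle, out_prefix, reads_per_chunk):
    """Write each chunk to <out_prefix>NNN.fastq; return the file names."""
    names = []
    for index, chunk in enumerate(chunk_fastq(in_handle, reads_per_chunk)):
        name = f"{out_prefix}{index:03d}.fastq"
        with open(name, "w") as out:
            for record in chunk:
                out.writelines(record)
        names.append(name)
    return names


# Five made-up reads split two at a time.
reads = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
print([len(chunk) for chunk in chunk_fastq(io.StringIO(reads), 2)])  # [2, 2, 1]
```

To aim for a given number of chunks rather than a chunk size, count the reads first and divide, rounding up.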

11 Exercise: FASTQ/FASTA
1. Convert the file example.fastq in the Documents directory into a FASTA file.
2. Interleave the two LIB6574 files inside Documents/reads to make a single FASTQ file.
3. Split the file exreads.fastq in the Documents directory into 5 (approximately) chunks.
4. Split the file example.fastq in the Documents directory into 3 (approximately) chunks.

12 Sequence databases
Primary nucleotide DBs have their own native formats:
ENA db: EMBL format.
NCBI Nucleotide db ("GenBank"): GenBank format.
DDBJ: DDBJ format, very similar to GenBank.
Primary protein DBs likewise:
UniProt Knowledgebase: Swiss-Prot format, essentially the same as EMBL format.
NCBI Protein db: GenBank format ("GenPept").
Most sequence DBs will also provide the data in FASTA format.
Other DBs (e.g. for a particular genome-sequencing project) might use their own or standard formats.

13 Exercise: Sequence databases (1)
We will query ENA for some entries representing (partial) gene sequences of Purple Osier Willow.
Obtain an entry in native ENA ("EMBL") format, and in FASTA format.
Repeat the query in the NCBI Nucleotide DB to obtain the equivalent record in GenBank format.
In a different search, we will query the Sequence Read Archive (SRA) to obtain FASTA- and FASTQ-format data from the genome-sequencing project of the same Willow.
We will use the NCBI implementation of SRA (the ENA or DRA versions could be used for the same search).
This sequencing project used 454 sequencing, which keeps the data sets (relatively) small.
This kind of data is made available in compressed files, so we will uncompress and examine the files.

14 Exercise: Sequence databases (2)
Search ENA for: Salix purpurea
Examine the hit-list of coding sequences.
Choose an entry representing a whole (not partial) gene.
Obtain native (EMBL) format and FASTA-format files of this.
Make a note of the Accession number of the record.
Examine the EMBL-format record: can you see cross-references to other databases? Any to the UniProt KnowledgeBase?
Make a note of any cross-reference to UniProtKB which you see.
Extra exercise if you have time: find, examine and download in Swiss-Prot format this UniProtKB entry.

15 Exercise: Sequence databases (3)
At NCBI, change "All Databases" to "Nucleotide" and search for Salix purpurea.
To narrow down the hit list, click "Advanced" (under the search box).
Restrict the search to:
Organism = Salix purpurea
Entries which do NOT have "partial cds" in any field
How many of the hits appear to be protein-coding sequences?
The entry equivalent to the one found in the ENA search should be in the list. What is its Accession number?
Examine the record.
Click on "Send to" to download the entry in GenBank format.

16 Exercise: Sequence databases (4)
Obtaining read data sets (FASTA and/or FASTQ) from SRA: change the DB to search to SRA; search for Salix purpurea.
The hit list is a list of sequencing experiments.
The accession of an SRA experiment begins with SRX.
Among the hit list, look for those annotated as "random whole genome shotgun library".
Note that these are 454 (GS FLX) sequence reads; each set is much smaller than the others (Illumina GA II).
Pick the smallest experiment (read set) (should take you here:

17 Exercise: Sequence databases (5)
Each experiment is associated with one or more sequencing runs. This experiment has only one run.
Click on the link (SRR070318).
Click the "Reads" tab. Individual reads can be examined, but here we will download the set in bulk.
Click on the "Filtered Download" button.
Select "clipped" and "FASTA"; click "Download".
This will deliver the whole set of reads (auto quality-clipped) in a single compressed (gzipped) file.
The Linux (Ubuntu) archive manager should automatically provide access to the contents of this compressed file.
It can be examined e.g. in a text editor.
Then repeat, but this time obtain the FASTQ-format file.

18 Alignments
SAM format: Sequence Alignment/Map.
BAM format: binary version of SAM (compressed, more efficient).
Use SAMtools to process.
Reference: TGCTTAGTCCTTAGTCTACTAGT
Reads:
CTTAGTCC
CTTGGTCT (SNP? Error?)
CTAAGCTA (insertion)

19 The SAM file
Fields, in order: Read ID, Flags, Ref ID, Pos, MAPQ, CIGAR, Mate, Read, Qualities, Optional fields.
Read1 0 TheRef M * 0 0 CTTAGTCC EEDDEEDE AS:i:8 XS:i:0
Read2 16 TheRef M * 0 0 CTTGGTCT FFEEDDEE AS:i:7 XS:i:0
Read3 0 TheRef M2I3M * 0 0 CTAAGCTA GGGHHHHH AS:i:5 XS:i:0
Reference: TGCTTAGTCCTTAGTCTACTAGT
Reads:
CTTAGTCC
CTTGGTCT (SNP? Error?)
CTAAGCTA (insertion)
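A SAM alignment line is tab-separated, with eleven mandatory fields followed by optional tags, so it is straightforward to pick apart. The Python sketch below shows the idea. The helper names are hypothetical. In the example line, the read, qualities and tags are those shown for Read3, but the position (16), MAPQ (42) and CIGAR (3M2I3M) are assumed values inserted only to make the line complete. Only a handful of flag bits are listed.

```python
import re

# A few of the bits defined for the FLAG field (not the full set).
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse strand",
    0x100: "secondary alignment",
    0x400: "duplicate",
}


def parse_sam_line(line):
    """Split one SAM alignment line into a dictionary of named fields."""
    fields = line.rstrip("\n").split("\t")
    names = ["read_id", "flag", "ref_id", "pos", "mapq", "cigar",
             "mate_ref", "mate_pos", "insert_size", "read", "qualities"]
    record = dict(zip(names, fields[:11]))
    for key in ("flag", "pos", "mapq", "mate_pos", "insert_size"):
        record[key] = int(record[key])
    record["tags"] = fields[11:]
    return record


def decode_flag(flag):
    """List the names of the known bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M2I3M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def query_length(cigar):
    """Number of read bases covered by the CIGAR (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


# Position, MAPQ and CIGAR below are assumed values, see the note above.
line = "Read3\t0\tTheRef\t16\t42\t3M2I3M\t*\t0\t0\tCTAAGCTA\tGGGHHHHH\tAS:i:5\tXS:i:0"
record = parse_sam_line(line)
print(record["read_id"], decode_flag(record["flag"]), parse_cigar(record["cigar"]))
print(query_length(record["cigar"]) == len(record["read"]))
```

Checking that the CIGAR accounts for every base in the read is a cheap sanity test when a file has passed through several conversion steps.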

20 SAMtools
SAMtools tools:
view: filter SAM or BAM
sort: sort according to position on reference
index: create fast look-up of BAM or SAM
tview: text viewer for alignments
mpileup: generate pileup (BCF) file, e.g. for SNP calling
merge: merge sorted alignments
rmdup: remove potential PCR duplicates
and more
For more info:

21 Multiple Sequence Alignments
Related but different aims, meanings and file formats.
Sequence read alignment ("assembly"): each nucleotide position (column) represents multiple copies of the same base of an original sequence (e.g. a genome sequence).
Multiple protein or nucleotide sequence alignment: each position (column) represents a homologous nucleotide (or amino acid). The sequences are evolutionarily related (homologous), typically from different organisms and/or multiple members of a gene family. Gaps represent insertions/deletions.

22 Multiple Sequence Alignments
Various file formats for MSA.
A multiple alignment can be represented in FASTA format.
MSA-dedicated formats are more richly annotated and more flexible for some purposes: MSF, Stockholm, Selex and others.
Each nucleotide or amino acid, and indel, is represented explicitly (cf. SAM/BAM).

23 Multiple Sequence Alignments MSF Stockholm

24 Automation saves effort and prevents errors
Many (but not all) sequence formats are flatfiles: they consist of plain-text characters.
It may be convenient to:
Examine a file's contents, e.g. with UNIX/Linux less or a text editor such as gedit. This can be useful as a quick sanity check.
Perform a single operation on a single sequence manually.
But if even a simple manual operation is to be repeated many times, errors are likely.
Manual operations are likely to be infeasible for large sequence sets, or possible but very time-wasting.
If you find yourself doing something repetitive using interactive tools, ask yourself if there might be an easier way.
Often the answer is: there must be an easier way.

25 Automation saves effort and prevents errors
Repetitive chains of operations:
Data set A, in file A1
Reformat fileA1 -> fileA2
Input fileA2 into tool X -> (output) fileA3
Reformat fileA3 -> fileA4
Input fileA4 into tool Y -> (output) fileA5
Next week, repeat on Data set B.
Use automated pipelines:
Re-usability of analysis steps/tools, in different combinations for different purposes.
Ideally, records each input/output process.
E.g. Galaxy.

26 The (t)errors of cut-and-paste
A real-world example (but not with this actual sequence).
A plant scientist working on a particular gene/protein asked a bioinformatician colleague to do some analyses on the protein sequence, along with those from the same family in related plants. The sequences were emailed to the bioinformatician.
Unsurprisingly, the family of proteins exhibited numerous amino acid substitutions and insertions/deletions.
It was noticed that one sequence alone had two instances of an inserted dipeptide, Phenylalanine-Threonine. These were 59 amino acids apart, and appeared to be absent from all related proteins in the databases.

27 The (t)errors of cut-and-paste
The sequence as received:
>WillowMatK
FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTLFTARKHKSTVRIFLK
RLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICINFTELSNHE
The corresponding EMBL record:
ID AJ849584; SV 1; linear; genomic DNA; STD; PLN; 622 BP.
DE Salix purpurea chloroplast partial tRNA-Lys gene intron and partial matK
DE gene for maturase K, clone A
XX
KW matK gene; maturase K; tRNA-Lys.
XX
FT /gene="matK"
FT /product="maturase K"
FT /db_xref="GOA:A0ZVW3"
FT /db_xref="InterPro:IPR024937"
FT /db_xref="UniProtKB/TrEMBL:A0ZVW3"
FT /protein_id="cah "
FT /translation="FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTL
FT ARKHKSTVRIFLKRLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICIN
FT ELSNHE"
XX
SQ Sequence 622 BP; 205 A; 123 C; 97 G; 197 T; 0 other;
gggttgcccg ggactcgaac ccggactagt cggatggagt agagaatttc tttgttaaaa 60
The two extra "FT" dipeptides sit exactly where the /translation qualifier wraps onto a new line: the FT line prefixes of the feature table were pasted in along with the sequence.
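Extracting the translation with a few lines of code, rather than by cut-and-paste, avoids this kind of error because the FT line prefix is removed systematically. The Python sketch below is one possible approach, assuming the usual EMBL layout in which feature-table lines begin with FT and the qualifier value is enclosed in double quotes; `extract_translation` is a hypothetical helper, and an established parser would normally be preferable.

```python
def extract_translation(embl_lines):
    """Return the first /translation value, with the FT line prefixes removed."""
    parts, inside = [], False
    for line in embl_lines:
        if not line.startswith("FT"):
            if inside:
                break  # left the feature table before the closing quote
            continue
        text = line[2:].strip()  # drop the 'FT' prefix and the padding
        if not inside:
            if not text.startswith('/translation="'):
                continue
            text = text[len('/translation="'):]
            inside = True
        if text.endswith('"'):
            parts.append(text[:-1])  # closing quote marks the end of the value
            break
        parts.append(text)
    return "".join(parts)


EMBL_LINES = [
    'FT   /product="maturase K"',
    'FT   /translation="FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTL',
    'FT   ARKHKSTVRIFLKRLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICIN',
    'FT   ELSNHE"',
    "XX",
]
protein = extract_translation(EMBL_LINES)
print(protein)
print("KTLFTARK" in protein)  # False: no spurious FT at the line join
```

Note that the genuine protein does contain one Phe-Thr pair (in ...DEFFTEEEQ...), so simply deleting every "FT" from the pasted sequence would have introduced a different error.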

28 Where to get software FASTX Toolkit: NGSUtils: EMBOSS: Exonerate: Velvet: Interleave_fastq.py: popgentools: SAMtools:

29 Thank you Any questions?