A Comprehensive metatranscriptomics analysis pipeline and its validation using human small intestine microbiota metatranscriptome

1 A Comprehensive metatranscriptomics analysis pipeline and its validation using human small intestine microbiota metatranscriptome
NBIC: 3rd Metagenomics Seminar, Utrecht, September 25th, 2012
Javier Ramiro Garcia, Mark Davids, Peter Schaap
Wageningen University

2 Aims
- To develop a fast and robust bioinformatics pipeline for metatranscriptome analysis
- To validate the pipeline using human small intestine microbiota metatranscriptome samples

3 Human gastro-intestinal tract microbiota
- Microbial cells outnumber host cells by 10-fold
- Associated with diabetes, obesity and intestinal disease
- > 1000 species of microbes in the gut ecosystem
- Large (~80%) uncultured fraction

4 Unexplored small intestinal microbiota
Colon:
- Good accessibility
- Well studied microbiota
Small intestine:
- Poorly accessible
- Relatively unexplored

5 Molecular approaches to study microbial communities Zoetendal et al., 2008, Gut 57:

6 Sampling the human small intestine
Surgical removal of the colon or small bowel transplant (SBT), e.g. for Crohn's disease, ulcerative colitis, cancer
Healthy subject:
- Invasive (sampling with catheter)
- Limited amount of material
- One time point sampling
versus
Ileostomy subject:
- Non-invasive (sampling of luminal microbiota of the distal ileum)
- Sufficient amount of material (up to 100 ml)
- Repeated sampling, dietary intervention

7 Experimental design for microbiota analysis
Starting material: ileostomy effluent
- RNA extraction, mRNA, ds cDNA, RNA sequencing on Illumina (reads ~100 bp; 9-42 million reads / sample): bacteria (meta)genome; mRNA (function/pathways)
- RNA extraction, 16S rRNA RT-PCR, pyrosequencing: 16S rRNA (activity)
- DNA extraction, 16S rDNA PCR, pyrosequencing: 16S rDNA (community)
- Bioinformatics analysis

8 Samples
Single-end reads:
- A: 29,709,278 reads
- A-rep: 8,951,083 reads (technical replicate)
Paired-end reads:
- B-left & B-right: 42,211,887 reads each

9 General layout of the RNA-seq data pipeline
- Sequencing reads (pyrosequencing or Illumina)
- Quality check
- Assignment of sequencing reads to genes or proteins using a blast algorithm (blastx) against the NCBI database of bacteria
- Annotation of the identified genes or proteins (COG/KEGG)
Goal: accurate gene/protein assignment

10 Illumina reads quality (FastQC)
[Figure: four panels (A, A-rep, B-left, B-right); x-axis: mean sequence quality (Phred score), classed as poor / average / high; y-axis: number of reads (raw)]
Between 71% and 83% of the sequencing reads are high quality

11 Computational time required by the RNA-seq pipeline
- Illumina reads QC
- Illumina reads (100 bp): large number of input sequences
- Putative mRNA reads, BlastX against the NCBI database of bacteria: highly computationally demanding
- Determination of database
- Reads assignment to gene/protein
- Phylogenetic profiling
- Metabolic mapping

12 Reduction of the database size
Will not be performed because of:
- The general purpose of the pipeline (it should be applicable to other environmental samples)
- The risk of excluding from the selected database some species that may be present in the samples

13 Reduction of the number of reads by pooling
Identical sequences are pooled into unique sequences (Uniq-Seq):
- Before pooling: Header_a CACT, Header_b TACG, Header_c AACT, Header_d GCGC, Header_e CACT, Header_f AACT, Header_g TTAG
- After pooling: Header_a_2 CACT, Header_b TACG, Header_c_2 AACT, Header_d GCGC, Header_g TTAG
[Figure: sequencing reads (%) for A, A-rep, B-left, B-right; total reads versus unique reads after pooling]
- ~54% reduction for single-end reads
- ~70% reduction for paired-end reads
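
The pooling step can be illustrated with a minimal Python sketch. It assumes, based only on the example on this slide, that identical sequences are collapsed onto the first header seen and that the copy number is appended to that header when it is greater than one. The function name `pool_reads` is hypothetical and is not the pipeline's actual implementation.

```python
def pool_reads(reads):
    """Collapse identical sequences into unique reads.

    reads: iterable of (header, sequence) tuples.
    Returns a list of (header, sequence) tuples in order of first
    appearance; headers of sequences seen more than once get the
    copy number appended (assumed naming scheme, e.g. Header_a_2).
    """
    pooled = {}  # sequence -> [first header, copy number]
    for header, seq in reads:
        if seq in pooled:
            pooled[seq][1] += 1
        else:
            pooled[seq] = [header, 1]

    unique = []
    for seq, (header, count) in pooled.items():
        name = f"{header}_{count}" if count > 1 else header
        unique.append((name, seq))
    return unique


# The seven reads shown on the slide.
slide_reads = [
    ("Header_a", "CACT"), ("Header_b", "TACG"), ("Header_c", "AACT"),
    ("Header_d", "GCGC"), ("Header_e", "CACT"), ("Header_f", "AACT"),
    ("Header_g", "TTAG"),
]
unique_reads = pool_reads(slide_reads)
```

With the slide's example, seven reads collapse to five unique reads, and the copy numbers kept in the headers allow abundances to be restored after the blast steps.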

14 Filtering process
Reads without biological function:
- Ribosomal RNA
- PhiX spike-in and Illumina adaptor sequences
Filtering procedure:
- Megablast against the filter database
- Validation: FDR = %, min. alignment of 28 nt
- Megablast is not fast enough and consumes a lot of RAM: development of a new algorithm

15 rRNA filter development
- Filter database converted to 28-mers
- Pooled Illumina reads are screened against the filter database and split into rRNA reads and non-rRNA reads
[Figure: rRNA reads distribution (%) for A, A (rep), B-left, B-right; total rRNA reads versus non-rRNA reads]
>75% of total reads are non-functional
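
The slides do not describe the new filtering algorithm beyond the use of 28-mers, so the following is only a sketch of one plausible k-mer approach: index every 28-mer of the filter database (both strands) and flag a read as rRNA when it shares at least one exact 28-mer with the index. All function names are hypothetical, and the single shared 28-mer criterion is an assumption derived from the "min. alignment of 28 nt" setting.

```python
K = 28  # k-mer size taken from the slide

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq, k=K):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_filter_index(reference_seqs, k=K):
    """Index the k-mers of every filter sequence on both strands."""
    index = set()
    for ref in reference_seqs:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def split_reads(reads, index, k=K):
    """Split (header, sequence) reads into (filtered, kept).

    A read is filtered when any of its k-mers is present in the index
    (assumed criterion, not confirmed by the source).
    """
    filtered, kept = [], []
    for header, seq in reads:
        s = seq.upper()
        hit = any(s[i:i + k] in index for i in range(len(s) - k + 1))
        (filtered if hit else kept).append((header, seq))
    return filtered, kept
```

A set lookup per k-mer avoids building alignments, which is one way such a filter could be faster than Megablast; memory use grows with the number of distinct 28-mers in the filter database.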

16 Blast strategy
- Illumina reads QC (FastQC)
- Pooled Illumina reads
- rRNA filter database: reads split into rRNA reads and non-rRNA reads
- Putative mRNA reads, BlastX against the NCBI database of bacteria: highly computationally demanding
- Reads assignment to gene/protein
- Phylogenetic profiling
- Metabolic mapping

17 Blast strategy for putative mRNA reads assignment
- Megablast of the putative mRNA reads against the bacteria genome database: reads assigned to genome / reads not assigned to genome
- Blastn of the reads not assigned by Megablast against the bacteria genome database: reads assigned to genome / reads not assigned to genome (NAG)
- Blastx of 10% of the NAG reads against the NCBI protein database: reads assigned to protein (NCBI)
- Blastx against the MetaHIT & small intestine (M&SI) databases: reads assigned to protein (M&SI)
- Remaining reads: unassigned reads
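
The tiered strategy can be sketched as a cascade in which each search only sees the reads left over by the previous one, and only a 10% random subset of the reads not assigned to a genome (NAG) goes on to the protein searches. The sketch below does not call BLAST: each search is a caller-supplied function that returns the set of read IDs it could assign, and all names are hypothetical.

```python
import random


def assign_in_tiers(read_ids, nucleotide_steps, protein_steps,
                    protein_fraction=0.10, seed=0):
    """Run searches in sequence, passing on only unassigned reads.

    nucleotide_steps / protein_steps: lists of (label, search) pairs,
    where search(list_of_ids) returns the IDs it could assign
    (placeholders for megablast, blastn and blastx runs).
    Returns (assigned, unassigned_sample, n_not_assigned_to_genome).
    """
    assigned = {}
    remaining = list(read_ids)

    # Nucleotide tiers, e.g. megablast followed by blastn.
    for label, search in nucleotide_steps:
        hits = set(search(remaining))
        for read_id in remaining:
            if read_id in hits:
                assigned[read_id] = label
        remaining = [r for r in remaining if r not in hits]
    n_nag = len(remaining)

    # Only a random fraction of the NAG reads is sent to blastx.
    rng = random.Random(seed)
    sampled = rng.sample(remaining, round(n_nag * protein_fraction))

    # Protein tiers, e.g. NCBI first, then MetaHIT & small intestine.
    for label, search in protein_steps:
        hits = set(search(sampled))
        for read_id in sampled:
            if read_id in hits:
                assigned[read_id] = label
        sampled = [r for r in sampled if r not in hits]

    return assigned, sampled, n_nag
```

Ordering the searches from fastest to slowest keeps the expensive blastx step limited to a small subset of the reads, which is the point of the strategy on this slide.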

18 Validation of cut-off value
- Illumina reads QC (FastQC)
- Pooled Illumina reads
- rRNA filter database: reads split into rRNA reads and non-rRNA reads
- Putative mRNA reads: sequence of blasting
- Validation of cut-off value: significant hits / non-significant hits
- Significant hits: phylogenetic profiling and metabolic mapping

19 Validation of cut-off value
- NCBI complete bacteria coding regions
- Generate 10,000 random in silico reads of 100 bp
- Blast hits / in silico read
- Grouped by bit score
- Check the COG & taxonomy: match / mismatch
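
Generating the in silico reads can be sketched as follows. The slide only states the number (10,000) and length (100 bp) of the reads, so the sampling scheme here is an assumption: a coding region is picked uniformly at random, then a start position is picked uniformly within it. The function name and the returned tuple layout are hypothetical.

```python
import random


def simulate_reads(coding_regions, n_reads=10_000, read_length=100, seed=42):
    """Draw random fixed-length reads from known coding regions.

    coding_regions: dict mapping gene ID -> nucleotide sequence.
    Returns a list of (read_id, gene_id, start, sequence) tuples, so the
    true origin of every read is known when its blast hits are checked.
    """
    rng = random.Random(seed)
    # Regions shorter than one read cannot yield a full-length read.
    eligible = [(gene_id, seq) for gene_id, seq in coding_regions.items()
                if len(seq) >= read_length]
    if not eligible:
        raise ValueError("no coding region is long enough")

    reads = []
    for i in range(n_reads):
        gene_id, seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - read_length + 1)
        reads.append((f"sim_{i}", gene_id, start,
                      seq[start:start + read_length]))
    return reads
```

Because each simulated read carries its gene of origin, its best blast hit can be scored as a match or mismatch for COG and taxonomy and then grouped by bit score.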

20 Validation of genes and protein assignment
[Figure: % match versus Megablast bit score]
For accurate assignment of:
- COG: 74 bit score with 95% confidence
- Family: 110 bit score with 80% confidence
- Genus: 148 bit score with 80% confidence
- Not possible for assignment at species level
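
Applying these cut-offs to a blast hit amounts to a simple lookup. The sketch below assumes that the three bit scores on the slide pair with COG, family and genus in the order listed, which is a reading of the flattened slide rather than a confirmed mapping; the names are hypothetical.

```python
# (level, minimum bit score); pairing assumed from the order on the slide.
BIT_SCORE_CUTOFFS = [
    ("genus", 148),
    ("family", 110),
    ("COG", 74),
]


def supported_levels(bit_score):
    """Return the annotation levels a hit with this bit score supports.

    Species is never returned: the slide states that species-level
    assignment is not possible.
    """
    return [level for level, minimum in BIT_SCORE_CUTOFFS
            if bit_score >= minimum]
```

For example, a hit with a bit score of 120 would be used for family and COG annotation but not for genus-level profiling.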

21 Reads distribution based on blast procedure and bit score
[Figure, left panel: abundance (%) per sample (A, A-rep, B-left, B-right) of reads assigned to genome (megablast), assigned to genome (blastn) and unassigned. Right panel: abundance (%) per sample of genome-assigned reads by bit score class]
- The majority (>75%) of the reads can be assigned to a genome using megablast & blastn, avoiding the use of blastx
- The majority (>54%) of the genome-assigned reads have a bit score of 148: assignment at genus (80%), family (97%), COG (~100%)

22 Confidence of assignment of the mRNA reads to the genus level
- Selection of genes that belong to the gal operon of Streptococcus salivarius CCHS33
- Blast all putative mRNA reads against these genes
[Figure: read coverage of coding and non-coding regions of the gal operon (galK, galT, galE, galM)]
- ~32% of the reads that can be assigned to those genes belong to S. salivarius CCHS33, while the rest come from other Streptococcus species; no other genus detected
- Increased confidence of genus assignment

23 Reads assignment to the genes
[Figure: distribution of reads that can be assigned to genome, per sample; number of reads (non-coding assigned reads, total coding assigned reads) and number of genes (total protein encoding genes)]
Total protein encoding genes: A 60,773; A-rep 36,348; B-left 91,474; B-right 89,876
Normalisation & determination of significant genes:
- Gene length normalisation
- Removal of genes with <0.0005% reads abundance
[Figure: number of reads (non-coding assigned reads, non-significant coding assigned reads, significant coding assigned reads) and number of genes (significant protein encoding genes), per sample]
Significant protein encoding genes: A 10,556; A-rep 9,063; B-left 12,646; B-right 12,673
Reduction of a large number of genes while discarding only <8% of the total gene-assigned reads
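
The two normalisation steps can be sketched as below. The slide does not say whether the 0.0005% threshold applies to raw or length-normalised counts; this sketch assumes the threshold is applied to the length-normalised relative abundance, and the function name is hypothetical.

```python
def significant_genes(read_counts, gene_lengths, min_percent=0.0005):
    """Length-normalise read counts and drop low-abundance genes.

    read_counts: dict gene ID -> number of assigned reads.
    gene_lengths: dict gene ID -> gene length in nucleotides.
    Returns dict gene ID -> relative abundance (%) for genes at or
    above min_percent (threshold applied after normalisation: assumed).
    """
    # Reads per nucleotide, so long genes are not favoured.
    normalised = {gene: read_counts[gene] / gene_lengths[gene]
                  for gene in read_counts}
    total = sum(normalised.values())
    if total == 0:
        return {}

    abundance = {gene: 100.0 * value / total
                 for gene, value in normalised.items()}
    return {gene: pct for gene, pct in abundance.items()
            if pct >= min_percent}
```

A gene covered by a single read among millions falls below the threshold and is dropped, which is consistent with the slide's observation that many genes are removed while few reads are lost.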

24 Increase of gene identification accuracy using multiple reads assignment
[Figure: read counts (%) and gene counts (%) versus gene length coverage (%) of protein encoding genes, with reads grouped by average bit score class (including 148)]
Increased confidence of gene assignment

25 Validation of technical replicates and paired-end reads
Pearson correlations:
- Paired-end reads = 1 (p<0.01)
- Single-end replicates = (p<0.01)
[Figure: paired-end matched reads (same genome), paired-end matched reads (different genome), unique reads; surviving value label: 9.84]
Robustness of the pipeline for functional annotation of the replicates and paired-end reads
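
For reference, the Pearson correlation between two samples can be computed from their per-gene (or per-COG) abundance profiles as sketched below. Which profiles were actually correlated is not stated on the slide, so the inputs are an assumption, and the sketch does not compute the p-value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)
```

A coefficient close to 1 between B-left and B-right, or between A and A-rep, indicates that the pipeline assigns the same relative abundances to both inputs.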

26 COG distribution of the genes
[Figure: abundance (%) of COG categories per sample (A, A-rep, B-left, B-right); labelled categories include Metabolism and Information storage and processing]
Robustness of the pipeline for functional annotation

27 Functional analysis (metabolic pathways)
- Nucleotide metabolism
- Lipid metabolism
- Carbohydrate metabolism
- Amino acid metabolism
- Energy metabolism

28 Experimental design for microbiota analysis
Starting material: ileostomy effluent
- RNA extraction, mRNA, ds cDNA, RNA sequencing on Illumina (reads ~100 bp; 9-40 million reads / sample): bacteria (meta)genome; mRNA (function/pathways)
- RNA extraction, 16S rRNA RT-PCR, pyrosequencing: 16S rRNA (activity)
- DNA extraction, 16S rDNA PCR, pyrosequencing: 16S rDNA (community)
- Bioinformatics analysis

29 Taxonomic distribution at genus level
[Figure: relative abundance (%) by genus. Pyrosequencing panel: 16S rDNA and 16S rRNA for samples A and B. RNA-seq panel: A, A-rep, B-left, B-right. Legend: Haemophilus, Bifidobacterium, Turicibacter, Gemella, Streptococcus, Rothia, Lactobacillus, Lactococcus, Veillonella, Clostridium, Unclassified, Others]
Correlation between microbiota composition, overall activity and specific activity of the community members

30 Final set-up of the bioinformatics pipeline
Legend: input, processes, intermediate output, discarded output
- Input: pooled Illumina reads
- Filter for rRNA (filter database): rRNA reads
- Reads assignment to the genome (genome database): non-assigned reads
- Reads classification: gene assigned reads / non-gene assigned reads
- Blastx (10%) to protein database (NCBI protein database): non-assigned reads after blastx (10%)
- Blastx (10%) to protein database (MetaHIT & SI protein databases): unassigned reads
- Gene annotations (COG/KEGG database)
- Metabolic mapping & biological interpretation

31 Summary
- Accurate COG functional assignment: >95% confidence level
- Phylogenetic assignment: genus >80%, family >97% (~54% of the genome-assigned reads)
- Robustness of functional assignment: technical replicates & paired-end reads
- Correlation between microbiota composition, overall activity and specific activity of the community members

32 Acknowledgement
Milkha M Leimena, Mark Davids, Matthijn C Hesselman, Tom vd. Bogert, Jos Boekhorst, Eddy Smid, Erwin Zoetendal, Michiel Kleerebezem, Peter Schaap, Hauke Smidt