Reference: cahier_realisation_mini_projet-sepia-theba-1.0

Implementation report
SEPIA THEBA

AUTHORS
Name                   Unit
Gildas Le Corguillé    ABiMS
Erwan Corre            ABiMS

DOCUMENT HISTORY
Version   Date       Nature of changes
1.0       10-07-13   creation
A. Nature of the request

Construction of a reference transcriptome for the species Theba pisana, and comparison of this transcriptome with the transcriptome of Sepia officinalis.

Needs (in decreasing order of priority)

No.  Need                                                            Formulated at iteration
1    Construction of a reference transcriptome for T. pisana         1
2    Comparison of the S. officinalis and T. pisana transcriptomes   2

B. Project limits

< Description of requests related to the project but not covered by the deliverables. >

C. Project management web addresses

Main page:
Sources:
Dashboard:

D. Deliverables

No.  Name                  Nature       Need(s) addressed  Description
1    Theba transcriptome   Fastq files  1                  Theba de novo transcriptome assembly
2
E. Iterations
1. Iteration 1

A. Description

Start date:                 08-07-13
Planned end date:           12-07-13
Actual end date:
Deliverable(s) concerned:   1

B. Steps

1 Quality check

Command line

    fastqc -t 8 -o /projet/sbr/tara/work/tmp/raw/theba/raw/ /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1.fq
    fastqc -t 8 -o /projet/sbr/tara/work/tmp/raw/theba/raw/ /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2.fq

No adapter sequences were found in the raw files; a few sequences correspond to mitochondrial 16S rRNA.

2 Cleaning

2.1 First prinseq-lite step

Parameters
- Trim poly-N tails with a minimum length of 1 at the 3'-end (-trim_ns_right)
- Trim poly-N tails with a minimum length of 1 at the 5'-end (-trim_ns_left)
- Filter out sequences with more than 0 Ns (-ns_max_n)
- Trim sequences by quality score from the 3'-end with a threshold score of 20 (-trim_qual_right)
- Filter out sequences with a mean quality score below 30 (-min_qual_mean)
- Filter out sequences shorter than 50 (-min_len)
- Filter out sequences with characters other than A, C, G, T or N (-noniupac)

    prinseq-lite.pl -trim_ns_right 1 -trim_ns_left 1 -ns_max_n 0 -trim_qual_right 20 -min_qual_mean 30 -min_len 50 -noniupac -fastq /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1.fq -out_good /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_good.fq -out_bad /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_bad.fq
    prinseq-lite.pl -trim_ns_right 1 -trim_ns_left 1 -ns_max_n 0 -trim_qual_right 20 -min_qual_mean 30 -min_len 50 -noniupac -fastq /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2.fq -out_good /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_good.fq -out_bad /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_bad.fq
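To make the first-pass criteria concrete, the sketch below applies the same thresholds to a single read in plain Python. It is only an illustration and not PRINSEQ's implementation: the function name `trim_and_filter`, the representation of qualities as a list of Phred integers, and the order of operations (N trimming, then 3' quality trimming, then filtering) are all assumptions.

```python
def trim_and_filter(seq, quals, trim_qual_right=20, min_qual_mean=30,
                    min_len=50, ns_max_n=0):
    """Return the trimmed (seq, quals) pair, or None if the read is rejected.

    Hypothetical helper: a simplified reading of the thresholds listed for
    the first cleaning pass, not a reimplementation of prinseq-lite.
    """
    # Trim runs of N at both ends (minimum run length of 1).
    left = len(seq) - len(seq.lstrip("N"))
    right = len(seq.rstrip("N"))
    if left >= right:
        return None  # the read consists only of Ns
    seq, quals = seq[left:right], list(quals[left:right])

    # Trim bases from the 3'-end while their quality is below the threshold.
    while quals and quals[-1] < trim_qual_right:
        quals.pop()
    seq = seq[:len(quals)]

    # Filters applied to the trimmed read.
    if not seq or len(seq) < min_len:
        return None  # too short
    if seq.count("N") > ns_max_n:
        return None  # too many Ns remain
    if set(seq.upper()) - set("ACGTN"):
        return None  # unexpected characters
    if sum(quals) / len(quals) < min_qual_mean:
        return None  # mean quality too low
    return seq, quals


# Toy reads (made-up data): a clean read, and one with a low-quality 3' tail.
clean = trim_and_filter("ACGT" * 15, [35] * 60)
tailed = trim_and_filter("ACGT" * 15, [35] * 55 + [10] * 5)
print(len(clean[0]), len(tailed[0]))
```

Because trimming is applied before the length filter in this sketch, a read can be rejected by the length threshold even when its original length is above it.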
Input and filter stats for MsPp1_l1_1.fq:
    Input sequences: 27,184,390
    Input bases: 2,446,595,100
    Input mean length: 90.00
    Good sequences: 27,148,012 (99.87%)
    Good bases: 2,443,270,870
    Good mean length: 90.00
    Bad sequences: 36,378 (0.13%)
    Bad bases: 3,274,020
    Bad mean length: 90.00
    Sequences filtered by specified parameters:
    ns_max_n: 36378

Input and filter stats for MsPp1_l1_2.fq:
    Input sequences: 27,184,390
    Input bases: 2,446,595,100
    Input mean length: 90.00
    Good sequences: 27,176,020 (99.97%)
    Good bases: 2,445,841,610
    Good mean length: 90.00
    Bad sequences: 8,370 (0.03%)
    Bad bases: 753,300
    Bad mean length: 90.00
    Sequences filtered by specified parameters:
    ns_max_n: 8370

2.2 Second prinseq-lite step

Parameters
- Trim poly-A/T tails with a minimum length of 5 at the 5'-end (-trim_tail_left)
- Trim poly-A/T tails with a minimum length of 5 at the 3'-end (-trim_tail_right)
- Method used to filter low-complexity sequences: entropy (-lc_method)
- Threshold value (between 0 and 100) used to filter sequences by complexity. The dust method uses it as the maximum allowed score and the entropy method as the minimum allowed value: 70 (-lc_threshold)
- Filter out sequences shorter than 50 (-min_len)

    prinseq-lite.pl -trim_tail_left 5 -trim_tail_right 5 -lc_method entropy -lc_threshold 70 -min_len 50 -fastq /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_good.fq -out_good /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_good2.fq -out_bad /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_bad2.fq
    prinseq-lite.pl -trim_tail_left 5 -trim_tail_right 5 -lc_method entropy -lc_threshold 70 -min_len 50 -fastq /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_good.fq -out_good /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_good2.fq -out_bad /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_bad2.fq

Input and filter stats for MsPp1_l1_1.fq:
    Input sequences: 27,148,012
    Input bases: 2,443,270,870
    Input mean length: 90.00
    Good sequences: 25,387,568 (93.52%)
    Good bases: 2,281,657,234
    Good mean length: 89.87
    Bad sequences: 1,760,444 (6.48%)
    Bad bases: 158,436,820
    Bad mean length: 90.00
    Sequences filtered by specified parameters:
    trim_tail_left: 2715
    trim_tail_right: 116
    min_len: 13344
    lc_method: 1744269

Input and filter stats for MsPp1_l1_2.fq:
    Input sequences: 27,176,020
    Input bases: 2,445,841,610
    Input mean length: 90.00
    Good sequences: 25,422,144 (93.55%)
    Good bases: 2,284,841,088
    Good mean length: 89.88
    Bad sequences: 1,753,876 (6.45%)
    Bad bases: 157,848,827
    Bad mean length: 90.00
    Sequences filtered by specified parameters:
    trim_tail_left: 2055
    trim_tail_right: 94
    min_len: 12213
    lc_method: 1739514

2.3 Ribosomal sequence cleaning
    ribopicker.pl -f /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_good2.fq -dbs rrnadb -out_dir /projet/sbr/tara/work/tmp/raw/theba/raw/
    ribopicker.pl -f /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_good2.fq -dbs rrnadb -out_dir /projet/sbr/tara/work/tmp/raw/theba/raw/

MsPp1_l1_1_nonrrna.fq : 23473103 reads (86.35% of the raw reads)
MsPp1_l1_1_rrna.fq : 1914465 reads (7.54% of the cleaned reads)
MsPp1_l1_2_nonrrna.fq : 23510076 reads (86.48% of the raw reads)
MsPp1_l1_2_rrna.fq : 1912068 reads (7.52% of the cleaned reads)

2.4 Pairing the resulting reads for the k-mer-normalized Trinity assembly and for expression estimation

    get_pairs.py /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_nonrrna.fq /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_nonrrna.fq

MsPp1_l1_1_nonrrna.paired.fq : 22752577 reads
MsPp1_l1_2_nonrrna.paired.fq : 22752577 reads
MsPp1_l1_1_nonrrna.unpaired.fq : 720526 reads
MsPp1_l1_2_nonrrna.unpaired.fq : 757499 reads

3 Assembly

3.1 Paired de novo assembly

    /usr/local/genome2/trinityrnaseq/trinity.pl --seqtype fq --output /projet/sbr/tara/work/tmp/raw/theba/trinity_assembly/ --left /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_nonrrna.fq --right /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_nonrrna.fq --CPU 10 --JM 100G

For comparison, the table also reports a de novo assembly of the raw reads (without cleaning). The column labels are inferred from the values: the mean length equals total bases divided by the number of transcripts, and the third value, unlabeled in the source, is presumably the length of the longest transcript.

Assembly                  Transcripts  Total bases  Max length (presumed)  Mean length
Trinity.fasta             66206        23580832     5007                   356.17
Trinity_raw_reads.fasta   68264        24357398     4559                   356.81

3.2 Assembly normalized by k-mer coverage

    /usr/local/genome2/trinityrnaseq_r3-02-25/util/normalize_by_kmer_coverage.pl --seqtype fq --JM 100G --max_cov 30 --left /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_nonrrna.paired.fq --right /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_nonrrna.paired.fq --output /projet/sbr/tara/work/tmp/raw/theba/trinity_assembly_norm/ --pairs_together --PARALLEL_STATS --JELLY_CPU 10

Normalized paired reads:
MsPp1_l1_1_nonrrna.paired.fq.normalized_K25_C30_pctSD100.fq
MsPp1_l1_2_nonrrna.paired.fq.normalized_K25_C30_pctSD100.fq

    /usr/local/genome2/trinityrnaseq/trinity.pl --seqtype fq --output /projet/sbr/tara/work/tmp/raw/theba/trinity_assembly_norm/ --left /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_1_nonrrna.paired.fq.normalized_K25_C30_pctSD100.fq --right /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_2_nonrrna.paired.fq.normalized_K25_C30_pctSD100.fq --CPU 10 --JM 100G

Assembly                  Transcripts  Total bases  Max length (presumed)  Mean length
Trinity_norm.fasta        64144        22459171     4584                   350.14

4 Annotation

The Trinotate annotation process includes blastp versus UniProt, hmmsearch against Pfam, a TMHMM search and a SignalP search.

    /usr/local/genome2/scripts/trinotatewrapper/trinotatewrapper

Two Excel files are produced:
Trinity_annotation_report.xls
Trinity_norm_annotation_report.xls

5 Expression

5.1 Concatenation of single reads

    cat MsPp1_l1_1_nonrrna.paired.fq MsPp1_l1_1_nonrrna.unpaired.fq MsPp1_l1_2_nonrrna.unpaired.fq > MsPp1_l1_12_nonrrna.single.fq

5.2 Remapping and counting

Executed for both Trinity assemblies (normalized and not normalized).

    /usr/local/genome2/trinityrnaseq/util/rsem_util/run_rsem_align_n_estimate.pl --transcripts Trinity.fasta --seqtype fq --single /projet/sbr/tara/work/tmp/raw/theba/raw/mspp1_l1_12_nonrrna.single.fq

Two count files, one for isoforms and one for genes, are produced for each Trinity assembly.

Column description

'TPM' stands for Transcripts Per Million. It is a relative measure of transcript abundance. The sum of all transcripts' TPM is 1 million.
'FPKM' stands for Fragments Per Kilobase of transcript per Million mapped reads. It is another relative measure of transcript abundance.

'IsoPct' stands for isoform percentage. It is the percentage of this transcript's abundance over its parent gene's abundance. If its parent gene has only one isoform or the gene information is not provided, this field is set to 100.

Result files:
RSEM.isoforms.results
RSEM.genes.results

5.3 Filtering of lowly expressed transcripts and rare isoforms

To filter out likely transcript artifacts and lowly expressed transcripts, we retain, for both assemblies, only the transcripts that represent at least 1% of the per-component expression level (IsoPct) and that have FPKM values > 1.

    /usr/local/genome2/trinityrnaseq/util/filter_fasta_by_rsem_values.pl -r RSEM.isoforms.results -f Trinity.fasta -o Trinity_filtered.fasta --fpkm_cutoff=1 --isopct_cutoff=1.00

Four files are produced:
Trinity_filtered.fasta.rsem : assembly expression values
Trinity_filtered.fasta : assembly filtered
Trinity_norm_filtered.fasta.rsem : normalized assembly expression values
Trinity_norm_filtered.fasta : normalized assembly filtered

The column labels below are inferred from the values: the mean length equals total bases divided by the number of transcripts, and the third value, unlabeled in the source, is presumably the length of the longest transcript.

Assembly                     Transcripts                              Total bases  Max length (presumed)  Mean length
Trinity_filtered.fasta       63082 (vs. 66206 initial transcripts)    22274567     5007                   353.10
Trinity_norm_filtered.fasta  60599 (vs. 64144 initial transcripts)    21034300     4584                   347.11

Overall results of iteration 1

Data are accessible on this web page: http://application.sb-roscoff.fr/download/fr2424/abims/corre/theba/
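The selection rule of section 5.3 can be illustrated with a short Python sketch that reads a table in the tab-separated layout of RSEM.isoforms.results and keeps the transcript identifiers that pass both cutoffs. This is not the Trinity script itself: the function name `select_transcripts`, the toy values, and the choice to treat both cutoffs as inclusive lower bounds are assumptions made for the example.

```python
import csv
import io


def select_transcripts(rsem_table, fpkm_cutoff=1.0, isopct_cutoff=1.0):
    """Return the IDs of transcripts whose FPKM and IsoPct reach the cutoffs.

    rsem_table is any iterable of lines (for example an open file) with a
    header row naming the transcript_id, FPKM and IsoPct columns.
    Hypothetical helper; boundary handling is an assumption.
    """
    reader = csv.DictReader(rsem_table, delimiter="\t")
    kept = []
    for row in reader:
        fpkm = float(row["FPKM"])
        isopct = float(row["IsoPct"])
        if fpkm >= fpkm_cutoff and isopct >= isopct_cutoff:
            kept.append(row["transcript_id"])
    return kept


# Toy table with made-up values, in the column order of an isoforms file.
example = io.StringIO(
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "t1\tg1\t900\t811\t120.00\t15.20\t12.40\t99.50\n"
    "t2\tg1\t450\t361\t0.50\t0.08\t0.06\t0.50\n"
    "t3\tg2\t700\t611\t3.00\t0.50\t0.40\t100.00\n"
    "t4\tg3\t1200\t1111\t60.00\t6.10\t5.00\t100.00\n"
)
kept_ids = select_transcripts(example)
print(kept_ids)
```

In this toy table, t2 is a rare isoform of g1 and t3 is the only isoform of a weakly expressed gene, so the two cutoffs remove different kinds of transcripts.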