17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg (hackenberg@ugr.es)


1 Introduction

sRNAbench is a free web-server tool and standalone application for processing small RNA data obtained from next-generation sequencing platforms such as Illumina or SOLiD. It replaces miRanalyzer. This short tutorial provides a quick start for the web server. For further details, and for the complete sRNAbench functionality, see the main manual, which explains how to install the standalone version of the software.

2 Main menu

Figure 1: Main sRNAbench menu.

Home: sRNAbench main web page with basic information, releases, etc.
Restart: clears the sRNA analysis window (see Analysis window).
Web Manual: link to this manual.
Manual: main manual of sRNAbench.
Differential Expression: differential expression analysis of sRNAs (see Differential expression).
Helper Tools: tools to parse Ensembl and NCBI formats (see Helper tools).
Cite: software publication.
FAQs: link to frequently asked questions.

3 Analysis window

Figure 2: Analysis window.

3.1 Input data

Figure 3: Input data.

Datasets can be provided either by uploading a file from the local computer (i) or by means of a URL (ii). Big files must be gzip-compressed and provided by URL, because a file size limit applies to uploads made with the POST method.

Several input formats are accepted, and in general all of them can be compressed with gzip.

WARNING: Compressed files need the 'gz' extension.

fastq (or fastq.gz)
read count format: tab-separated, with the read sequence in the first column and the read count in the second.
fasta: the identifier field must encode the read count. In general, sRNAbench expects >readid#read_count (using '#' as separator).
Bowtie alignment files: these files must have the 'bowtieout' extension.
sra files

3.2 Select species

Figure 4: Select species.

sRNAbench can be used in two different modes: genome mapping (ii) or library mapping (i.a or i.b).

NOTE: sRNAbench can treat an unlimited number of species simultaneously. The main application of this feature might be the analysis of virus infections and the study of host/parasite interactions.

i. There are two ways to use library mode:
a. A species is selected and the 'Do not map to genome (Library mode)' checkbox is activated. sRNAbench then uses the annotations from the sRNAbench database for the selected species when mapping the reads.
b. No species is selected and the 'Do not map to genome (Library mode)' checkbox is not activated. In this case, sRNAbench only analyses microRNAs (see MicroRNA analysis).
ii. If a species is selected and the 'Do not map to genome (Library mode)' checkbox is not activated, sRNAbench uses the genome mapping mode.
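The combinations of species selection and checkbox state described above can be summarised as a small decision function. This is only an illustrative Python sketch: the names Mode and select_mode are hypothetical and not part of sRNAbench, and the one combination this manual does not describe (no species selected, checkbox activated) is rejected rather than guessed.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical labels for the three cases described in the manual.
    GENOME = "genome mapping"                           # case ii
    LIBRARY = "library mapping"                         # case i.a
    MICRORNA_ONLY = "library mapping, microRNAs only"   # case i.b


def select_mode(species_selected: bool, library_checkbox: bool) -> Mode:
    """Return the mode implied by the two controls of the Select species panel."""
    if species_selected and library_checkbox:
        return Mode.LIBRARY
    if species_selected:
        return Mode.GENOME
    if not library_checkbox:
        return Mode.MICRORNA_ONLY
    # No species + checkbox activated is not described in the manual.
    raise ValueError("combination not covered by the manual")
```

For instance, select_mode(True, False) corresponds to case ii (genome mapping), and select_mode(False, False) to case i.b.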

3.3 Adapter removal

Figure 5: Adapter removal.

sRNAbench can perform the adapter trimming. By default, the web-server version searches for the first 10 bases of the adapter (ii), allowing a maximum of 1 mismatch (iv). It is recommended to provide the adapter sequence (v) or to select one of the options offered by the application, which are the most common adapters used in microRNA analysis (iii): Illumina RA3, Illumina (alternative) or SOLiD (SREK).

If the adapter is not known, the guess the adapter sequence (i) option should be activated, although this is not recommended. sRNAbench then aligns the first 250,000 reads to the genome using the Bowtie seed functionality (the adapter does not count towards the mismatches). Out of all aligned reads, the adapter sequence is defined as the most frequent 10-mer starting at the first mismatch.

Lastly, when the adapter is sequenced at the very end of the read, the sequenced part of the adapter is sometimes shorter than the length threshold (ii), so it must be searched for recursively, without taking the minimum length into account (vi).

NOTE: recursive adapter trimming is crucial when the reads have a length of 36 bp and small RNA populations between 27 and 34 bp are to be analysed.

3.4 MicroRNA analysis

Figure 6: MicroRNA analysis.

By default, the microRNA analysis is done for all the species selected in the Select species (ii) step (i). During the analysis, sRNAbench usually also tries to assign the reads to other sRNA types. However, if no species is selected and the 'Do not map to genome (Library mode)' checkbox is not activated, as noted in Select species (i.b), sRNAbench only analyses microRNAs. In this case, the short species names of the miRBase nomenclature (like 'hsa', 'ebv', etc.) must be provided in the microRNA analysis menu (ii).

In addition, sRNAbench can try to detect putative homologous microRNAs based on sequence similarity. This option is activated in the text field Analyse homologous microRNAs (iii), either by providing a string of short species names separated by ':' (for example hsa:rno:mmu) or by typing all, which uses the entire miRBase database except the species included in (ii) and those from the Select species (ii) step. By default, this text field is empty and homologous microRNAs are not analysed.

3.5 Parameters

Figure 7: Alignment parameters.

The sRNAbench server also lets the user choose the parameters that will be used during the alignment process:

(i). Fastq input format: SOLiD (when activated) or Illumina (default).
(ii). Seed length: length of the 5' end of the read that is aligned either to the genome or to the libraries.
(iii). Read count threshold: reads with lower counts are filtered out.
(iv). Alignment type: seed alignment (n mode), or alignment of the whole read (v mode). NOTE: the seed length is ignored if the v mode is chosen.
(v). Allowed number of mismatches during the alignment.
(vi). Barcode trimming: number of nucleotides that need to be removed from the 5' end of the reads before the alignment.
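To make the effect of the barcode trimming (vi) and read count threshold (iii) parameters concrete, the following Python sketch applies both to data in the read count format. It is not the sRNAbench implementation: the function name preprocess and its arguments are hypothetical, and both the merging of identical sequences after trimming and the order of the two steps are assumptions made for illustration.

```python
from collections import Counter


def preprocess(reads, barcode_length=0, min_read_count=1):
    """Trim a 5' barcode and drop low-count reads.

    reads: iterable of (sequence, count) pairs, as in the read count format.
    Assumption: identical sequences are merged after trimming, and the
    threshold is then applied to the merged counts.
    """
    collapsed = Counter()
    for sequence, count in reads:
        trimmed = sequence[barcode_length:]  # remove the first N nucleotides
        if trimmed:  # skip reads that consist only of the barcode
            collapsed[trimmed] += count
    # Keep only sequences whose count reaches the threshold.
    return {seq: n for seq, n in collapsed.items() if n >= min_read_count}
```

For example, preprocess([("TCAAGAAAA", 3), ("TCAAGCCCC", 1)], barcode_length=5, min_read_count=2) keeps only AAAA, with a count of 3.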

3.6 Upload user files

Figure 8: Upload user library files.

The sRNAbench web server lets the user upload library files that are not available on the server, for example a new microRNA library for a species missing from the database, or libraries for other RNA types that are also not included. The libraries can be uploaded from a file on the local computer or by means of a URL; the accepted formats are fasta and bed.

3.7 Working example

Figure 9: Working example.

To show the usefulness of sRNAbench, we processed a publicly available small RNA dataset from the BC-1 cell line. BC-1 is a primary effusion lymphoma (PEL) derived from human B cells, which is caused by Kaposi's sarcoma-associated herpesvirus (herpesvirus type 8, HHV-8) and frequently also harbors Epstein-Barr virus (EBV). The expression of HHV-8 and EBV microRNAs in PELs suggests a role for these microRNAs in viral latency and lymphomagenesis.

A brief description of the protocol can be found in the dataset link: 18-25nt long small RNAs were gel purified from 50 mg total RNA and subjected to small RNA cDNA library preparation protocol with barcoded 5' adaptors (BC-1: TCAAG, BC-3: TTGGC, BCBL-1: GCCTA). Resulting PCR products were purified from 10% TBE gels, pooled and sequenced on one lane on the Illumina GAII platform.

Based on the dataset description and the protocol, we process this dataset as follows:

Copy and paste the BC-1 dataset link into the URL input data textbox (see Input data).
As the dataset could contain miRNAs from three different species, all of them must be chosen in the Select species menu: human (hg19), Epstein Barr Virus (NC_007605) and Human herpesvirus 8 (KSHV) (NC_009333).
The sRNAs were sequenced on the Illumina GAII platform, but we cannot be sure which adapter was used, so the guess adapter option is activated in the Adapter removal options.
Three different barcodes with a fixed length of 5 nt were added to the 5' end of the reads. These barcodes must be trimmed before the reads are aligned, so the remove barcode option in the Parameters section is set to 5.
The reads are 36 nt long, of which 5 nt correspond to the barcode, so at most 31 nt carry useful information. We therefore cannot use the default minimum adapter length, as this would restrict the profiling to small RNAs of at most 31 nt - 10 nt = 21 nt. Instead, we set the minimum adapter length to 6 nt, which allows the profiling of small RNAs of up to 31 nt - 6 nt = 25 nt. As this adapter length is quite short, the allowed maximum number of mismatches in adapter detection is reduced to 0.

4 Differential Expression

Figure 10: Differential expression analysis.

The sRNAbench differential expression analysis is based on the edgeR package. Although the standalone version has numerous parameters, the web-server version runs the analysis with its default parameters (see the standalone manual). First of all, each sample must be processed independently, and the user should keep the sRNAbench ID of each run. The web version needs at least two samples for each group that will be compared (for example, in a case/control study). Once each sample has been processed with sRNAbench, the IDs of the analyses are entered in the differential expression text field (i). As shown in the example, the groups to be compared must be separated by '#' and the sample IDs within each group by ':'.
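The layout of the differential expression text field (groups separated by '#', sample IDs within a group separated by ':') can be illustrated with a short Python sketch. The function names build_de_query and parse_de_query, and the sample IDs used below, are hypothetical. The check for at least two samples per group follows the requirement stated above, while the check for at least two groups is an assumption based on the fact that a comparison needs two sides.

```python
def build_de_query(groups):
    """Join sample IDs with ':' within a group and groups with '#'."""
    if len(groups) < 2:
        # Assumption: a comparison needs at least two groups.
        raise ValueError("at least two groups are needed for a comparison")
    for group in groups:
        if len(group) < 2:
            raise ValueError("each group needs at least two samples")
        for sample_id in group:
            # IDs must not be empty or contain one of the two separators.
            if not sample_id or "#" in sample_id or ":" in sample_id:
                raise ValueError(f"invalid sample ID: {sample_id!r}")
    return "#".join(":".join(group) for group in groups)


def parse_de_query(text):
    """Split a query string back into a list of groups of sample IDs."""
    return [group.split(":") for group in text.split("#")]
```

With the placeholder IDs case1, case2, ctrl1 and ctrl2, build_de_query([["case1", "case2"], ["ctrl1", "ctrl2"]]) gives the string case1:case2#ctrl1:ctrl2.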

5 Helper Tools

The sRNAbench suite is completed with some useful tools that convert common file formats (NCBI, Ensembl and GtRNAdb) to the format accepted by sRNAbench, which is a simple fasta file with the transcript name and its classification separated by ':'.

EXAMPLE:
>NR_031589:microRNA
GCGTTGGCTGGCAGAGGAAGGGAAGGGTCCAGGGTCAGCTGAGCATGCCCTCAGGTTG
CTCACTGTTCTTCCCTAGAATGTCAGGTGATGT

NOTE: Please remember to cite any RNA annotation source that you use in your research: the Ensembl database, the NCBI RefSeq database, the Genomic tRNA Database, or others.

5.1 Parse Ensembl Fasta Files

Figure 11: Parse Ensembl Fasta Files.

The Ensembl format is also a simple fasta format, but with a more complex identifier than the sRNAbench one. To convert the Ensembl format, either a file (i) or a URL (ii) pointing directly to the Ensembl FTP database can be provided (for example, the annotated cDNA for Canis lupus familiaris). The Ensembl plant format is slightly different, so for plant annotations the Ensembl Plant? (iii) option should be chosen.

5.2 Parse NCBI RefSeq

Figure 12: Parse NCBI RefSeq.

As with the Ensembl parse tool, the NCBI file can be provided from the local computer (i) or by means of a URL (ii). The NCBI RefSeq annotation can be obtained from the NCBI FTP database (for example, the annotated RNAs of Bos taurus). In this case, to include the RNA classification in the sRNAbench file, the Add RNA classification (iii) option must be switched on.

5.3 Parse tRNA from the Genomic tRNA Database

Figure 13: Parse tRNA from the Genomic tRNA Database.

To include tRNA information in the analysis, a tool is available that downloads the Genomic tRNA Database information for the species provided in the text field (i). Within each species name, the words must be separated by '_', for example: Homo_sapiens, Mus_musculus, etc.
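All three helper tools produce the same target format described at the start of this section: a fasta header of the form >name:classification followed by the sequence. As a rough illustration of that format only, and not of how the helper tools work internally, the Python sketch below formats and reads such records. The function names format_record and parse_header and the line width of 60 are hypothetical choices.

```python
def format_record(name, classification, sequence, width=60):
    """Format one record as '>name:classification' plus wrapped sequence lines."""
    if ":" in name:
        # ':' is the separator, so it cannot appear inside the name.
        raise ValueError("the name must not contain ':'")
    lines = [f">{name}:{classification}"]
    # Wrap the sequence at a fixed width (the width is an arbitrary choice).
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


def parse_header(header):
    """Split '>name:classification' into (name, classification)."""
    name, _, classification = header.lstrip(">").strip().partition(":")
    return name, classification
```

Applied to the header of the example above, parse_header(">NR_031589:microRNA") returns the pair ("NR_031589", "microRNA").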