Workshop I

Introduction

In this workshop you will be taken through

1. The typical steps you have to follow to identify a biosynthetic gene cluster of interest in the contigs resulting from the high throughput sequencing of a microbial genome.
2. Processing and visualizing data from a ChIP-seq experiment to determine regions of the genome where a DNA binding protein binds.

The computers in the computer room in the IUC are connected to a Linux server on which accounts have been created for you. All the software required for the tasks of this workshop is installed on this server and is accessible to you. A Linux server is used because some of the software you will be using in the workshop is not available for MS Windows.

The files related to the two different parts of this workshop have been separated into two directories.

1. genome: In this directory you will be doing things related to the Next Generation Sequencing (NGS) data of a new bacterial genome.
2. ChIP-Seq: Here you will be doing things related to the analysis of ChIP-seq data.

To list files and directories in the current directory use the command ls.

    ls        List the files in the current working directory.
    ls -l     List the files in the long format. In this output you can see, amongst other information, the sizes of the listed files.
    ls -lh    Same as ls -l except that the file sizes are shown in a human friendly format.

The table below shows the commands you use for moving from one directory to another.

    pwd            Display the current working directory.
    cd             Go to your home directory.
    cd genome      Go to the genome (sub)directory from your home (parent) directory.
    cd ..          Go to the parent directory of your current directory.
    cd ChIP-Seq    Go to the ChIP-Seq (sub)directory from your home (parent) directory.

All files which will be used as input to different programs are plain text files.
All output files produced by different programs, with the exception of the BLAST database and Bowtie2 index files, are also plain text files. You are encouraged to view both the input and output files to see what they really look like. You may use the file viewer less to view files. You start less as shown below.

    less file_to_view

less wraps long lines by default but sometimes it is better not to wrap them. While you are in less you may press "-S" (dash followed by capital S) to prevent less from wrapping long lines. To move up and down the file you can use the arrow keys. You might find it faster to use the "Page Up" and "Page Down" keys to scroll through large files quickly. To exit from less just press "q".

For very long files it might be better to just see the top and/or bottom of the files. You can use the programs head and tail for doing so.

    head file_to_view

    tail file_to_view

Use PuTTY to connect to the server ec2-46-137-34-59.eu-west-1.compute.amazonaws.com as the user mentioned on the label on your computer. We may have already done this for you. Note that the name of the server to connect to during the actual workshop will be different from the one mentioned above. The two screenshots below show approximately what you should get on your screen.
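Returning to head and tail: both print ten lines by default, and the -n option changes that number. The sketch below is only an illustration. It builds a small throwaway file on the spot (hypothetical demo data, not one of the workshop files) so that it can be run anywhere, and the function names are made up for this example.

```shell
#!/bin/sh
# Create a throwaway file with 20 numbered lines (hypothetical demo data).
demo=$(mktemp)
seq 1 20 | sed 's/^/line /' > "$demo"

# Print only the first 3 lines of a file.
first_lines() { head -n 3 "$1"; }

# Print only the last 2 lines of a file.
last_lines() { tail -n 2 "$1"; }

first_lines "$demo"
last_lines "$demo"

# less can also be started with long-line wrapping already switched off:
#   less -S file_to_view

rm -f "$demo"
```

On the server the same options can be used directly, for example head -n 3 file_to_view.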

Unless otherwise mentioned, all the work listed below should be carried out in the terminal window shown in the screenshot above. Any commands given in this window will run on the remote Linux server.

In the instructions below, commands that you have to type are shown in brown colour while output of commands is shown in green colour. Filenames are also shown in blue colour. It is possible to select and copy commands from here and paste (Control-V) them into the terminal. You may have to press enter after you have pasted the command into the PuTTY terminal to set the command running. If you find that Control-V does not paste whatever you have copied then try right clicking in the PuTTY terminal. Right clicking should paste your copied command into the PuTTY terminal.

Remember, Linux is case sensitive. "Ls" does not mean the same thing as "ls".

In the commands mentioned below there is a blank line between two commands. Sometimes you will see commands like

    addseqtoprodigal.pl -gbkfile prodigal_out.gbk \
    -fasta all_contigs.fna > prodigal_with_seq.gbk

In such cases the backslash on the first line has been used to break the command over two lines. The backslash is important and it should be the last character on the first line. Copy and paste such commands exactly as they are written (i.e. both lines at the same time).

Linux keeps a history of the commands you have typed at the terminal, so you can recall older commands by pressing the up arrow key. To run a recalled command simply press enter. You may edit the command before pressing enter to run a variation of the old command.

Now we are ready to start the first part of this workshop.

High Throughput Genome Sequencing

The sequencers produce reads which are then assembled into contigs by an assembler. Contigs vary in length and for a typical actinobacterial genome you might get 400 to 2000 contigs depending upon the quality of reads and the quality of assembly. In this workshop we will

1. Use a GC frameplot based gene prediction tool to predict genes in all the contigs.
2. Make a BLAST database of the contigs and use it to identify those contigs which possibly have NRPS gene(s).
3. Process the output produced when we submit our collection of contigs to an automated annotation server (RAST).
4. Inspect the RAST annotation for a selected contig in Artemis.

We start by changing to the appropriate directory.

    cd           # to change to your home directory

    cd genome    # to change to the genome directory

    ls           # to list the files in the genome directory

You should see the following three files listed.

    all_contigs.fna  nrpsprobe.faa  prodigal_rast.gbk

Use the less command to view all_contigs.fna.

    less all_contigs.fna

When you have finished viewing the file, press "q" to exit from less. The file all_contigs.fna is a fasta file containing the nucleotide sequences of all the contigs we have got from the assembler. This file is the starting material for this part of the workshop.

Now we will use prodigal to identify the protein coding sequences in the contigs.

    prodigal -i all_contigs.fna -o prodigal_out.gbk

Wait for the above command to finish then

    ls

In addition to the other files, you should now have a file named prodigal_out.gbk. Have a look at this file using the text file viewer less.

    less prodigal_out.gbk

In this file, Prodigal has written, in Genbank format, where it has predicted protein coding regions to be in all of our contigs. This file does not contain any nucleotide sequence but the Genbank file we want to submit to RAST must contain nucleotide sequence as well. So we have to add the respective nucleotide sequences to this file. To do so we use a Perl script addseqtoprodigal.pl.

    addseqtoprodigal.pl -gbkfile prodigal_out.gbk \
    -fasta all_contigs.fna > prodigal_with_seq.gbk

    less prodigal_with_seq.gbk
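As a quick check on the starting material, it can be useful to know how many contigs a fasta file holds. Every sequence in a fasta file starts with a header line beginning with ">", so counting those lines gives the number of sequences. The sketch below is a minimal illustration: it runs on a tiny made-up fasta file (the contig names and sequences are hypothetical) and count_contigs is a name invented for this example.

```shell
#!/bin/sh
# Count the sequences in a fasta file: each record starts with a '>' header line.
count_contigs() { grep -c '^>' "$1"; }

# Toy fasta file with three made-up contigs (not workshop data).
toy=$(mktemp)
cat > "$toy" <<'EOF'
>toy_contig_1
ATGCGC
>toy_contig_2
GGCCTTAA
GGCC
>toy_contig_3
TTGACA
EOF

count_contigs "$toy"    # prints 3

rm -f "$toy"
```

On the server the same one-liner would be grep -c '^>' all_contigs.fna.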

The file prodigal_with_seq.gbk is fit for submission to the RAST annotation server. RAST usually takes about one day to return the results, so the output which comes back from RAST has already been placed in the file prodigal_rast.gbk in the current directory. Have a look at this file.

    less prodigal_rast.gbk

We also want to be able to check whether proteins similar to our proteins of interest are encoded in the genome we have just sequenced. To do this, we make a BLAST database of all the contigs and then search this database using our proteins of interest (from other organisms) as the query. To make a BLAST database we use the command makeblastdb.

    makeblastdb -in all_contigs.fna -dbtype nucl \
    -title "Summer School 2012 example" -out example

The above command makes a BLAST database named example.

    ls

    all_contigs.fna  example.nin  nrpsprobe.faa     prodigal_rast.gbk
    example.nhr      example.nsq  prodigal_out.gbk  prodigal_with_seq.gbk

The BLAST database consists of three files: example.nhr, example.nin and example.nsq.

To search the database just made with a nucleotide query we can use the command blastn. But our query is a protein sequence so we will use the command tblastn. Our query sequence is the amino acid sequence of an NRPS (SCO3230 from Streptomyces coelicolor).

    tblastn -query nrpsprobe.faa -db example -evalue 1e-10 -out blast.out

    tblastn -query nrpsprobe.faa -db example -evalue 1e-10 \
    -outfmt 7 -out tabular.out

Both the above commands run tblastn but they produce output in the normal BLAST format and in the tabular format respectively.

    less blast.out

    less tabular.out

The tabular output above does not show the lengths of the hits, which is something we might be interested in. So we will use the script blasttable.pl to make a table from blast.out which contains the lengths of hits. This script also sorts the hits in descending order by hit length.
    blasttable.pl -in blast.out > blast_sorted

    less blast_sorted

From the contents of the file blast_sorted we can see that contig_546, contig_43 and contig_83 are three long contigs encoding NRPS-like proteins. We now extract them from the large Genbank contig file for viewing in Artemis.
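As an aside before the extraction: BLAST+ can also write a tabular output with columns of your choice, by following the format number in -outfmt with column names such as sseqid (subject ID), slen (subject length), length (alignment length), evalue and bitscore. The sketch below assumes that the length of interest is the subject (contig) length, which may not be exactly what blasttable.pl reports. The tblastn line is shown only as a comment, custom.out and sort_by_slen are names invented here, and the sorting step runs on made-up rows rather than real results.

```shell
#!/bin/sh
# On the server, a table with chosen columns could be produced like this:
#   tblastn -query nrpsprobe.faa -db example -evalue 1e-10 \
#     -outfmt "6 sseqid slen length evalue bitscore" -out custom.out

# Sort a tab-separated table by its second column (slen), largest first.
sort_by_slen() { sort -t "$(printf '\t')" -k2,2nr "$1"; }

# Made-up rows for illustration only: sseqid, slen, length, evalue, bitscore.
toy=$(mktemp)
printf 'toy_a\t1200\t300\t1e-20\t95.1\n'  > "$toy"
printf 'toy_b\t54000\t900\t1e-80\t310\n' >> "$toy"
printf 'toy_c\t8000\t450\t1e-35\t150\n'  >> "$toy"

sort_by_slen "$toy"

rm -f "$toy"
```

Format 6 is used instead of 7 in this sketch because it omits the comment lines, which keeps the file easy to sort.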

The extraction is done with the script onegbk.pl, once for each contig.

    onegbk.pl -in prodigal_rast.gbk -wanted contig_546 > contig_546_rast.gbk

    onegbk.pl -in prodigal_rast.gbk -wanted contig_43 > contig_43_rast.gbk

    onegbk.pl -in prodigal_rast.gbk -wanted contig_83 > contig_83_rast.gbk

You will be running Artemis not on the Linux server, where you have been doing all the work until now, but on the Windows computer you are using here. To download the three Genbank files to your local computer, right click on the three links below and save the files locally. Windows will try to save these files as .html files. Make sure you save them as .gbk files.

1. contig_546_rast.gbk
2. contig_43_rast.gbk
3. contig_83_rast.gbk

For the purpose of this workshop we will focus on contig_83_rast.gbk. If time permits, you can also view the other two files in the same way.

Now start Artemis by clicking here and open contig_83_rast.gbk in Artemis. In the menu, click on "View" then "CDS Genes And Products". A new window will open. Double clicking on a gene in the new window shows that gene in the main window. Can you find the gene which was picked up in the BLAST search with the NRPS probe? Also make sure you try the GC Frameplot found in the Graphs menu. What do you think the product of this gene cluster is?

Figure: Artemis main window

Figure: Artemis CDS Genes and Products window

ChIP-seq

ChIP-seq data comes in the form of reads from the sequencer and usually you have more than one set of reads because you want to make a comparison between two (or more) strains. In the example we are doing today we have used the antibody against the BldM protein of S. venezuelae for the chromatin immunoprecipitation of DNA from two strains.

1. Wild type strain of Streptomyces venezuelae
2. bldM deletion mutant of Streptomyces venezuelae

The reads are in the form of fastq files. These reads are aligned to the reference genome using aligner software. Usually we get a lot of reads, therefore the alignment takes a long time and significant computing power. The tool we use for alignment is known as Bowtie2. Alignment tools take both the sequence of the reads and the quality of the reads into account when doing the alignment.

Before we can use bowtie2 to align our reads to our reference genome we need to make a bowtie2 index for the reference genome using the command bowtie2-build. bowtie2-build reads fasta files, not Genbank files, so we need to convert our Genbank file to a fasta file before calling bowtie2-build.

    gbk2fasta.pl -in SvSection.gbk -out sven.fna

    bowtie2-build sven.fna sven

    ls

In addition to other files, you should see the following six files listed. Together these six files make up the bowtie2 index which bowtie2 uses to carry out alignments of reads.

    sven.1.bt2  sven.2.bt2  sven.3.bt2  sven.4.bt2  sven.rev.1.bt2  sven.rev.2.bt2

Now that we have built the index of the reference genome we can align the two sets of reads to the reference genome using bowtie2.

    bowtie2 --phred64 -x sven -U bldm.fastq -S bldm_bowtie2

    bowtie2 --phred64 -x sven -U dbldm.fastq -S dbldm_bowtie2

Now you should have two more files in your directory.

    bldm_bowtie2  dbldm_bowtie2

    head bldm_bowtie2

The output of bowtie2 is in Sequence Alignment/Map (SAM) format, which is a standard format for storing the alignment of large numbers of reads to reference nucleotide databases.

Now we process the SAM format files produced above to get graph files which can be used to view the coverage in the Integrated Genome Browser (IGB).

    bowtie2track.pl -in bldm_bowtie2 -out bldm.gr

    bowtie2track.pl -in dbldm_bowtie2 -out dbldm.gr

Inspect the tops of the two graph files produced.

    head bldm.gr

    head dbldm.gr

We will use IGB to view the graph files produced above. We also need to see the reference genome in IGB at the same time so that we can see where the peaks of binding are in relation to the features in the genome. For this, we need to produce a BED file containing the features of the genome.

    gbk2bed.pl -in SvSection.gbk -out sven.bed

The above will produce sven.bed which can be read by IGB. Now we have a BED file and two graph files. We are ready to start IGB. You will be running IGB on the local machine so the BED and graph files have to be downloaded to your local machine. To download the files to your local computer, right click on the four links below and save the files locally. Remember where you save them. If Windows tries to save them as .html files, make sure you save them with the extensions shown below, otherwise IGB will fail to recognize them.

1. sven.bed
2. sven.fa
3. bldm.gr

4. dbldm.gr

Start IGB by clicking here. It takes a while to start. Please be patient.

Referring to figure 1 below, do the following.

1. First open the file sven.bed.
2. Bring the horizontal zoom slider at the top of the window to about the middle of the scale. In the Data Access tab below change the Load Mode from Region to Chromosome.
3. Press the "Load Data" button in the lower right corner.
4. Then open the two graph files, bldm.gr and dbldm.gr. You can open them at the same time.
5. For both the graph files, in the Data Access tab, change the Load Mode from Region to Chromosome and press the "Load Data" button.
6. Do the same for sven.fa.
7. You should now see something similar to the screen shot in Figure 1.

Figure 1

Referring to figure 2 below, do the following.

1. Select the two graph tracks by pressing shift and clicking on the boxes on the left of the tracks. Arrowheads should now appear to the right of the two boxes you clicked on.
2. With both tracks selected, go to the Graph Adjuster tab below then
   a. Change the style to Bar
   b. Change the Max value of the Y Axis Scale to 70

Now your IGB window should look something like the screen shot below.

Figure 2

Using the "Horizontal Scroll Slider", bring the gene SVEN_4453 into view. Can you tell why the track "dbldm" is completely blank where SVEN_4453 is? Try to use the search tab to search for the genes SVEN_4151 and SVEN_4210. SVEN_3937 is an example where a peak is present in both the wild type and the bldM deletion mutant. The region around the gene SVEN_3966 is difficult to interpret. See what other peaks (good and bad) you can find.
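When comparing peak heights between two tracks it can also help to know roughly how many reads aligned in each SAM file. In SAM, header lines start with "@", and the second column of every read line is a numeric FLAG in which the bit with value 4 marks an unmapped read. The sketch below is a minimal illustration of counting mapped reads with awk. It runs on made-up records (the read names, positions and sequences are hypothetical) and count_mapped is a name invented for this example.

```shell
#!/bin/sh
# Count read lines whose FLAG (column 2) does not have the "unmapped" bit (value 4) set.
# Header lines, which start with '@', are skipped.
count_mapped() {
  awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$1"
}

# Toy SAM file: one header line, two mapped reads (FLAG 0 and 16), one unmapped read (FLAG 4).
toy=$(mktemp)
printf '@SQ\tSN:toy_ref\tLN:1000\n'                          > "$toy"
printf 'r1\t0\ttoy_ref\t10\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'  >> "$toy"
printf 'r2\t16\ttoy_ref\t50\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> "$toy"
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\n'          >> "$toy"

count_mapped "$toy"    # prints 2

rm -f "$toy"
```

On the server the function would be applied to bldm_bowtie2 and dbldm_bowtie2. bowtie2 also prints an alignment summary to the terminal when it finishes, which gives similar information.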