Table of Contents

4.1. DNA Sequencing
    4.1.1. Trace Viewer in GCG SeqLab
        Box. Select the editor mode in the SeqLab main window.
        Import sequencer trace files from the File menu.
        Select the trace files in ABI or SCF format.
        Open the traces window to display the traces.
        The trace viewer window and the editor window.
        Table. IUB standard nucleotide codes and their implementation in GCG/Staden.
    Online resources
4.2. Contig Assembly
    4.2.1. What is a contig?
    4.2.2. What are the contig assembly programs?
    4.2.3. Workflow of contig assembly in GCG
    4.2.4. Strategy for correct contig assembly
        Box. Workflow of contig assembly in GCG
    Online resources
References

Simon Lin, M.D.
Duke Bioinformatics
919-681-9646
http://www.canctr.mc.duke.edu/bioinformatics
Version: 10-18-1999
4.1. Introduction to DNA Sequencing

The ABI sequencer (PE-Applied Biosystems, Foster City, CA) is one of the automated DNA sequencers routinely used in molecular biology labs. All modern DNA sequencing relies on the Sanger method of DNA replication with dideoxy chain termination. The ABI sequencer uses a scanning laser to detect the fluorescence-labeled products as they electrophorese through a denaturing polyacrylamide gel. The collected signals are plotted as a series of colored peaks that represent the nucleotide sequence; this plot is called a chromatogram or electropherogram.

The process of transforming the chromatogram into a sequence is referred to as base calling. Usually, base calling is done at the sequencing facility. However, under some special circumstances, such as SNP detection, you might want to use specialized programs to process the chromatogram yourself. Phred (a base caller) and PolyPhred (which detects SNPs from Phred output) are such programs.

Usually, the results from the sequencing facility contain two kinds of files: the sequences and the chromatograms. The sequence file is in text format and can be easily viewed and analyzed in any bioinformatics program, whereas a trace viewer is needed to view the chromatogram in ABI format.

Figure. An example of the data produced by an automated sequencer. The peaks, in a different color for each base, are read directly from left to right to determine the sequence.

4.1.1. Trace Viewer in GCG SeqLab

To use the trace viewer in SeqLab, you need an X-windows emulator. A free X-windows emulator can be obtained from the Duke OIT software library.
Select the editor mode in the SeqLab main window. Import sequencer trace files from the File menu.
Select the trace files in ABI or SCF format. In this example, 1012721 and 1012722 are trace files from an ABI sequencer, whereas 1012721.seq and 1012722.seq are the corresponding nucleotide sequence files. Click OK to import the sequence trace into the editor. Repeat this process until all the trace files you want are imported, then click Cancel to exit the file selection box.
Open the traces window to display the traces. From the Windows menu, choose Traces. Then the trace viewer window should appear.
The trace viewer window and the editor window.

IUB/GCG   Meaning            Complement   Staden/Sanger
A         A                  T            A
C         C                  G            C
G         G                  C            G
T or U    T                  A            T
M         A or C             K            M
R         A or G             Y            R
W         A or T             W            W
S         C or G             S            S
Y         C or T             R            Y
K         G or T             M            K
V         A or C or G        B            V
H         A or C or T        D            H
D         A or G or T        H            D
B         C or G or T        V            B
X/N       G or A or T or C   X/N          N
. or ~    gap character      ./~          -

Table. IUB standard nucleotide codes and their implementation in GCG / Staden.
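The Complement column of the table can be put to work directly. The following is a minimal Python sketch, not part of GCG or the Staden package, that reverse-complements a sequence containing IUB ambiguity codes. The function name reverse_complement and the dictionary IUB_COMPLEMENT are illustrative names chosen here, and treating U like T is an assumption based on the "T or U" row.

```python
# Complement pairs transcribed from the IUB/GCG and Complement columns above.
# Assumption: U is complemented like T, following the "T or U" row.
IUB_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "X": "X", "N": "N", ".": ".", "~": "~",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a sequence written in IUB/GCG codes."""
    out = []
    for base in reversed(seq):
        comp = IUB_COMPLEMENT.get(base.upper())
        if comp is None:
            raise ValueError(f"not an IUB/GCG symbol: {base!r}")
        # Preserve the case of letters; gap characters have no case.
        if base.isalpha() and base.islower():
            comp = comp.lower()
        out.append(comp)
    return "".join(out)


if __name__ == "__main__":
    print(reverse_complement("AAMR"))   # ambiguity codes are complemented too
    print(reverse_complement("acgN."))  # lowercase and gap symbols pass through
```

Ambiguity codes complement to other ambiguity codes (M to K, R to Y, and so on), which is why a plain A/C/G/T lookup is not enough for sequences that still contain uncertain base calls.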
Online Reference

Chapter 3. Working with the Trace Viewer. GCG SeqLab Guide.
http://www.canctr.mc.duke.edu/bioinformatics/gcg_documents/seqlab/03-working_with_trace_viewer.pdf

X-Windows emulator from the Duke OIT software library.
http://www.oit.duke.edu/site/html/body_micro_x-win32.html

4.2. Introduction to Contig Assembly

Contig assembly is a critical step in genome sequencing projects: it puts the jigsaw pieces of fragmented sequences together. As the shotgun sequencing strategy has been adopted in many sequencing projects, contig assembly has become an area of active research. Even if you never work on a large sequencing project, your knowledge of bioinformatics is not complete without the basic concepts of sequence assembly.

4.2.1. What is a contig?

Contig stands for contiguous sequence. The term was first used by Staden (1980). Dr. Roger Staden is a pioneer in the study of fragment assembly, and his work remains the basis of most sequence assembly programs today. A contig is a collection of overlapping fragments; it includes an assembled consensus sequence for the entire group and the information for each individual sequence fragment. A contig assembly program detects the overlaps among many small sequence fragments and builds a longer, contiguous consensus sequence from them.

4.2.2. Contig Assembly Programs

TIGR Assembler
Has been used in a number of megabase microbial genome projects at TIGR (Sutton, G., White, O., Adams, M., and Kerlavage, A. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Technology 1:9-19).
http://www.tigr.org/softlab/

Phred/Phrap/Consed
Includes a base caller, an assembler, and an X-windows graphical interface. Authored by Phil Green at the University of Washington.
http://bozeman.genome.washington.edu/index.html

Staden Package
Complete package for sequencing, mutation detection, and sequence management, with an X-windows interface.
http://www.mrc-lmb.cam.ac.uk/pubseq/

(If you need any of the software above, please contact Simon Lin at 919-681-9646.)

GCG Package
Fragment Assembly System (FAS) in GCG.
http://www.canctr.mc.duke.edu/bioinformatics/gcg_documents/gcg10_help_unix/contents/gelintroduction.html

4.2.3. Workflow of contig assembly in GCG

Box. Workflow of contig assembly in GCG. Programs shown: GelStart, GelEnter, GelMerge, GelAssemble, GelDisassemble, GelView. FAS database directories shown: /archive, /working, /consensus, /relation.
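Before the individual GCG programs are described, the core idea behind any automatic assembler (find an overlap between two fragments, then join them into one contig) can be sketched in a few lines of Python. This is a toy illustration only and is not the algorithm used by gelmerge or any other package listed above: it requires an exact match between the end of one fragment and the start of the next, whereas real assemblers tolerate mismatches and also try the reverse-complement strand. The names longest_overlap, merge_pair, and min_len are hypothetical.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two fragments into one contiguous sequence if they overlap."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None  # no overlap found; the fragments stay separate
    # Keep `left` whole and append only the part of `right` beyond the overlap.
    return left + right[size:]


if __name__ == "__main__":
    frag_a = "ATGCCGTA"
    frag_b = "CGTAGGCT"
    print(longest_overlap(frag_a, frag_b))  # overlap on "CGTA"
    print(merge_pair(frag_a, frag_b))
```

The min_len threshold matters even in this toy version: very short overlaps occur by chance, so an assembler that accepts them will join fragments that do not belong together.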
After gelstart creates the fragment assembly project, gelenter enters the fragments into the FAS database. Gelmerge is the automatic contig assembler. Gelassemble is the post-assembly editing tool, used to resolve by manual inspection the ambiguities and conflicts left by the automatic process. Gelview generates an overall view of the current status of the assembly project.

Use common sense when you edit the contigs. Three kinds of errors can be corrected by manual editing: uncertainties, substitution errors, and frame shift errors. A frame shift error is the most serious, since it completely changes the deduced protein sequence from the position of the error onward. This kind of error can usually be corrected by inspecting the chromatograms of the aligned fragments.

To get the consensus sequence of the contig, go to the command mode of gelassemble. The command prettyout writes an aligned output of the fragments and the consensus, similar to that of the GCG program pretty. If only the consensus sequence is needed, the command seqout can be used to generate the output.

The program geldisassemble takes apart all assembled contigs in the current project and rebuilds a database consisting of the unjoined fragments.

Although the details might vary, this workflow in GCG is generally applicable to all contig assembly software.

4.2.4. Hints and Common Mistakes

Run gelstart once to create (or delete) a sequence assembly project, and run it again at the start of every session in which you work on that project.

GCG uses a fragment assembly system (FAS) database to handle each assembly project. All files and directories under the project directory are in FAS database format. Do not add or delete files in these directories yourself; doing so will corrupt the FAS database. Use gelenter to add more fragments, and gelmerge, gelassemble, and geldisassemble to modify the file contents. Remember, do not manipulate any file in the database with a UNIX text editor!
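The reason frame shift errors deserve the most attention during manual editing can be shown with a short Python sketch. It is an illustration only and uses nothing from GCG: the helper translate, the table CODON_TABLE, and the example sequence are made up here, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate DNA in reading frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    correct = "ATGGCTGAACGTTACAAATAA"      # hypothetical example sequence
    shifted = correct[:3] + correct[4:]    # one base lost, as in a miscalled read
    print(translate(correct))  # MAERYK*
    print(translate(shifted))  # MLNVTN
```

A single missing base changes every amino acid after the error and removes the stop codon from the reading frame, whereas a substitution error would alter at most one residue.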
You can manually resolve the discrepancies and correct the assembly errors by using gelassemble. You can also correct base-calling errors in the fragments if you have a graphical printout of the ABI traces in hand.

References

Staden, R. 1980. A new computer method for the storage and manipulation of DNA gel reading data. Nucleic Acids Res. 8: 3673-3694.
Sutton, G., White, O., Adams, M., and Kerlavage, A. 1995. TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Technology 1: 9-19.