DNA Sequence Analysis

Two general kinds of analysis:
-Screen for one of a set of known sequences.
-Determine the sequence, even if it is novel.

Screening for a known sequence usually involves an oligonucleotide probe. Examples:
-Species-specific PCR primers.
-Fluorescent probes for real-time PCR.
-Fixed probes for binding PCR product to a membrane (e.g., a dot blot system) or a microarray chip.
Medical scientists now routinely use this sort of approach to screen for known genotypes (e.g., an allele associated with a disease). The more high-tech methods they have developed are starting to diffuse into forensic science.

Microarrays

Silicon microchips can be constructed with a huge array (thousands) of attached probes. Hybridization can be enhanced electronically (template molecules clamped on or kicked off). Detection is usually optical, e.g., fluorescently labeled PCR product.
The microarray has become a standard tool of medical and other biological research. A common use is to study gene expression using reverse transcriptase PCR product.

Another screening approach: link to real-time measurement of PCR amplification. [Recall DNA extract quantitation by real-time PCR. Also note the potential confusion over the acronym RT-PCR, which is used for both reverse transcriptase PCR and real-time PCR.]
These are methods for detecting a polymorphism within an amplicon. The signal that hybridization has occurred is a function of the amount of PCR product that has been produced. This is measured while the reaction cycles are taking place (real time) or when the program is complete.

An exonuclease probe relies on the 5' nuclease activity of Taq. The signal of the attached fluorophore cannot be detected while it is in close proximity to the quencher. The two are separated when Taq chews up the probe. EXAMPLE: Various TaqMan products sold by Applied Biosystems.

A molecular beacon acts in a similar manner; however, the probe is not destroyed.
The oligonucleotide probe is designed to have a higher annealing temperature than the primers. Experimental conditions are such that the probe binds only if it is a perfect match. The amount of fluorescence is a function of the amount of PCR product that has been produced. The fluorescent agent can be one of the probe types just described. This makes it possible to do a multiplex reaction.

A simpler alternative is a specific primer assay, i.e., the primer annealing site covers the polymorphic base position, and the primer will bind only if the allele of interest is present. For this one could use a general-purpose dye such as SYBR Green, which fluoresces only when bound to dsDNA.
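As a rough illustration of how fluorescence tracks the amount of PCR product in a real-time assay, the sketch below models product doubling each cycle until a plateau, and reports the first cycle at which the signal crosses a detection threshold. The function names, the perfect efficiency, the plateau value, and the threshold are all assumptions chosen for illustration; they do not describe any particular kit or instrument.

```python
def amplification_curve(initial_copies, cycles=40, efficiency=1.0, plateau=1e12):
    """Copies after each cycle: multiply by (1 + efficiency), capped at a plateau.

    The plateau is a crude stand-in for reagent exhaustion (an assumption).
    """
    copies = float(initial_copies)
    curve = []
    for _ in range(cycles):
        copies = min(copies * (1.0 + efficiency), plateau)
        curve.append(copies)
    return curve


def threshold_cycle(curve, threshold=1e9):
    """Return the first cycle (1-based) at which the signal reaches the threshold."""
    for cycle, copies in enumerate(curve, start=1):
        if copies >= threshold:
            return cycle
    return None  # threshold never reached


# Two hypothetical samples that differ tenfold in starting copy number.
low = amplification_curve(1_000)
high = amplification_curve(10_000)
print(threshold_cycle(low), threshold_cycle(high))  # prints: 20 17
print(low[-1] == high[-1])                          # prints: True
```

Under these assumptions the sample with more starting template crosses the threshold about three cycles earlier, while both samples end at the same final amount. That is why the early, exponential part of the curve is informative and the end point is not.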
The amount of product reaches the same plateau for a range of initial DNA copy numbers.

Figure (Applied Biosystems Quantifiler kit): Rn = normalized reporter signal.

Determination of a potentially novel allele sequence

A variety of methods infer the sequence from the identity of the final nucleotide on each of a series of fragments separated according to length.
Cycle sequencing

A process that resembles PCR (a primer, repeated temperature cycles, thermostable polymerase, etc.). One big difference is that only one primer is used, so only one strand is copied. This produces an arithmetic rather than an exponential increase in product. Therefore PCR product is used as the template, so that enough template copies are present at the start.

If only a (secret) fraction of the nucleotides are ddNTPs labeled with a dye, then all possible lengths of DNA copies will be made, because each incorporated ddNTP terminates the copy.

Processed and electrophoresed molecules, with the 3' nucleotide color-coded:

Template (actually the complement strand)
Primer +
AGTTTTGGCTCGAACACGTCACAGCCTTTAA
AGTTTTGGCTCGAACACGTCACAGCCTTTAA
AGTTTTGGCTCGAACACGTCACAGCCTTTA
AGTTTTGGCTCGAACACGTCACAGCCTTT
AGTTTTGGCTCGAACACGTCACAGCCTT
AGTTTTGGCTCGAACACGTCACAGCCT
AGTTTTGGCTCGAACACGTCACAGCC
AGTTTTGGCTCGAACACGTCACAGC
AGTTTTGGCTCGAACACGTCACAG
AGTTTTGGCTCGAACACGTCACA
AGTTTTGGCTCGAACACGTCAC
AGTTTTGGCTCGAACACGTCA
AGTTTTGGCTCGAACACGTC
AGTTTTGGCTCGAACACGT
etc.
electropherogram ->
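The ladder idea can be sketched in a few lines: generate every chain-terminated copy, sort the copies by length as electrophoresis would, and read off the final (dye-labeled) base of each. The split of the slide's sequence into a 19-base "primer" and a 12-base extension is a hypothetical choice made only so the example runs, and the function names are illustrative. The last two functions contrast the arithmetic product increase of single-primer cycling with idealized PCR doubling.

```python
import random


def terminated_fragments(primer, extension):
    """Every copy that stops after 1, 2, ... bases of the extension."""
    return [primer + extension[:n] for n in range(1, len(extension) + 1)]


def read_ladder(fragments):
    """Sort by length (shortest migrates fastest) and read each 3' base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


def linear_copies(templates, cycles):
    """One primer: each cycle adds one copy per template (arithmetic increase)."""
    return templates * cycles


def exponential_copies(templates, cycles):
    """Two primers, ideal PCR: product doubles each cycle."""
    return templates * 2 ** cycles


# Hypothetical split of the sequence shown on the slide.
primer = "AGTTTTGGCTCGAACACGT"
extension = "CACAGCCTTTAA"

fragments = terminated_fragments(primer, extension)
random.shuffle(fragments)          # the tube holds them in no particular order
print(read_ladder(fragments))      # prints: CACAGCCTTTAA
print(linear_copies(1, 25), exponential_copies(1, 25))
```

Shuffling before reading makes the point that the order comes from the size separation, not from the reaction itself.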
This can be done separately for each strand of the PCR product, e.g., forward and reverse.

Figure: 377 sequencing gel image.

The standard cycle-sequencing reagent kit is one of the BigDye series sold by Applied Biosystems.
Figure: Applied Biosystems 310 instrument raw signal, before applying the dye matrix.
Figure: The same 310 signal following matrix analysis.
Figure: From commercial literature (Applied Biosystems 00106187.pdf) selling an improved sequencing chemistry kit. Short repeats make for a difficult template.
In the old days, before fancy colors, the only way to do this was with four separate sequencing reactions, each with only one kind of ddNTP.

Technical aspects of the cycle-sequencing electropherogram. The analysis software (usually):
-applies a spectral calibration
-determines approximate peak spacing
-calls the color (=base) according to spacing expectation
-returns an N if there is too little signal or conflicting signal
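The last two steps in the list above can be sketched as a simple rule applied at each expected peak position. This is a toy model, not the algorithm of any real analysis package: it assumes the spectral calibration and peak spacing have already been applied, so each position arrives as four per-base intensities, and the two thresholds (`min_signal`, `max_minor_ratio`) are hypothetical values.

```python
def call_base(signal, min_signal=50.0, max_minor_ratio=0.5):
    """Call one position from a dict of per-base intensities.

    Returns "N" when the strongest channel is too weak, or when the second
    strongest channel is large enough to conflict with it.
    """
    ranked = sorted(signal.items(), key=lambda item: item[1], reverse=True)
    (top_base, top), (_, second) = ranked[0], ranked[1]
    if top < min_signal:
        return "N"  # too little signal
    if second > max_minor_ratio * top:
        return "N"  # conflicting signal
    return top_base


def call_sequence(positions):
    """Apply call_base to each expected peak position in order."""
    return "".join(call_base(signal) for signal in positions)


# Hypothetical intensities at four consecutive peak positions.
positions = [
    {"A": 900, "C": 20, "G": 15, "T": 30},   # clean A
    {"A": 10, "C": 20, "G": 12, "T": 8},     # too little signal
    {"A": 400, "C": 30, "G": 380, "T": 10},  # A and G conflict
    {"A": 5, "C": 700, "G": 25, "T": 40},    # clean C
]
print(call_sequence(positions))  # prints: ANNC
```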
Sequencing requires no size standard. One simply notes the relative position of the peaks. The mobility files are software calibration settings that account for the fact that electrophoretic mobility is a function of DNA fragment length, instrument parameters (e.g., polymer type), and the type of dye.

(from www.udel.edu/dnasequence/udsgc/interpreting%20electropherograms.html)

Even a clean sequence data file will start with a few bases that are poorly resolved (peaks shifted and too broad). This is actually a mobility calibration problem. Do not expect to infer the sequence immediately adjacent to your sequencing primer.
Part of the initial mess can be signal from unincorporated ddNTPs (from www.biosci.ohio-state.edu/~pmgf/). This probably resulted from poor cleaning of the CS product (a process to remove small molecules). The underlying peaks might be manually called.

(from www.udel.edu/dnasequence/udsgc/interpreting%20electropherograms.html)

Peaks in the middle should be sharp, not much different in height, evenly spaced with little overlap, and have little underlying noise (signal from colors other than the major peak).
(from www.udel.edu/dnasequence/udsgc/interpreting%20electropherograms.html)

Toward the downstream end, peaks will become broad and irregular in shape. In part this is simply a statistical phenomenon: longer electrophoresis time means that random differences accumulate. However, this doesn't explain everything, and the pattern is not completely understood. If the PCR product was short, this may affect only a few bases.

Other potential problems.

Underlying (minor) peaks:
-If out of phase with the major peaks, it could be a variety of problems.
-If in phase with the major peaks, it could be a poor spectral calibration, or reflect a mixture of sequences in your PCR product.
A homomeric stretch will cause strand slippage (during PCR, CS, or cell replication). Downstream of the position where 2 or more sequences are out of phase, the electropherogram rapidly decays. Sequence toward the stretch from both directions. It may be that the homomeric stretch produced this replication error in the cell.

Figure 10.7, J.M. Butler (2005) Forensic DNA Typing, 2nd Edition, Elsevier Academic Press: HV1 C-stretch. (A) 16189T, good quality sequence. (B) Poor quality sequence (two length variants out of phase). (C) Primer strategies typically used with C-stretch containing samples: use of internal primers; double reactions from the same strand.

A so-called strong stop, in which peak height is suddenly reduced, is thought to result from template secondary structure inhibiting the polymerase. Sequence from both directions.
Too much sequencing product, sometimes a result of adding too much PCR product to the reaction, can lead to off-scale peaks. This can make it difficult to distinguish adjacent peaks of the same color. This problem may be unavoidable when sequencing a short amplicon, such as may be necessary for a degraded sample.

Figure: Initially blobs of signal from unincorporated ddNTPs. Downstream just low-level noise. The CS reaction was a failure. Do over.

Sequence electropherogram software often includes a peak quality score generated by the program Phred*.

Phred quality score    Probability that the base is called wrong
10                     1 in 10
20                     1 in 100      (common threshold for accepting a base call)
30                     1 in 1,000
40                     1 in 10,000
50                     1 in 100,000

*Ewing et al. 1998. Genome Res. 8:175-185.
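The table follows the standard Phred relationship Q = -10 log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions is below; the function names are illustrative, and the default acceptance threshold of 20 is simply the common value noted in the table.

```python
import math


def error_probability(quality):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-quality / 10)


def phred_quality(probability):
    """Phred score corresponding to a given error probability."""
    return -10 * math.log10(probability)


def accept_call(quality, threshold=20):
    """Accept a base call if its score meets the chosen threshold."""
    return quality >= threshold


for q in (10, 20, 30, 40, 50):
    print(q, f"1 in {round(1 / error_probability(q)):,}")
```

Because the scale is logarithmic, each additional 10 points means a tenfold drop in the chance of a miscall.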
(from en.wikipedia.org/wiki/file:phred_figure_1.jpg)
Common cycle-sequence product electrophoresis technology is limited to a maximum read of about 1000 bp. This requires a longer capillary or acrylamide gel than is needed for STR analysis, and the electrophoresis time is longer as well. For example, with the 3100 line of Applied Biosystems Genetic Analyzers, STR analysis works best* with a 36 cm capillary (~40 min), and sequencing with a 50 cm capillary (~2 hr).

*There is some flexibility, i.e., one can get roughly 400 bases of sequence using a 36 cm capillary.

Large scale sequencing: Genome projects.

The availability of published whole genomes allows anyone, including forensic biologists, to search by computer for new loci.
TO REPEAT... Common cycle-sequence product electrophoresis technology is limited to a maximum read of about 1000 bp.

Genomes are huge; therefore, in a project to determine the entire sequence, the sequence must somehow be obtained in short segments. The individual, overlapping sequences are then assembled into the final contiguous sequence using a lot of computer power. This is done in several ways.

The first widespread method was shotgun sequencing. The DNA is fragmented with restriction enzymes. Fragments up to 150 kb are then cloned.
(from www.premedcentral.net)

Cloning a segment of DNA in a bacterial cell is relatively easy because a circular plasmid will be replicated by the cell.

From NCBI web site: The fragments are ligated to a cloning vector DNA fragment, inserted in a bacterial cell for amplification (by cell division), then sequenced from the ends of the vector. Only ~1000 bp are obtained in each direction.

(from dimer.tamu.edu/young/genomics/images/index/image013.jpg)

Contig is the jargon term for the assembled sequence.
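To make the idea of assembling overlapping reads into a contig concrete, here is a toy greedy assembler: it repeatedly merges the two reads with the longest suffix-to-prefix overlap. It is only a sketch under strong assumptions (error-free reads, all from the same strand, no repeats), and the function names and minimum overlap are illustrative. Real assemblers must cope with sequencing errors, both strands, and repeated sequence, which is where the computer power goes.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads do not overlap: several contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping "reads" cut from a short example sequence.
target = "AGTTTTGGCTCGAACACGTCACAGCCTTTAA"
reads = [target[15:31], target[0:12], target[6:20]]
contigs = greedy_assemble(reads)
print(contigs == [target])  # prints: True
```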
There are ambiguities in the data, because of both sequencing and assembly errors. The quality of a genome data set is indicated by the number of independent times each base call has been replicated, e.g., 8X.

The technology has advanced quickly. Speed is up and cost is down. The first draft of the human genome (2001) cost $300M, $3B for the current draft. Jan 2006, report of rhesus monkey genome for $22M. Right now we seem to be at ~$15K for a human-size genome. The Holy Grail is the $1000 genome.

A couple of new technologies for massive sequencing have burst on the scene. One is based on pyrosequencing, and is sold by the 454 Life Sciences company.
Figures from 454 web site.

Each ssDNA fragment, with adaptor/primer site, is individually attached to a bead. The bead is isolated in an aqueous droplet within an oil emulsion. Each droplet contains the reagents for PCR, so the result is a huge number of independent emulsion PCR (emPCR) reactions in the same tube. The resulting PCR product sticks to the same bead.

The DNA-enriched beads are centrifuged into individual wells on a plate, and covered with smaller beads carrying the enzymes needed for pyrosequencing. 200,000 pyrosequencing reactions are detected at the same time.

Supposedly millions of bases of sequencing data can be generated per hour, and one person could do a complete bacterial genome in a few days.
So what is pyrosequencing?

PCR product is used as template for a sequencing reaction that requires only ~10 minutes (for short sequences) once the instrument is loaded. One primer is biotinylated for separation of one strand for the next step.

The addition of each nucleotide during extension involves the release of a pyrophosphate. This in turn triggers a cascade of reactions that includes a light-emitting (luciferase) step. The four different nucleotides are injected and then flushed from the reaction chamber in sequence.

http://www.pyrosequencing.com/graphics/3341.gif

A pyrogram.

http://www.ercim.org/publication/ercim_news/enw60/carlsson2.png

Although reads of >70 bp are reported, until recently data quality often seemed to drop after about 20 bp.
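The inject-and-flush cycle can be sketched as a small simulation. It assumes an idealized reaction in which the light signal from each injection is exactly proportional to the number of bases incorporated (so a run of three identical bases gives a peak three units high), and it uses a fixed A, C, G, T injection order; the function names, the injection order, and the example sequence are all illustrative choices.

```python
def pyrogram(new_strand, dispensation_order="ACGT"):
    """Simulate injections; return (nucleotide, signal) for each injection.

    new_strand is the sequence being synthesized, 5' to 3'. The signal is the
    number of bases incorporated during that injection (idealized).
    """
    unknown = set(new_strand) - set(dispensation_order)
    if unknown:
        raise ValueError(f"bases never dispensed: {sorted(unknown)}")
    flows = []
    position = 0
    while position < len(new_strand):
        for nucleotide in dispensation_order:
            run = 0
            while position < len(new_strand) and new_strand[position] == nucleotide:
                run += 1
                position += 1
            flows.append((nucleotide, run))
    return flows


def read_pyrogram(flows):
    """Reconstruct the sequence from peak heights."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = pyrogram("CACAGCCTTTAA")
print(flows[:4])             # prints: [('A', 0), ('C', 1), ('G', 0), ('T', 0)]
print(read_pyrogram(flows))  # prints: CACAGCCTTTAA
```

In a real instrument the peak height becomes harder to resolve as a homopolymer run gets longer, which this idealized model ignores.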
Pyrosequencing is great for large genotype screening projects, but it's less attractive if a long read is needed. Also, this technique does not handle mixtures well.