Introduction to Perl Programming Input/Output, Regular Expressions, String Manipulation. Beginning Perl, Chap 4 6. Example 1



Similar documents
( TUTORIAL. (July 2006)

GENEWIZ, Inc. DNA Sequencing Service Details for USC Norris Comprehensive Cancer Center DNA Core

(A) Microarray analysis was performed on ATM and MDM isolated from 4 obese donors.

10 µg lyophilized plasmid DNA (store lyophilized plasmid at 20 C)

Table S1. Related to Figure 4

Mutations and Genetic Variability. 1. What is occurring in the diagram below?

DNA Sample preparation and Submission Guidelines

UNIVERSITETET I OSLO Det matematisk-naturvitenskapelige fakultet

The p53 MUTATION HANDBOOK

Inverse PCR & Cycle Sequencing of P Element Insertions for STS Generation

Supplementary Information. Binding region and interaction properties of sulfoquinovosylacylglycerol (SQAG) with human

Next Generation Sequencing

Hands on Simulation of Mutation

Gene Synthesis 191. Mutagenesis 194. Gene Cloning 196. AccuGeneBlock Service 198. Gene Synthesis FAQs 201. User Protocol 204

pcas-guide System Validation in Genome Editing

Inverse PCR and Sequencing of P-element, piggybac and Minos Insertion Sites in the Drosophila Gene Disruption Project

ANALYSIS OF A CIRCULAR CODE MODEL

Gene Finding CMSC 423

SERVICES CATALOGUE WITH SUBMISSION GUIDELINES

Molecular analyses of EGFR: mutation and amplification detection

pcmv6-neo Vector Application Guide Contents

Marine Biology DEC 2004; 146(1) : Copyright 2004 Springer

Y-chromosome haplotype distribution in Han Chinese populations and modern human origin in East Asians

Biopython Tutorial and Cookbook

Part ONE. a. Assuming each of the four bases occurs with equal probability, how many bits of information does a nucleotide contain?

Chapter 9. Applications of probability. 9.1 The genetic code

Module 6: Digital DNA

The making of The Genoma Music

The DNA-"Wave Biocomputer"

Heraeus Sepatech, Kendro Laboratory Products GmbH, Berlin. Becton Dickinson,Heidelberg. Biozym, Hessisch Oldendorf. Eppendorf, Hamburg

DNA Bracelets

Transmembrane Signaling in Chimeras of the E. coli Chemotaxis Receptors and Bacterial Class III Adenylyl Cyclases

Drosophila NK-homeobox genes

NimbleGen SeqCap EZ Library SR User s Guide Version 3.0

Characterization of cdna clones of the family of trypsin/a-amylase inhibitors (CM-proteins) in barley {Hordeum vulgare L.)

Five-minute cloning of Taq polymerase-amplified PCR products

Mutation. Mutation provides raw material to evolution. Different kinds of mutations have different effects

BD BaculoGold Baculovirus Expression System Innovative Solutions for Proteomics

All commonly-used expression vectors used in the Jia Lab contain the following multiple cloning site: BamHI EcoRI SmaI SalI XhoI_ NotI

Coding sequence the sequence of nucleotide bases on the DNA that are transcribed into RNA which are in turn translated into protein

Insulin Receptor Gene Mutations in Iranian Patients with Type II Diabetes Mellitus

Provincial Exam Questions. 9. Give one role of each of the following nucleic acids in the production of an enzyme.

Six Homeoproteins and a Iinc-RNA at the Fast MYH Locus Lock Fast Myofiber Terminal Phenotype

Event-specific Method for the Quantification of Maize MIR162 Using Real-time PCR. Protocol

Association of IGF1 and IGFBP3 polymorphisms with colorectal polyps and colorectal cancer risk

Archimer

TITRATION OF raav (VG) USING QUANTITATIVE REAL TIME PCR

inhibition of mitosis

BioTOP-Report. Biotech and Pharma in Berlin-Brandenburg

Protein Synthesis Simulation

Chlamydomonas adapted Green Fluorescent Protein (CrGFP)

BioTOP-Report. Biotech and Pharma in Berlin-Brandenburg

Problem Set 3 KEY

4 th DVFA Life Science Conference. Going East / Going West. Life Science Asia/Europe getting insight in mutual growth opportunities

Molecular Facts and Figures

On Covert Data Communication Channels Employing DNA Recombinant and Mutagenesis-based Steganographic Techniques

Gene and Chromosome Mutation Worksheet (reference pgs in Modern Biology textbook)

Anhang A: Primerliste. 1. Primer für RT-PCR-Analyse. 2. Allgemeine Klonierungsprimer


The nucleotide sequence of the gene for human protein C

pentr Directional TOPO Cloning Kits

How To Clone Into Pcdna 3.1/V5-His

TA Cloning Kit. Version V 7 April Catalog nos. K , K , K , K , K K , K , K

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes

PROTOCOL: Illumina Paired-end Whole Exome Capture Library Preparation Using Full-length Index Adaptors and KAPA DNA Polymerase

Differentiation of Klebsiella pneumoniae and K. oxytoca by Multiplex Polymerase Chain Reaction

An Introduction to Bioinformatics Algorithms Gene Prediction

Immortalized epithelial cells from human autosomal dominant polycystic kidney cysts

Transcription:

Introduction to Perl Programming Input/Output, Regular Expressions, String Manipulation Beginning Perl, Chap 4 6 Example 1 #!/usr/bin/perl -w use strict; # version 1: my @nt = ('A', 'C', 'G', 'T'); for (1.. 100) { $dna.= $nt[int(rand 4)]; # version 4: with compositional bias my %comp = ( A => 0.20, C => 0.35, G => 0.25, T => 0.20 ); my @nt = keys %comp; my @bucket; # version 2: my @nt = qw(a C G T); for (1.. 100) { $dna.= $nt[int(rand 4)]; # version 3: most Perlish my @nt = qw(a C G T); for (1.. 100) { $dna.=$nt[int(rand(scalar @nt))]; print $dna, \n ; # fill the bucket to draw from: for my $nt (@nt) { push @bucket, ($nt) x ($comp{$nt * 100); for (0.. 99) { $dna.= $bucket[int(rand scalar @bucket)]; 1

Example 2 #!/usr/bin/perl -w use strict; # version 2: very Perlish; my @letters = ( A.. Z ); # version 1; # same as nt construction my @aa = qw(a C D E F G H I K L M N P Q R S T V W Y); my $protein; for (1.. 200) { $protein.= $aa[int(rand scalar @aa)]; print $protein, \n ; my %comp; # hash slice: @comp{@letters = ( 1 / 20 ) x @letters; # remove letter that aren't aa's $comp{'b' = $comp{'j' = $comp{'o' = $comp{'x' = $comp{'u' = $comp{'z' = 0; my @bucket; for my $l (@letters) { push @bucket ($l) x ($comp{$l * 100); my $length = 200 + rand(100); my $protein; for (1.. $length) { push $protein, $bucket[rand @bucket]; Example 3 #!/usr/bin/perl -w use strict; my %codontable = ( 'AAA' => 'K', 'AAC' => 'N', 'AAG' => 'K', 'AAT' => 'N', 'ACA' => 'T', 'ACC' => 'T', 'ACG' => 'T', 'ACT' => 'T', 'AGA' => 'R', 'AGC' => 'S', 'AGG' => 'R', 'AGT' => 'S', 'ATA' => 'I', 'ATC' => 'I', 'ATG' => 'M', 'ATT' => 'I', 'CAA' => 'Q', 'CAC' => 'H', 'CAG' => 'Q', 'CAT' => 'H', 'CCA' => 'P', 'CCC' => 'P', 'CCG' => 'P', 'CCT' => 'P', 'CGA' => 'R', 'CGC' => 'R', 'CGG' => 'R', 'CGT' => 'R', 'CTA' => 'L', 'CTC' => 'L', 'CTG' => 'L', 'CTT' => 'L', 'GAA' => 'E', 'GAC' => 'D', 'GAG' => 'E', 'GAT' => 'D', 'GCA' => 'A', 'GCC' => 'A', 'GCG' => 'A', 'GCT' => 'A', 'GGA' => 'G', 'GGC' => 'G', 'GGG' => 'G', 'GGT' => 'G', 'GTA' => 'V', 'GTC' => 'V', 'GTG' => 'V', 'GTT' => 'V', 'TAA' => '*', 'TAC' => 'Y', 'TAG' => '*', 'TAT' => 'Y', 'TCA' => 'S', 'TCC' => 'S', 'TCG' => 'S', 'TCT' => 'S', 'TGA' => '*', 'TGC' => 'C', 'TGG' => 'W', 'TGT' => 'C', 'TTA' => 'L', 'TTC' => 'F', 'TTG' => 'L', 'TTT' => 'F' ); # make DNA the boring way: my @nt = qw(a C G T); my $length = 500 + int(rand(500)); for (1.. $length) { $dna.= $nt[int(rand @nt)]; # OK, now translate: my $protein; my @seq = split('', $dna); my $codon = ''; for my $i (0.. $#seq) { $codon.= $seq[$i]; if (($i + 1) % 3 == 0) { $protein.= $codontable{$codon; $codon = ''; # another way, using "length()": for my $nt (@seq) { $codon.= $nt; if (length($codon) == 3) { $protein.= $codontable{$codon; $codon = ''; # yet another variation on the theme: while (@seq) { $codon.= shift @seq; if (length $codon == 3) { $protein.= $codontable{$codon; $codon = ''; # even better: while (@seq) { $protein.= $codontable{ shift @seq. shift @seq. shift @seq; 2

Input/Output open (INFILE, < gtm1_human.aa ); open file for reading $file = $ARGV[0]; open (INFILE, < $file ) die ( Cannot open $file ); open (OUTFILE, > $file); open file for writing $line = <FILE>; read a line chomp($line); remove \n while ($line=<file>) { print FILE $line\n ; close(file); Regular expressions >gi 121694 sp P20432 GTT1_DROME Glutathione S-transferase 1-1 used for string matching, substitution, pattern extraction /^>gi\ / matches >gi 121694 sp P20432 GTT1_DROME if ($line =~ m/^>gi/) { #match $line =~ /^>gi\ (\d+)\ /; # extract gi# $gi = $1; ($gi) = $line =~ /^>gi\ (\d+)\ /; #same $line =~ s/^>(.*)$/>>$1/; # substitution 3

Regular expressions (cont.) >gi 121694 sp P20432 GTT1_DROME Glutathione S-transferase 1-1 m/plaintext/ m/one two/; # alternation m/(one two) (three)/; # grouping with # parenthesis /^>gi\ (\d+)/ #beginning of line /.+ (\d+) aa$/ # end of line /a*bc/ # bc,abc,aabc, # repetitions /a?bc/ # abc, bc /a+bc/ # abc, aabc, Regular Expressions, III >gi 121694 sp P20432 GTT1_DROME Glutathione S-transferase 1-1 Matching classes: /^>gi\ [0-9]+\ [a-z]+\ [A-Z][0-9a-z]+\ / [a-z] [0-9] -> class [^a-z] -> negated class /^>gi\ \d+\ [a-z]\ \w+\ / \d -> number [0-9] \D -> not a number \w -> word [0-9A-Za-z_] \W -> not a word char \s -> space [ \t\n\r] \S -> not a space Capturing matches: /^>gi\ (\d+)\ ([a-z])\ (\w+)\ / $1 $2 $3 ($gi,$db,$db_acc) = $line =~ /^>gi\ (\d+)\ ([a-z])\ (\w+)\ /; 4

Regular expressions - modifiers m/that/i # ignore case s/this/that/g # global replacement m/>gi\ (\d+)\ ([a-z]{2,3 (\w+) #{range s///m # treat string as multiple lines s///s # span over \n s/\n//gs # remove \n in multiline entry s/gaattc/$1\n/g # break lines at EcoRI site { local $/ = "\n>"; # change the line separator to "\n" while ($entry = <INFILE>) { chomp($entry); $entry =~ s/\a>?/>/; # replace missing >, if necessary # \A is like ^ (beginning of string) open(out, ">file$num.fa") or die $!; $num++; print OUT "$entry\n"; close(out); String expressions (with regular expressions) if ( /^>gi\ / ) { if ( $line =~ m/^>gi\ /) { while ( $line!~ m/^>gi\ /) { Substitution: $new_line =~ s/\ /:/g; Pattern extraction: ($gi,$db,$db_acc) = $line = /^>gi\ (\d+)\ ([a-z])\ (\w+)\ /; split (/ /,$line) join ( :,@fields); substr($string,$start,$length); # rarely used index($string,$query); # rarely used Comparison: eq ne cmp lt gt le ge 5

use vars qw(%proteins %dnas $line $name $codon $aa); my ($pfile, $nfile) = @ARGV: open(protein, "<$pfile") or die $!; open(dna, "<$nfile") or die $!; while(my $line = <PROTEIN>) { chomp($line); if($line =~ m/^>/) { $name = shift split(' ', $line); next; $proteins{$name.= $line if $name; close(protein); for my $name (keys %proteins) { $proteins{$name =~ s/\s+//sg; while($line = <DNA>) { chomp($line); if($line =~ m/^>/) { $name =shift split(' ', $line); next; $dnas{$name.= $line if $name; close(dna); Perl mrtrans.pl for $name (keys %dna) { unless (exists $protein{$name) { warn "DNA sequence $name has no matching protein sequence!\n"; next; $dna{$name =~ s/\s+//sg; print "$name\n"; my @protein=split('', $proteins{$name); my @dna = split('', $dnas{$name); for $aa (@protein) { if($aa eq '-') { print "---"; else { $codon = shift @dna. shift @dna. shift @dna; # check that $codon codes for $aa print $codon; print "\n"; Exercises 1. Take a set of FASTA format files and make a FASTA library, combining those files 2. Take a FASTA library, and break it into individual FASTA files 3. Write an extraction program to report (a) the most significant library hit (b) all significant library hits that result from a FASTA search. Several example search results are available at watson.achs:/r0/seqlib/bioch508/perl/ fasta.result* 4. Write the same program for blast output. Sample results are in the same directory 5. Write a perl equivalent to mrtrans that does not care about the order of sequence input. Given two files, a FASTA protein library (with ʻ-ʼ for gaps) and a FASTA DNA library with the same sequence names, generate a FASTA DNA alignment with ʻ---ʼ inserted into the DNA sequence for each ʻ-ʼ in the protein sequence 6