Apply PERL to BioInformatics (II)



Similar documents
Internet Technologies

Biological Databases and Protein Sequence Analysis

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

CGI Programming. What is CGI?

07 Forms. 1 About Forms. 2 The FORM Tag. 1.1 Form Handlers

Further web design: HTML forms

Forms, CGI Objectives. HTML forms. Form example. Form example...

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Resources at a Glance

HTML Forms. Pat Morin COMP 2405

CD-HIT User s Guide. Last updated: April 5,

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

<option> eggs </option> <option> cheese </option> </select> </p> </form>

Viewing Form Results

Perl in a nutshell. First CGI Script and Perl. Creating a Link to a Script. print Function. Parsing Data 4/27/2009. First CGI Script and Perl

HTML Forms and CONTROLS

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Genome Explorer For Comparative Genome Analysis

JavaScript and Dreamweaver Examples

Bio-Informatics Lectures. A Short Introduction

Pairwise Sequence Alignment

2- Forms and JavaScript Course: Developing web- based applica<ons

A Tutorial in Genetic Sequence Classification Tools and Techniques

Getting started in Bio::Perl 1) Simple script to get a sequence by Id and write to specified format

Real SQL Programming 1

XHTML Forms. Form syntax. Selection widgets. Submission method. Submission action. Radio buttons

HTML Tables. IT 3203 Introduction to Web Development

Introduction to XHTML. 2010, Robert K. Moniot 1

By Glenn Fleishman. WebSpy. Form and function

LabVIEW Internet Toolkit User Guide


HTML Form Widgets. Review: HTML Forms. Review: CGI Programs

PHP Form Handling. Prof. Jim Whitehead CMPS 183 Spring 2006 May 3, 2006

Perl/CGI. CS 299 Web Programming and Design

PHP Tutorial From beginner to master

10CS73:Web Programming

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

CS412 Interactive Lab Creating a Simple Web Form

GenBank, Entrez, & FASTA

Communication. Container-Exchange

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

BIOINFORMATICS TUTORIAL

FORM-ORIENTED DATA ENTRY

UGENE Quick Start Guide

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])

CGI.pm Tutorial. Table of content. What is CGI? First Program

CSCI110: Examination information.

Introduction to GCG and SeqLab

People Data and the Web Forms and CGI. HTML forms. A user interface to CGI applications

Clone Manager. Getting Started

PDG Software. Site Design Guide

Database manager does something that sounds trivial. It makes it easy to setup a new database for searching with Mascot. It also makes it easy to

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

c. Write a JavaScript statement to print out as an alert box the value of the third Radio button (whether or not selected) in the second form.

Dynamic Web-Enabled Data Collection

Module 1. Sequence Formats and Retrieval. Charles Steward

Now that we have discussed some PHP background

7 Why Use Perl for CGI?

Web Development 1 A4 Project Description Web Architecture

InternetVista Web scenario documentation

How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)

Using Form Tools (admin)

(A GUIDE for the Graphical User Interface (GUI) GDE)

Client-side Web Engineering From HTML to AJAX

Web Design and Development ACS Chapter 13. Using Forms 11/30/2015 1

IBM Tivoli Access Manager for e-business 6.1: Writing External Authentication Interface Servers

Novell Identity Manager

Lecture 2, Introduction to Python. Python Programming Language

Web Services for Management Perl Library VMware ESX Server 3.5, VMware ESX Server 3i version 3.5, and VMware VirtualCenter 2.5

TCP/IP Networking, Part 2: Web-Based Control

Designing and Implementing Forms 34

Version 5.0 Release Notes

InPost UK Limited GeoWidget Integration Guide Version 1.1

Sequence Database Administration

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Geneious Biomatters Ltd

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

Intell-a-Keeper Reporting System Technical Programming Guide. Tracking your Bookings without going Nuts!

Inserting the Form Field In Dreamweaver 4, open a new or existing page. From the Insert menu choose Form.

Regular Expressions and Pattern Matching

EUROPEAN COMPUTER DRIVING LICENCE / INTERNATIONAL COMPUTER DRIVING LICENCE WEB EDITING

Tutorial 6 Creating a Web Form. HTML and CSS 6 TH EDITION

JavaScript: Arrays Pearson Education, Inc. All rights reserved.

JAVASCRIPT AND COOKIES

Dynamic Content. Dynamic Web Content: HTML Forms CGI Web Servers and HTTP

Molecular Databases and Tools

Configuring iplanet 6.0 Web Server For SSL and non-ssl Redirect

Working with forms in PHP

Web Programming with PHP 5. The right tool for the right job.

Transcription:

Apply PERL to BioInformatics (II) Lecture Note for Computational Biology 1 (LSM 5191) Jiren Wang http://www.bii.a-star.edu.sg/~jiren BioInformatics Institute Singapore

Outline Some examples for manipulating biological sequence data. Create Graphical User Interface to BLAST Introduction to BioPerl

Concatenating DNA Fragments #!/usr/bin/perl w # Calculating the reverse complement of a strand of DNA $DNA1 = AGCGGGCCAGACGCAGCCGAGTATCTT ; $DNA2 = GTGGAGTTCCGCCGCTGTTCTTCT ; $DNA3 = $DNA1. $DNA2; print The concatenation of the two fragments is: ; print $DNA3\n\n ;

Extracting a Subsequence Function substr Three arguments: a string value, a start position, and a length. #!/usr/bin/perl w $sequence = MVLERILLQTIKFDLQVEHPYQFLLKYAKQLK. GDKNKIQKLVQMAWTFVNDSLCTTLSLQWEPE. IIAVAVMYLAGRLCKFEIQEWTSKP ; $start = 8; $length = 10; $subsequence = substr($sequence, $start, $length); print The subsequence is: $subsequence\n\n ; # The subsequence is: QTIKFDLQVE

Reversing Complement of Sequence #!/usr/bin/perl w # Calculating the reverse complement of a strand of DNA $DNA = ACGGGAGCGGAAAATTACGGCATTAGC ; $revcom = reverse $DNA; # $revcom =~ s/a/t/g; # $revcom =~ s/t/a/g; # $revcom =~ s/g/c/g; # $revcom =~ s/c/g/g; $revcom =~ tr/acgtacgt/tgcatgca/;

Input from STDIN Filehandle A filehandle is just a name you give to a file, device, socket, or pipe to help you remember which one you are talking about. STDIN in Perl is a filehandle for standard input. chomp function removes all trailing newlines from string(s) in paragraph mode ($/ is empty). $a = <STDIN>; @a = <STDIN>;

Motifs Motifs are short segments of DNA or protein that are of particular interest. Looking for motifs are one of the most common things in bioinformatics. Motifs may be regulatory elements of DNA or short stretches of protein that are known to conserved across many species. Motifs can often be represented as regular expressions.

Finding Motifs #!/usr/bin/perl w print Enter the sequence file name: ; $filename = <STDIN>; chomp $filename; open(in, $filename ) die Cannot open file. ; @seqs = <IN>; close IN; $seq = join(,@seqs); $seq =~ s/\s//g; Print Enter the motif: ; $motif = <STDIN>; chomp $motif; if ($seq =~ /$motif/) { print The motif $motif has been found.\n ; } else { print The motif $motif hasn t been found.\n ; }

Stand-alone BLAST Steps for downloading and installing standalone BLAST executables $ ftp ftp.ncbi.nih.gov Name (ftp.ncbi.nih.gov:username): anonymous Password: anonymous ftp>cd blast/executables ftp>bin ftp>get blast.linux.tar.z ftp>quit $gunzip blast.linux.tar.z $tar xvf blast.linux.tar

Program blastall Some arguments -p Program Name [String] -d Database [String] default = nr -i Query File [File In] -e Expectation value (E) [Real] default = 10.0 -T Produce HTML output [T/F] default = F

Table of BLAST Programs Program blastp blastn blastx tblastn tblastx Description compares an amino acid query sequence against a protein sequence database. compares a nucleotide query sequence against a nucleotide sequence database. compares a nucleotide query sequence translated in all reading frames against a protein sequence database. compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Program formatdb Some arguments -i Input file(s) for formatting -p Type of file T protein, F nucleotide. -o Parse options: T - True: Parse SeqId and create indexes. F - False: Do not parse SeqId. Do not create indexes. -n Base name for BLAST files [String] formatdb -i testdb.txt p F o T n testdb

Package and Module Package A package is simply a set of Perl statements assigned to a user defined name space. Package declaration: package NAMESPACE Module A module is just a reusable package that is defined in a library file whose name is the same as the name of the package (with a.pm on the end). Include Perl modules in your program: use Module.

Introduction to CGI The CGI.pm module CGI Common Gateway Interface CGI.pm is a module for programming interactive web pages. Its functions are geared toward formatting web pages and creating and processing forms in which user enter information.

How a CGI Application Is Executed?

Forms and CGI HTML forms Provide input to your CGI scripts. The <form> tag <form action= /cgi-bin/test.cgi method= post > </form>

Quick Reference to Form Tags <FORM ACTION= /cgi-bin/test.cgi METHOD= POST <INPUT TYPE= text NAME= name VALUE= value SIZE= size > <INPUT TYPE= password NAME= name VALUE=value SIZE= size > <INPUT TYPE= hidden NAME= name VALUE= value > <INPUT TYPE= checkbox NAME= name VALUE= value > <INPUT TYPE= radio NAME= name VALUE= value > <SELECT NAME= name SIZE=1> <OPTION SELECTED>Item One</OPTION> <OPTION>Item Two</OPTION> </SELECT> <TEXTAREA ROWS=yy COLS=xx NAME= name > </TEXTAREA> <INPUT TYPE= submit NAME= name VALUE= value > <INPUT TYPE= reset VALUE= value > </FORM> Form Tag Description Start the form Text field Password field Hidden field Checkbox Radio button Menu (drop-down) Multiline text field Submit button Reset button End the form

Form Action and Method Action Specifies the URL of the CGI script that should receive the HTTP request made by the CGI script. HTTP HyperText Transfer Protocol. Method Specifies the HTTP request method used when calling the CGI script. GET is the standard request method for retrieving a document via HTTP on the Web. POST is used with HTML forms to submit information that alters data stored on the web server.

Example of HTML <HTML> <TITLE>BLAST Search </TITLE> <H1><CENTER>BLAST (@BII)</CENTER></H1> <FORM ACTION="/cgi-bin/blast/blast.pl" METHOD = POST NAME="MainBlastForm" ENCTYPE= "multipart/form-data"> <B>Choose program to use and database to search:</b> <P> <a href="docs/blast_program.html">program</a> <select name = "PROGRAM"> <option> blastn <option> blastp <option> blastx <option> tblastn <option> tblastx </select> <a href="docs/blast_databases.html">database</a>

Example of HTML (cont d) <select name = "DATALIB"> <option VALUE = "nr"> nr <option VALUE = "swissprot"> swissprot <option VALUE = "pat"> pat <option VALUE = "yeast"> yeast <option VALUE = "ecoli"> ecoli <option VALUE = "pdb"> pdb <option VALUE = "drosoph"> Drosophila genome <option VALUE = "month"> month <option VALUE = "est"> est <option VALUE = "est_human"> est_human <option VALUE = "est_mouse"> est_mouse <option VALUE = "est_others"> est_others <option VALUE = "htg"> htgs <option VALUE = "gss"> gss <option VALUE = "sts"> dbsts <option VALUE = "mito"> mito <option VALUE = "vector"> vector </select>

Example of HTML (cont d) <P> Enter sequence below in <a href="docs/fasta.html">fasta</a> format <BR> <textarea name="sequence" rows=6 cols=60> </textarea> <BR> Set subsequence From: <INPUT TYPE="text" NAME="FROM" SIZE=6> To: <INPUT TYPE="text" NAME="TO" SIZE=6> <P> <INPUT TYPE="button" VALUE="Clear sequence" onclick="mainblastform.sequence.value='';mainblastform.sequence.focus();"> <INPUT TYPE="reset" VALUE="Reset"> <INPUT TYPE="submit" VALUE="Search"> </FORM> </HTML> Note: focus() moving cursor back to a field.

Screenshot of GUI for BLAST

Process Filehandle Using processes as filehandles Create a process-filehandle that either captures the output from the process or provides input to the process. Open a process-filehandle for reading by putting the vertical bar on the right of the command. #!/usr/bin/perl open(lsproc, "ls "); @lsoutput = <LSPROC>; close(lsproc); print @lsoutput;

CGI Program #!/usr/bin/perl -w use strict; use CGI; my ($form, $program, $datalib, $sequence, $from, $to); my ($length); my (@lines); my ($state, $line, $title, $seq); my ($seqfilename, $options); print "Content-type: text/html\n\n"; $form = new CGI; $program = $form->param('program'); $datalib = $form->param('datalib'); $sequence = $form->param('sequence'); $from = $form->param('from'); $to = $form->param('to');

CGI Program (cont d) if ($sequence ne ) { $state = 0; $seq = ""; my (@lines) = split /\n/, $sequence; foreach $line (@lines) { if ($line =~ /^>/) { $state++; if ($state == 1) { $title = $line; $title =~ s/^[\s]+//; $title =~ s/[\s]+$//; } } else { if ($state == 1) { $seq.= $line; } } } $seq =~ s/[\s]+//g; $seq =~ tr/a-z/a-z/; $length = length($seq);

CGI Program (cont d) if (($to > $from) && ($from < $length) && ($to > 1)) { if ($from < 0) { $from = 0; } if ($to > $length) { $to = $length; } $seq = substr($seq, $from - 1, $to - $from + 1); } $seqfilename = /tmp/seq ; open(out, ">$seqfilename"); print OUT "$title\n$seq"; close(out); $options = "-T -p $program -d /ncbi/blast/db/$datalib -i $seqfilename"; open(in, /ncbi/blast/executables/blastall $options "); while(<in>) { print $_; } close(in); } else { print Please enter the query sequence in FASTA format.<br> ; }

Introduction to BioPerl BioPerl An important collection of Perl code for bioinformatics that has been in development since 1998. Dedicate to the creation of an open source Perl tools for bioinformatics, genomics and life science research.

The Main Focus of Bioperl Modules Perform sequence manipulation. Provide access to various databases (both local and web-based). Parsing of the results of various molecular biology programs.

Modules for Typical Tasks Accessing sequence data from local and remote database. Transforming formats of database / file records. Manipulating individual sequences. Searching for "similar" sequences. Creating and manipulating sequence alignments. Searching for genes and other structures on genomic DNA. Developing machine readable sequence annotations.

PERL s Objects An object is simply a referenced thingy that happens know which class it belongs to. A class is simply a package that happens to provide methods to deal with objects. A method is simply a subroutine that expects an object reference as its first argument.

Method Invocation The object-oriented syntax form looks like this: CLASS_OR_INSTANCE->METHOD(LIST)

Sequence File seqa.fasta > seq1 MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFLGRWFSAGLAS NSSWLREKKAALSMCKSVVAPATDGGLNLTSTFLRKNQCETRTMLLQPAG SLGSYSYRSPHWGSTYSVSVVETDYDQYALLYSQGSKGPGEDFRMATLYS RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ > seq2 APEAQVSVQPNFQPDKFLGRWFSAGLASNSSWLQEKKAALSMCKSVVAPA ADGGFNLTSTFLRKNQCETRTMLLQPGDSLGSYSYRSPHWGSTYSVSVVE TDYDHYALLYSQGSKGPGEDFRMATLYSRTQTPRAELKEKFTAFCKAQGF TEDSIVFLPQTDKCMTEQ

Bio::SeqIO Bio::SeqIO is a handler module for the formats in the SeqIO set (eg, Bio::SeqIO::fasta). #!/usr/bin/perl -w use Bio::SeqIO; $str = Bio::SeqIO->new( -file=>'seqa.fasta', '-format' => 'Fasta'); $input = $str->next_seq(); $input2 = $str->next_seq(); print "Sequence 1\n"; print $input->id, "\n"; print $input->seq, "\n\n"; print "Sequence 2\n"; print $input2->id, "\n"; print $input2->seq, "\n\n";

Example: Sequence Data Format Transfer #!/usr/bin/perl -w use Bio::SeqIO; use strict; my $in = Bio::SeqIO->new('-file' => "seqa.fasta", '-format' => 'Fasta'); my $out = Bio::SeqIO->new('-file' => ">seqa.embl", '-format' => 'EMBL'); while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); }

Example: Sequence Data Format Transfer (cont d) Output result: ID seq1 standard; AA; UNK; 190 BP. XX AC unknown; XX DE XX FH Key Location/Qualifiers FH XX SQ Sequence 190 BP; 17 A; 4 C; 14 G; 17 T; 138 other; mathhtlwmg lallgvlgdl qaapeaqvsv qpnfqqdkfl grwfsaglas nsswlrekka 60 alsmcksvva patdgglnlt stflrknqce trtmllqpag slgsysyrsp hwgstysvsv 120 vetdydqyal lysqgskgpg edfrmatlys rtqtpraelk ekftafckaq gftedtivfl 180 pqtdkcmteq 190 // ID seq2 standard; AA; UNK; 168 BP. XX AC unknown; XX DE XX FH Key Location/Qualifiers FH XX SQ Sequence 168 BP; 14 A; 4 C; 11 G; 13 T; 126 other; apeaqvsvqp nfqpdkflgr wfsaglasns swlqekkaal smcksvvapa adggfnltst 60 flrknqcetr tmllqpgdsl gsysyrsphw gstysvsvve tdydhyally sqgskgpged 120 frmatlysrt qtpraelkek ftafckaqgf tedsivflpq tdkcmteq 168 //

Bio::Tools::Run::StandAloneBlast #!/usr/bin/perl -w use Bio::Tools::Run::StandAloneBlast; # use the following array to input parameter @params = ('d' => 'swissprot', 'o' => 'blast.out', '_READMETHOD' => 'Blast', 'T' => 'F', 'p' => 'blastp' ); $factory = Bio::Tools::Run::StandAloneBlast->new(@params); # Blast a sequence against a database: $str = Bio::SeqIO->new( -file=>'seqa.fasta', '-format' => 'Fasta'); $input = $str->next_seq(); print "Generate the searching result now: \n"; $blast_report = $factory->blastall($input);

Input Sequence for ClustalW >seq1 MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFLGRWFSAGLAS NSSWLREKKAALSMCKSVVAPATDGGLNLTSTFLRKNQCETRTMLLQPAG SLGSYSYRSPHWGSTYSVSVVETDYDQYALLYSQGSKGPGEDFRMATLYS RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ >seq2 APEAQVSVQPNFQPDKFLGRWFSAGLASNSSWQEKKAALSMCKSVVAPA ADGGFNLTSTFLKNQCETRTMLLQPGDSLGSYSYRSPHWGSTYSVSVVE TDYDHYALLYSQGSKGPGEDFRMATLYSRTQTPRAELKEKFAFCKAQGF TEDSIVFLPQTDKCMTEQ >seq3 MAALRMLWMGLVLLGLLGFPQTPAQGHDTVQPNFQQDKFLGRWYSAGLAS NSSWFREKKAVLYMCKTVVAPSTEGGLNLTSTFLRKNQCETKIMVLQPAG APGHYTYSSPHSGSIHSVSVVEANYDEYALLFSRGTKGPGQDFRMATLYS RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE

Bio::Tools::Run::Alignment::Clustalw #!/usr/bin/perl -w use strict; use Bio::Tools::Run::Alignment::Clustalw; my $inputfilename = 'seq.txt'; my $outfile = "clustalw_result.txt"; # Build a clustalw alignment factory my @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile'=> $outfile); my $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params); my $clustalw_location = $factory->program("/home/webuser/clustalw1.82/clustalw"); print ("\nthe location of clustalw is $clustalw_location.\n"); # Pass the factory a list of sequences to be aligned. my $aln = $factory->align($inputfilename); # $aln is a SimpleAlign object. print " length = ", $aln->length, "\n"; print " no_residues = ", $aln->no_residues, "\n"; print " no_sequences = ", $aln->no_sequences, "\n"; print " percentage_identity = ", $aln->percentage_identity, "\n";

Output on Screen The location of clustalw is /home/webuser/clustalw1.82/clustalw. CLUSTAL W (1.82) Multiple Sequence Alignments Sequence format is Pearson Sequence 1: seq1 190 aa Sequence 2: seq2 165 aa Sequence 3: seq3 189 aa Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 95 Sequences (1:3) Aligned. Score: 72 Sequences (2:3) Aligned. Score: 70 Guide tree file created: [./seq.dnd] Start of Multiple Alignment There are 2 groups Aligning... Group 1: Sequences: 2 Score:2467 Group 2: Sequences: 3 Score:2701 Alignment Score 2488 GCG-Alignment file created [clustalw_result.txt] length = 190 no_residues = 26 no_sequences = 3 percentage_identity = 78.3783783783784

Output: clustalw_result.txt PileUp MSF: 190 Type: P Check: 4001.. Name: seq1 oo Len: 190 Check: 5441 Weight: 23.3 Name: seq2 oo Len: 190 Check: 3046 Weight: 30.0 Name: seq3 oo Len: 190 Check: 5514 Weight: 46.6 // seq1 MATHHTLWMG LALLGVLGDL QAAPEAQVSV QPNFQQDKFL GRWFSAGLAS seq2........apeaqvsv QPNFQPDKFL GRWFSAGLAS seq3 MAALRMLWMG LVLLGLLGFP QTPAQGHDTV QPNFQQDKFL GRWYSAGLAS seq1 NSSWLREKKA ALSMCKSVVA PATDGGLNLT STFLRKNQCE TRTMLLQPAG seq2 NSSWQ.EKKA ALSMCKSVVA PAADGGFNLT STFLKN.QCE TRTMLLQPGD seq3 NSSWFREKKA VLYMCKTVVA PSTEGGLNLT STFLRKNQCE TKIMVLQPAG seq1 SLGSYSYRSP HWGSTYSVSV VETDYDQYAL LYSQGSKGPG EDFRMATLYS seq2 SLGSYSYRSP HWGSTYSVSV VETDYDHYAL LYSQGSKGPG EDFRMATLYS seq3 APGHYTYSSP HSGSIHSVSV VEANYDEYAL LFSRGTKGPG QDFRMATLYS seq1 RTQTPRAELK EKFTAFCKAQ GFTEDTIVFL PQTDKCMTEQ seq2 RTQTPRAELK EKF.AFCKAQ GFTEDSIVFL PQTDKCMTEQ seq3 RTQTLKDELK EKFTTFSKAQ GLTEEDIVFL PQPDKCIQE.