Regular Expressions and Pattern Matching james.wasmuth@ed.ac.uk

Regular Expression (regex): a separate language for constructing patterns. It is used in most programming languages and is very powerful in Perl.

Pattern Match: using a regex to search data and look for a match.

Overview:
how to create regular expressions
how to use them to match and extract data
biological context

So Why Regex?

Parse files of data and information:
    fasta
    embl / genbank format
    html (web pages)
    user input to programs
Check format
Find illegal characters (validation)
Search for sequence motifs

Simple Patterns

Place the regex between a pair of forward slashes (/ /).

try:

    #!/usr/bin/perl
    while (<STDIN>) {
        if (/abc/) {
            print "1 >> $_";
        }
    }

Run the script.
Type in something that contains abc: abcfoobar
Type in something that doesn't: fgh cba foobar ab c foobar
The print statement runs only if abc is matched within the typed input.

Simple Patterns (2)

Can also match strings from files. genomes_desc.txt contains a few text lines with information about three genomes.

try:

    #!/usr/bin/perl
    open IN, "<genomes_desc.txt";
    while (<IN>) {
        if (/elegans/) {    #match lines with this regex
            print;          #print lines with match
        }
    }

Parses each line in turn. Looks for elegans anywhere in the line held in $_.

Flexible matching

There are many characters with special meanings: metacharacters.

star (*) matches any number of instances
    /ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
           => abc or abbbbbbbc or ac
plus (+) matches at least one instance
    /ab+c/ => 'a' followed by one or more 'b' followed by 'c'
           => abc or abbc or abbbbbbbbbbbbbbc, NOT ac
question mark (?) matches zero or one instance
    /ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'
           => abc or ac
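
The three quantifiers can be compared side by side. This is only a sketch: the test strings and the %result hash are invented for illustration and are not part of the course files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test strings, chosen to show where the quantifiers differ.
my @strings = ('ac', 'abc', 'abbbc');
my %result;

for my $s (@strings) {
    $result{$s} = {
        star     => ($s =~ /ab*c/ ? 1 : 0),    # zero or more 'b'
        plus     => ($s =~ /ab+c/ ? 1 : 0),    # one or more 'b'
        question => ($s =~ /ab?c/ ? 1 : 0),    # zero or one 'b'
    };
    print "$s\t* $result{$s}{star}\t+ $result{$s}{plus}\t? $result{$s}{question}\n";
}
```

Only 'abc' is accepted by all three patterns; 'ac' fails /ab+c/ and 'abbbc' fails /ab?c/.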

More General Quantifiers

Match a character a specific number of times, or within a range of instances.

{x} will match exactly x instances.
    /ab{3}c/ => abbbc
{x,y} will match between x and y instances.
    /a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x or more instances.
    /abc{3,}/ => abccc or abccccccccc or abccccccccccccccccccccccccccccccccccccccccccccccc (and so on)
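
A short sketch of counted quantifiers in Perl, written with the closing brace that Perl requires ({3}, {2,4}, {3,}). The test strings and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $exact   = ('abbbc'    =~ /ab{3}c/)   ? 1 : 0;   # exactly three 'b'
my $too_few = ('abbc'     =~ /ab{3}c/)   ? 1 : 0;   # only two 'b', so no match
my $range   = ('aaabc'    =~ /a{2,4}bc/) ? 1 : 0;   # three 'a' is within 2..4
my $open    = ('abcccccc' =~ /abc{3,}/)  ? 1 : 0;   # six 'c' is at least three
my $short   = ('abcc'     =~ /abc{3,}/)  ? 1 : 0;   # two 'c' is too few

# An unanchored pattern only needs to match part of the string, so six 'a'
# still match: the last four 'a' followed by 'bc' satisfy the pattern.
my $six_a   = ('aaaaaabc' =~ /a{2,4}bc/) ? 1 : 0;

print "exact=$exact too_few=$too_few range=$range open=$open short=$short six_a=$six_a\n";
```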

More metacharacters

dot (.) matches any character, even tab (\t) and space, but not newline (\n).
    /a.*c/ => 'a' followed by any number of any characters followed by 'c'
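
A minimal check of what the dot will and will not match, assuming default Perl behaviour (no modifiers). The strings are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tab     = ("a\tc"   =~ /a.c/)  ? 1 : 0;   # dot matches a tab
my $space   = ("a c"    =~ /a.c/)  ? 1 : 0;   # dot matches a space
my $newline = ("a\nc"   =~ /a.c/)  ? 1 : 0;   # dot does not match a newline
my $any     = ("a to c" =~ /a.*c/) ? 1 : 0;   # .* spans any run of characters

print "tab=$tab space=$space newline=$newline any=$any\n";
```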

Escaping

But I want to use these symbols in my regex!?!

To use a *, +, ? or . in the pattern as an ordinary character rather than a metacharacter, 'escape' it with a backslash.
    /C\. elegans/ => C. elegans only
    /C. elegans/  => 'C' followed by any character, then ' elegans': Ca elegans, Cb elegans, C3 elegans, C> elegans, C. elegans, etc.

The 'delimiter' of the regex, forward slash /, and the 'escape' character, backslash \, are also metacharacters. These need to be escaped if required in the regex.

Important when trying to match URLs and email addresses.
    /joe\.bloggs\@darwin\.co\.uk/
    /www\.envgen\.nox\.ac\.uk\/biolinux\.html/
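
The effect of escaping can be seen with a deliberately wrong species name. This is a sketch with invented test strings; the last pattern uses m{...}, a standard Perl alternative to / / delimiters that is not covered in these slides, so the forward slash in the URL needs no backslash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = 'C. elegans';
my $typo = 'Cx elegans';    # hypothetical mistyped name

my $escaped_ok   = ($name =~ /C\. elegans/) ? 1 : 0;   # literal full stop
my $escaped_typo = ($typo =~ /C\. elegans/) ? 1 : 0;   # 'x' is not a full stop
my $loose_typo   = ($typo =~ /C. elegans/)  ? 1 : 0;   # unescaped dot accepts 'x'

my $email    = 'joe.bloggs@darwin.co.uk';
my $email_ok = ($email =~ /joe\.bloggs\@darwin\.co\.uk/) ? 1 : 0;

my $url    = 'www.envgen.nox.ac.uk/biolinux.html';
my $url_ok = ($url =~ m{www\.envgen\.nox\.ac\.uk/biolinux\.html}) ? 1 : 0;

print "escaped_ok=$escaped_ok escaped_typo=$escaped_typo loose_typo=$loose_typo\n";
print "email_ok=$email_ok url_ok=$url_ok\n";
```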

Using metacharacters

The file nemaglobins.embl contains 21 EMBL database entries that contain a globin protein within their sequence.

try:

    #!/usr/bin/perl
    $count = 0;
    open IN, "<nemaglobins.embl" or die;
    while (<IN>) {
        if (/AC   .*/) {    #that's three spaces
            print;
            $count++;
        }
    }
    print "total=$count\n";

Grouping Patterns

Patterns can be grouped in parentheses (). This is useful when coupled with quantifiers.
    /elegans+/     => eleganssssssssssssss (only the final 's' repeats)
    /(elegans)+/   => eleganselegans...elegans (the whole word repeats 1, 2, ... n times)
    /eleg(ans){4}/ => elegansansansans ('ans' repeats 4 times)
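
A small sketch of how parentheses change what a quantifier applies to. The patterns are anchored with ^ and $ (start and end of string) so that the whole string has to match; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Without parentheses the + applies to the final 's' only.
my $s_repeat    = ('eleganssss'     =~ /^elegans+$/)   ? 1 : 0;
my $word_no_grp = ('eleganselegans' =~ /^elegans+$/)   ? 1 : 0;

# With parentheses the + applies to the whole word.
my $word_grp    = ('eleganselegans' =~ /^(elegans)+$/) ? 1 : 0;

# A counted quantifier on a group: 'ans' exactly four times.
my $ans_four    = ('elegansansansans' =~ /^eleg(ans){4}$/) ? 1 : 0;

print "s_repeat=$s_repeat word_no_grp=$word_no_grp word_grp=$word_grp ans_four=$ans_four\n";
```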

Alternatives

Want either this pattern or that pattern. Two ways:

1.) the vertical bar '|': either the left side matches or the right side matches
    /(human|mouse|rat)/ => any string with human or mouse or rat.
Combine with previous examples:
    /Fugu( |\t)+rubripes/ matches if Fugu and rubripes are separated by any mixture of spaces and tabs

2.) a character class is a list of characters within '[]'. It will match any single character within the class.
    /[wxyz1234\t]/ => any of the nine.
A range can be specified with '-'
    /[w-z1-4\t]/ => as above
To match a literal hyphen, put it first in the class (last, or escaped as \-, also works)
    /[-a-zA-Z]/ => any letter character or a hyphen
Negate a class by starting it with '^'
    /[^z]/ => any character except z
    /[^abc]/ => any character except a or b or c
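
Both kinds of alternative can be tried on the same data. This is a sketch under the assumption that the separator between genus and species may be spaces, tabs or both; the test strings are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inputs: one space, a space-tab-space mixture, no separator.
my @names = ("Fugu rubripes", "Fugu \t rubripes", "Fugurubripes");

# Vertical bar: a space OR a tab, one or more times.
my @bar   = map { $_ =~ /Fugu( |\t)+rubripes/ ? 1 : 0 } @names;

# Character class: the same idea written as a class.
my @class = map { $_ =~ /Fugu[ \t]+rubripes/ ? 1 : 0 } @names;

my $mammal  = ('house mouse' =~ /(human|mouse|rat)/) ? 1 : 0;
my $negated = ('zzz'         =~ /[^z]/)              ? 1 : 0;   # nothing but z
my $hyphen  = ('C-terminal'  =~ /^[-a-zA-Z]+$/)      ? 1 : 0;   # letters and '-'

print "bar=@bar class=@class mammal=$mammal negated=$negated hyphen=$hyphen\n";
```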

Other Shortcuts

\d => any digit [0-9]
\w => any word character [A-Za-z0-9_]
\s => any white space [\t\n\r\f ]
\D => any character except a digit [^\d]
\W => any character except a word character [^\w]
\S => any character except a white space [^\s]

Can use any of these in conjunction with quantifiers:
    /\s*/ => any amount of white space
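
A few quick checks of the shortcut classes, as a sketch with invented strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $two_words = ('Homo sapiens' =~ /\w+\s+\w+/) ? 1 : 0;   # word, space, word
my $no_space  = ('Homo_sapiens' =~ /\s/)        ? 1 : 0;   # '_' is a word character
my $has_digit = ('exon 12'      =~ /\d+/)       ? 1 : 0;
my $all_word  = ('ACGT'         =~ /\W/)        ? 1 : 0;   # no non-word character
my $non_word  = ('AC-GT'        =~ /\W/)        ? 1 : 0;   # '-' is a non-word character
my $letter_no = ('chr21'        =~ /\D\d/)      ? 1 : 0;   # non-digit then digit: 'r2'

print "two_words=$two_words no_space=$no_space has_digit=$has_digit\n";
print "all_word=$all_word non_word=$non_word letter_no=$letter_no\n";
```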

Using alternatives to find a hydrophobic region...

try:

    open IN, "< nippo_sigpept.fsa" or die;
    while (<IN>) {
        if (/>/) {                          #a header line
            $count++;                       #keep running total of sequence number
        }
        else {                              #not a header
            if (/[VILMFWCA]{8,}/) {
                $match++;
            }
        }
    }
    print "Hydrophobic region found in $match sequences from $count\n";

Could also have used /(V|I|L|M|F|W|C|A){8,}/

Binding Operator

Revisited: so far, every match has been against $_.
The binding operator =~ matches the pattern on the right against the string on the left.
Usually add the m operator (optional).

    $sumthing = 'Ascaris suum is a nematode';
    if ($sumthing =~ m/suum.*nematode/) {
        print "this organism infects pigs!\n";
    }

Anchors

/pattern/ will match anywhere in the string. Use anchors to hold the pattern to a point in the string.
caret ^ (shift 6) marks the beginning of the string while dollar $ marks the end of the string.
    /^elegans/ => elegans only at start of string. Not C. elegans.
    /Canis$/   => Canis only at end of string. Not Canis lupus.
    /^\s*$/    => a blank line.
$ ignores a trailing newline character \n, so it still matches at the end of a line that has not been chomped.
N.B. compare the use of ^ as an anchor with its use in the character class.
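
The anchor examples can be verified directly. A minimal sketch with made-up strings, including one that still carries its trailing newline:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $start_yes = ('elegans is a worm' =~ /^elegans/) ? 1 : 0;
my $start_no  = ('C. elegans'        =~ /^elegans/) ? 1 : 0;   # not at the start

my $end_yes   = ('Canis'       =~ /Canis$/) ? 1 : 0;
my $end_nl    = ("Canis\n"     =~ /Canis$/) ? 1 : 0;   # $ matches before the final \n
my $end_no    = ('Canis lupus' =~ /Canis$/) ? 1 : 0;   # not at the end

my $blank     = ("  \t\n" =~ /^\s*$/) ? 1 : 0;         # only white space
my $not_blank = (" x \n"  =~ /^\s*$/) ? 1 : 0;

print "start: $start_yes $start_no  end: $end_yes $end_nl $end_no  blank: $blank $not_blank\n";
```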

Anchors (2): Word Boundary

\b matches the start or end of a word.
    /\bmus\b/ would match mus but not musculus
    /la\b/    => Drosophila but not Plasmodium
    /\btes/   => Comamonas testosteroni but not Pan troglodytes
\b ignores the newline character.
Be careful with full stops: they're characters too!
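
A sketch of the word-boundary examples in Perl. The last check illustrates the warning about full stops: a full stop is not a word character, so a boundary exists between it and the letters on either side. Strings other than those on the slide are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $mus       = ('mus'      =~ /\bmus\b/) ? 1 : 0;
my $musculus  = ('musculus' =~ /\bmus\b/) ? 1 : 0;   # no boundary after 'mus'

my $fly       = ('Drosophila' =~ /la\b/) ? 1 : 0;    # 'la' ends the word
my $parasite  = ('Plasmodium' =~ /la\b/) ? 1 : 0;    # 'la' is followed by 's'

my $bacterium = ('Comamonas testosteroni' =~ /\btes/) ? 1 : 0;   # 'tes' starts a word
my $chimp     = ('Pan troglodytes'        =~ /\btes/) ? 1 : 0;   # 'tes' is mid-word

my $full_stop = ('C.elegans' =~ /\belegans/) ? 1 : 0;   # boundary right after '.'

print "$mus $musculus $fly $parasite $bacterium $chimp $full_stop\n";
```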

Memory Variables

Able to extract sections of the pattern match and store them in a variable. Anything matched inside parentheses () is written into a special variable. The first instance is $1, the second $2, the fourth $4 and so on.

Extract from file:

    Organism: Homo sapiens...

Extract from Perl script:

    while ($line = <IN>) {
        if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
            $genus = $1;      #stores Homo
            $species = $2;    #stores sapiens
        }
    }

Note that the + sits inside the parentheses: (\w+) stores the whole word, whereas (\w)+ would store only its last character.
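
A self-contained sketch of capturing with parentheses, using a single hard-coded line instead of a file. It also shows why the quantifier belongs inside the parentheses: a repeated group keeps only its last repetition. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Organism: Homo sapiens\n";   # stand-in for a line read from a file

my ($genus, $species);
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    $genus   = $1;    # first pair of parentheses
    $species = $2;    # second pair of parentheses
}

# In list context a match returns the captured strings directly.
my ($genus2, $species2) = $line =~ m/Organism:\s(\w+)\s(\w+)/;

# Quantifier outside the parentheses: only the last character is kept.
my $last_only;
if ('Homo' =~ /^(\w)+$/) {
    $last_only = $1;
}

print "genus=$genus species=$species last_only=$last_only\n";
```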

Substitutions

Able to replace a pattern within a string with another string. Use the s operator:
    s/abc/xyz/ => find abc and replace with xyz
By default only the first instance of a match is replaced. Using the 'g' modifier (global) will find and replace all instances.

    $line = 'abccdcbabc';
    $line =~ s/abc/xyz/g;
    print $line;    #produces xyzcdcbxyz

1. Run dna2rna.pl
2. Now look at dna2rna.pl
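
The difference the 'g' modifier makes, as a small sketch. The variable names are invented; the final lines rely on the fact that s/// returns the number of substitutions it made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'abccdcbabc';

my $once = $text;
$once =~ s/abc/xyz/;      # first match only

my $all = $text;
$all =~ s/abc/xyz/g;      # every match

my $tmp = $text;
my $n = ($tmp =~ s/abc/xyz/g);   # number of replacements made

print "once=$once all=$all n=$n\n";
```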

dna2rna.pl

    #!/usr/bin/perl
    print "Enter DNA sequence\n";
    while ($line = <STDIN>) {
        chomp $line;                      #remove trailing \n
        if ($line =~ m/[^agct]/i) {       #case insensitivity is set by the 'i' modifier
            print "your sequence contained an invalid nucleotide: $&\nplease try again\n";
            #'$&' is a special variable which stores what the
            #regular expression matched. Don't worry about it for now.
        }
        else {
            $line =~ s/t/u/g;             #replace all lower case 't'
            $line =~ s/T/U/g;             #replace all upper case 'T'
            print "The RNA sequence is:\n$line\n";
        }
        print "Try again or ctrl C to quit\n";
    }

EMBL file revisited

Using shortcuts and anchors to help make the pattern more robust:

    if (/AC   .*/) {    #that's three spaces

can be rewritten as:

    if (/^AC\s{3}(.*)\n$/) {    #more certain to return what you want
        $accession = $1;        #now have info stored to use later.
    }
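
A sketch of the anchored pattern applied to one hard-coded AC line. The line content is hypothetical (it borrows the placeholder accession AC00000 and assumes the usual trailing semi-colon), so a real nemaglobins.embl entry may look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "AC   AC00000;\n";   # hypothetical AC line, newline still attached

my $accession;
if ($line =~ /^AC\s{3}(.*)\n$/) {
    $accession = $1;            # everything between the three spaces and the newline
}

# A line that merely contains 'AC   ' later on is rejected by the ^ anchor.
my $false_hit = ("XX AC   something\n" =~ /^AC\s{3}(.*)\n$/) ? 1 : 0;

print "accession=$accession false_hit=$false_hit\n";
```

Note that (.*) captures the trailing semi-colon as well.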

Now It's Your Turn :o)

nemaglobins.embl contains entries for complete cds of nematode sequences.
For each entry print the ACcession, OrganiSm name and AGCT content of the SeQuence.

Output should read:
    Accession: AC00000 <tab> Species: Toxocara canis <newline>
    A: 34 G: 65 C: 24 T: 75 <newline><newline>

Hints:
The lines of interest are AC, OS, and SQ.
Three regular expressions, one for each query.
Use a series of if and elsif statements to search for the regular expressions. Print when matched.
Bonus point: remove the semi-colon from the accession id.
Shout if you need help.