Regular Expression. n What is Regex? n Meta characters. n Pattern matching. n Functions in re module. n Usage of regex object. n String substitution

Similar documents

Lecture 18 Regular Expressions

Regular Expressions. Chapter 11. Python for Informatics: Exploring Information

University Convocation. IT 3203 Introduction to Web Development. Pattern Matching. Why Match Patterns? The Search Method. The Replace Method

Python: Regular Expressions

Programming Languages CIS 443

Regular Expressions Overview Suppose you needed to find a specific IPv4 address in a bunch of files? This is easy to do; you just specify the IP

Regular Expressions for Natural Language Processing

Content of this lecture. Regular Expressions in Java. Hello, world! In Java. Programming in Java

Chapter 2 Writing Simple Programs

Mastering Python Regular Expressions

Exploring Regular Expression Usage and Context in Python

Compiler Construction

6.170 Tutorial 3 - Ruby Basics

Regular Expression Syntax

Lecture 4. Regular Expressions grep and sed intro

Regular Expressions for Perl, C, PHP, Python, Java, and.net. Regular Expression. Pocket Reference. Tony Stubblebine

Python Lists and Loops

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share.

Viewing Web Pages In Python. Charles Severance -

Kiwi Log Viewer. A Freeware Log Viewer for Windows. by SolarWinds, Inc.

ESCI 386 Scientific Programming, Analysis and Visualization with Python. Lesson 5 Program Control

CSE 341 Lecture 28. Regular expressions. slides created by Marty Stepp

Lexical analysis FORMAL LANGUAGES AND COMPILERS. Floriano Scioscia. Formal Languages and Compilers A.Y. 2015/2016

public static void main(string[] args) { System.out.println("hello, world"); } }

Introduction to Java Applications Pearson Education, Inc. All rights reserved.

Introduction to Python

Number Representation

Computational Mathematics with Python

Web Design and Development ACS Chapter 7. Working with Links

Conditionals (with solutions)

Interspire Website Publisher Developer Documentation. Template Customization Guide

Introduction to Searching with Regular Expressions

NiCE Log File Management Pack. for. System Center Operations Manager Quick Start Guide

Computational Mathematics with Python

awk A UNIX tool to manipulate and generate formatted data

Being Regular with Regular Expressions. John Garmany Session

CS106A, Stanford Handout #38. Strings and Chars

Informatica e Sistemi in Tempo Reale

Lecture 2, Introduction to Python. Python Programming Language

United States Naval Academy Electrical and Computer Engineering Department. EC262 Exam 1

Chapter 3 Writing Simple Programs. What Is Programming? Internet. Witin the web server we set lots and lots of requests which we need to respond to

Pemrograman Dasar. Basic Elements Of Java

The Clean programming language. Group 25, Jingui Li, Daren Tuzi

Grinder Webtest Documentation

The C Programming Language course syllabus associate level

Regular Expressions and Pattern Matching

ANNEXURE - I. Back-office Interfaces

Regular Expressions. General Concepts About Regular Expressions

Analog Documentation. Release Fabian Büchler

Introduction to Java. CS 3: Computer Programming in Java

Introduction to Java

Topics. Parts of a Java Program. Topics (2) CS 146. Introduction To Computers And Java Chapter Objectives To understand:

Some Scanner Class Methods

SPSS INSTRUCTION CHAPTER 1

JAVA - QUICK GUIDE. Java SE is freely available from the link Download Java. So you download a version based on your operating system.

Writing Functions in Scheme. Writing Functions in Scheme. Checking My Answer: Empty List. Checking My Answer: Empty List

Introduction to: Computers & Programming: Input and Output (IO)

Why is Internal Audit so Hard?

Python. KS3 Programming Workbook. Name. ICT Teacher Form. Do you speak Parseltongue?

User-ID Features. PAN-OS New Features Guide Version 6.0. Copyright Palo Alto Networks

Problems and Measures Regarding Waste 1 Management and 3R Era of public health improvement Situation subsequent to the Meiji Restoration

Regular Expressions. The Complete Tutorial. Jan Goyvaerts

A Python Tour: Just a Brief Introduction CS 303e: Elements of Computers and Programming

ChildFreq: An Online Tool to Explore Word Frequencies in Child Language

CS177 MIDTERM 2 PRACTICE EXAM SOLUTION. Name: Student ID:

Numeral Systems. The number twenty-five can be represented in many ways: Decimal system (base 10): 25 Roman numerals:

pyownet Documentation

monoseq Documentation

CS 1133, LAB 2: FUNCTIONS AND TESTING

Hands-On Python A Tutorial Introduction for Beginners Python 3.1 Version

Apache Cassandra Query Language (CQL)

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T)

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

Auto Generate Purchase Orders from S/O Entry SO-1489

Using PRX to Search and Replace Patterns in Text Strings

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

How To Port A Program To Dynamic C (C) (C-Based) (Program) (For A Non Portable Program) (Un Portable) (Permanent) (Non Portable) C-Based (Programs) (Powerpoint)

Exercise 4 Learning Python language fundamentals

1.1 AppleScript Basics

Interfacing the ABS Software. With An. MRP or ERP System

[1] Learned how to set up our computer for scripting with python et al. [3] Solved a simple data logistics problem using natural language/pseudocode.

Python for Economists

Obfuscation: know your enemy

Introduction to Apache Pig Indexing and Search

Computers. An Introduction to Programming with Python. Programming Languages. Programs and Programming. CCHSG Visit June Dr.-Ing.

Compilers Lexical Analysis

Slides from INF3331 lectures - web programming in Python

Time Clock Import Setup & Use

Introduction Header Body content Todo List. Analysis. Joe Huang Anti-Spam Team Cellopoint. February 21, Joe Huang Analysis 1 / 62

Computer Programming Tutorial

Programming a mathematical formula. INF1100 Lectures, Chapter 1: Computing with Formulas. How to write and run the program.

03 - Lexical Analysis

Compilers. Introduction to Compilers. Lecture 1. Spring term. Mick O Donnell: michael.odonnell@uam.es Alfonso Ortega: alfonso.ortega@uam.

BACKSCATTER PROTECTION AGENT Version 1.1 documentation

Transcription:

Regular Expression n What is Regex? n Meta characters n Pattern matching n Functions in re module n Usage of regex object n String substitution

What is regular expression? n String search n (e.g.) Find integer/float numbers in the given string given below. 123ABC D123 4 5.6 789 0.12.3 D.E024 n Regular Expression (i.e., Regex) n Define search, extraction, substitution of string patterns n Python supports own regex module - re

Meta characters n Meta characters? n Control characters that describe string patterns, rather than strings themselves n Repeating meta characters character meaning e.g. * 0 or more ca*t è ct, cat, caat, caaaaaat + 1 or more ca+t è cat, caat, caaaat? 0 or 1 ca?t è ct, cat {m} m times ca{2} è caa {m, n} At least m, at most n ca{2,4} è caat, caaat, caaaat

Matching meta characters Meta character meaning. Match all characters except the new line character re.dotall can match all characters. ^ 1. Matches the first character of the class 2. Matches beginning of each line in re.multiline 3. Used for complementing the set if used in [] $ 1. Matches the last character of the class 2. Used as $ character itself if used in [] [] Indicates a selective character set A B means A or B. () Used for grouping regex

Match n match(pattern, string [, flags]) n n (e.g.) Return: if matched, return matched object else, return None >>> import re >>> re.match( [0-9], 1234 ) <SRE_Match object at 00A03330> >>> re.match( [0-9], abc ) >>>

Match (more examples) >>> m = re.match( [0-9], 1234 ) >>> m.group() 1 >>> m = re.match( [0-9]+, 1234 ) >>> m.group() 1234 >>> m = re.match( [0-9]+, 1234 ) >>> m.group() 1234 >>> re.match( [0-9]+, 1234 ) >>> re.match(r [ \t]*[0-9]+, 123 ) <_sre.sre_match object at.>

(pre-defined) special symbols symbol meaning equivalence \\ Backslash (i.e., \ ) itself \d Matches any decimal digit [0-9] \D Matches any non-digit characters [^0-9] \s Matches any whitespace character [ \t\n\r\f\v] \S Matches any non-whitespace character [^ t\n\r\f\v] \w Matches any alphanumeric character [a-za-z0-9_] \W Matches any non-alphanumeric character [^a-za-z0-9_] \b Indicates boundary of a string \B Indicates no boundary of a string

(e.g) special symbols >>> m = re.match( \s*\d+, 123 ) >>> m.group() 123 >>> m = re.match( \s*(\d+), 1234 ) >>> m.group(1) 1234

Search n Match vs. Search n n Match from the beginning of a string Search partial string match in any position n Search(pattern, string [, flags]) n (e.g) >>> re.search( \d+, 1034 ).group() 1034 >>> re.search( \d+, abc123 ).group() 123

Regex in raw mode >>> \d \\d >>> \s \\s >>> \b \x08 >>> \\b \\b >>> r \b \\b

>>> \\ \\ >>> r \\ \\\\ >>> re.search( \\\\, \ ).group() \\ >>> re.search(r \\, \ ).group() \\

Greedy/non-greedy matching n Non-greedy matching Symbol Meaning *? * with non-greedy matching +? + with non-greedy matching??? With non-greedy matching {m, n}? {m, n} with non-greedy matching

Greedy/non-greedy (e.g.) >>> s = <a href= index.html >HERE</a><font size= 10 > >>> re.search(r href= (.*), s).group(1) index.html >HERE</a><font size= 10 >>> re.search(r href= (.*?), s).group(1) index.html

Extracting matched string n Mathods in match() object Method Return group() Matched strings start() end() span() Starting position of matched string Ending position of matched string Tuple of (starting, ending) of matched string

Extracting matched string (cont.) n Group()-related methods Method group() return whole strings matched group(n) N-th matched string. group(0) is equivalent to group() groups() All strings matched in a tuple format

E.g., Group() >>> m = re.match(r (\d+)(\w+)(\d+), 123abc456 ) >>> m.group() 123abc456 >>> m.group(0) >>> m.group() 123abc456 ( 123, abc45, 6 ) >>> m.group(1) >>> m.start() 123 0 >>> m.group(2) >>> m.end() abc45 9 >>> m.group(3) >>>m.span() 6 (0, 9)

E.g. group() (cont.) >>> m = re.match( (a(b(c)) (d)), abcd ) >>> m.group(0) abcd >>> m.group(1) abcd >>> m.group(2) bc >>> m.group(3) c >>> m.group(4) d

Group name n Pattern: (?P<name>regex) n Name: group name n Regex: regular expression n (e.g.) >>> m = re.match(r (?P<var>[_a-zA-Z]\w*)\s* =\s*(?p<num>\d+), a = 123 ) >>> m.group( var ) a >>> m.group( num ) 123 >>> m.group() a = 1 2 3

Methods in re module Method compile(pattern [, flags]) search(pattern, string [, flags]) match(patterns, sting [, flags]) split(pattern, string [, maxsplit=0]) findall(pattern, string) sub(pattern, repl, string [, count=0]) function compile search match from the begining split extracting all strings matched substitute pattern with repl subn(pattern,repl, string [, count=0]) substitute with count times escape(string) add backslash to nonalphanumeric strings

Re.split(pattern, string [, maxsplit=0]) >>> re.split( \W+, apple, orange and spam. ) [ apple, orange, and, spam, ] >>> re.split( (\W+), apple, orange and spam. ) [ apple,,, orange,, and,, spam,., ] >>> re.split( \W}, apple, orange and spam., 1) [ apple, orange and spam. ] >>> re.split( \W}, apple, orange and spam., 2) [ apple, orange, and spam. ] >>> re.split( \W}, apple, orange and spam., 3) [ apple, orange, and, spam. ]

findall(pattern, string) >>> re.findall(r [_a-za-z]\w*, 123 abc 123 def ) [ abc, def ] >>> s = <a href= link1.html >link1</a> <a href= link2.html >link2</a> <a href= link3.html >link3</a> >>> re.findall(r herf= (.*?), s) [ link1.html, link2.html, link3.html ]

compile(patern [, flags]) n Returns a regex object that is compiled n Useful for repeated many times n (e.g) >>> p = re.compile(r ([_a-za-z]\w*\s*=\s*(\d+) ) >>> m = p.match( a = 123 ) >>> m.groups() ( a, 123 ) >>> m.group(1) a >>> m.group(2) 123

Flags for regex object flag meaning I, IGNORECASE Ignore case of characters L, LOCALE \w, \W, \b, \B are affected by current locale M, MULTILINE ^ and $ are applied to the beginning and end of each string S, DOTALL. Matches new line character (i.e., \n) as well U, UNICODE \w, \W, \b, \B are processed according to unicode X, VERBOSE Blanks inside regex are ignored. Backslash(\) needs to be used to consider blanks. # can be used for comments

E.g. >>> import re >>> p = re.compile( the, re.i) >>> p.findall( The cat was hungry. They were scared because of the cat ) [ The, The, the ]

Methods of Regex object Method search(string [, pos [, endpos]]) match(string [, pos [, endpos]]) split(string [, maxsplit = 0]) findall(string) sub(repl, string [, count =0]) subn(repl, string [, count = 0]) meaning search match from the begining re.split() re.findall() re.sub() re.subn()

Search(string [, pos [, endpos]]) >>> t = My ham, my egg, my spam >>> p = re.compiile( (my), re.i) >>> pos = 0 >>> while 1: m = p.search(t, pos) if m: print m.group(1) pos = m.end() else: break My my my

Sub(repl, string [, count = 0]) >>> re.sub(r [.,:;],, a:b;c, d. ) abc d >>> re.sub(r \W,, a:b:c, d. ) abcd

{\n}, {g\<n>}: substitute with string of group n >>> p = re.compile( section{ ( [^}]* ) }, re.verbose) >>> p.sub(r subsection{\1}, section{first} section{second} ) subsection{first} subsection{second}

{\g<name>}: substitute with group name >>> p = re.compile( section{ (?P<name> [^}]* ) }, re.verbose) >>> p.sub(r subsection{\g<name>}, section{first} ) subsection{first}

Substitute with function >>> def hexrepl( match): value = int (match.group()) return hex(value) >>> p = re.compile(r \d+ ) >>> p.sub(hexrepl, Call 65490 for printing, 49152 for user code. ) Call 0xffd2 for printing, 0xc000 for user code.