Regular Expressions. Chapter 11. Python for Informatics: Exploring Information www.pythonlearn.com


Regular Expressions
In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.
http://en.wikipedia.org/wiki/regular_expression

Regular Expressions
Really clever wild card expressions for matching and parsing strings
http://en.wikipedia.org/wiki/regular_expression

Really smart "Find" or "Search"

Understanding Regular Expressions
Very powerful and quite cryptic
Fun once you understand them
Regular expressions are a language unto themselves
A language of marker characters - programming with characters
It is kind of an old school language - compact

http://xkcd.com/208/

Regular Expression Quick Guide
^         Matches the beginning of a line
$         Matches the end of the line
.         Matches any character
\s        Matches whitespace
\S        Matches any non-whitespace character
*         Repeats a character zero or more times
*?        Repeats a character zero or more times (non-greedy)
+         Repeats a character one or more times
+?        Repeats a character one or more times (non-greedy)
[aeiou]   Matches a single character in the listed set
[^XYZ]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range
(         Indicates where string extraction is to start
)         Indicates where string extraction is to end
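A few entries from the quick guide can be tried directly. The following is a minimal sketch, not part of the original slides; the sample string is borrowed from the mail headers used later in the chapter, and the variable names are chosen only for illustration.

```python
import re

sample = 'X-DSPAM-Confidence: 0.8475'

# ^ anchors the match at the beginning of the line.
starts_with_x = re.search(r'^X', sample) is not None

# $ anchors the match at the end of the line.
ends_with_digit = re.search(r'[0-9]$', sample) is not None

# [^...] matches one character that is NOT in the listed set;
# here: runs of characters that are neither digits nor periods.
non_numeric_runs = re.findall(r'[^0-9.]+', sample)

# \S+ matches a run of one or more non-whitespace characters.
tokens = re.findall(r'\S+', sample)

print(starts_with_x)      # True
print(ends_with_digit)    # True
print(non_numeric_runs)   # ['X-DSPAM-Confidence: ']
print(tokens)             # ['X-DSPAM-Confidence:', '0.8475']
```

Each print() call takes a single argument, so the sketch should behave the same under Python 2 and Python 3.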

The Regular Expression Module
Before you can use regular expressions in your program, you must import the library using "import re"
You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
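As a rough illustration of the two comparisons made here, the sketch below puts re.search() next to find(), and re.findall() next to slicing. The sample line and the slice offset are assumptions made for this example only.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# re.search() answers "does the pattern occur in the string?",
# much like checking the result of find().
has_from = re.search(r'From:', line) is not None
same_answer = line.find('From:') >= 0

# re.findall() hands back the matching text itself.
addresses = re.findall(r'\S+@\S+', line)

# The slicing equivalent needs a hard-coded position:
# 'From: ' is six characters long, so the address starts at index 6.
by_slicing = line[6:]

print(has_from == same_answer)   # True
print(addresses)                 # ['stephen.marquard@uct.ac.za']
print(by_slicing)                # stephen.marquard@uct.ac.za
```

The regular expression keeps working if the address moves within the line, whereas the slice depends on the fixed offset.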

Using re.search() like find()

    # Using the string method find()
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if line.find('From:') >= 0:
            print line

    # Using re.search()
    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line) :
            print line

Using re.search() like startswith()

    # Using the string method startswith()
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if line.startswith('From:') :
            print line

    # Using re.search() with ^ to anchor the match at the start of the line
    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line) :
            print line

We fine-tune what is matched by adding special characters to the string
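The effect of the ^ character can be seen without the mbox-short.txt file. The sketch below uses a small hypothetical list of lines (invented for this example) in which one line contains 'From:' in the middle rather than at the start.

```python
import re

# Hypothetical sample lines standing in for the contents of mbox-short.txt.
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Subject: note copied From: the list',
    'X-DSPAM-Result: Innocent',
]

# Without ^ the pattern may match anywhere in the line, like find().
anywhere = [ln for ln in lines if re.search(r'From:', ln)]

# With ^ the pattern must match at the start of the line, like startswith().
at_start = [ln for ln in lines if re.search(r'^From:', ln)]
with_startswith = [ln for ln in lines if ln.startswith('From:')]

print(len(anywhere))                 # 2
print(at_start == with_startswith)   # True
```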

Wild-Card Characters
The dot character matches any character
If you add the asterisk character, the preceding character is matched any number of times

    X-Sieve: CMU Sieve 2.3
    X-DSPAM-Result: Innocent
    X-DSPAM-Confidence: 0.8475
    X-Content-Type-Message-Body: text/plain

    ^X.*:
    ^    Match the start of the line
    .    Match any character
    *    Many times

Fine-Tuning Your Match
Depending on how clean your data is and the purpose of your application, you may want to narrow your match down a bit

    X-Sieve: CMU Sieve 2.3
    X-DSPAM-Result: Innocent
    X-Plane is behind schedule: two weeks

    ^X.*:
    ^    Match the start of the line
    .    Match any character
    *    Many times

    ^X-\S+:
    ^    Match the start of the line
    \S   Match any non-whitespace character
    +    One or more times

The first pattern also matches the X-Plane line, because .* accepts the spaces before the colon; the second pattern does not, because \S+ stops at the first space.
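The difference between the loose pattern ^X.*: and the narrower ^X-\S+: can be checked against the three sample lines from the slide. This is a small sketch; the list and variable names are introduced here for illustration.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
]

# .* accepts any characters, including spaces, before the colon.
loose = [ln for ln in lines if re.search(r'^X.*:', ln)]

# \S+ accepts only non-whitespace characters between 'X-' and the colon.
strict = [ln for ln in lines if re.search(r'^X-\S+:', ln)]

print(len(loose))    # 3
print(strict)        # ['X-Sieve: CMU Sieve 2.3', 'X-DSPAM-Result: Innocent']
```

The X-Plane line drops out of the second result because a space appears before its first colon.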

Matching and Extracting Data
re.search() returns a match object (which tests as True) or None (which tests as False), depending on whether the string matches the regular expression
If we actually want the matching strings to be extracted, we use re.findall()

    [0-9]+    One or more digits

    >>> import re
    >>> x = 'My 2 favorite numbers are 19 and 42'
    >>> y = re.findall('[0-9]+',x)
    >>> print y
    ['2', '19', '42']

Matching and Extracting Data
When we use re.findall(), it returns a list of zero or more sub-strings that match the regular expression

    >>> import re
    >>> x = 'My 2 favorite numbers are 19 and 42'
    >>> y = re.findall('[0-9]+',x)
    >>> print y
    ['2', '19', '42']
    >>> y = re.findall('[AEIOU]+',x)
    >>> print y
    []
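Because re.findall() returns strings, a further step is needed before doing arithmetic on the results. The sketch below, which goes slightly beyond the slide, also shows that re.search() yields a match object (or None) whose group() method gives the first matching text. The variable names are illustrative.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# re.search() returns a match object, or None when nothing matches.
m = re.search(r'[0-9]+', x)
first = m.group() if m else None

# re.findall() returns every match as a string; convert before adding.
numbers = [int(s) for s in re.findall(r'[0-9]+', x)]
total = sum(numbers)

# No uppercase vowels occur in x, so the list is empty.
vowels = re.findall(r'[AEIOU]+', x)

print(first)     # 2
print(numbers)   # [2, 19, 42]
print(total)     # 63
print(vowels)    # []
```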

Warning: Greedy Matching
The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string

    >>> import re
    >>> x = 'From: Using the : character'
    >>> y = re.findall('^F.+:', x)
    >>> print y
    ['From: Using the :']

    ^F.+:
    F     First character in the match is an F
    .+    One or more characters
    :     Last character in the match is a :

Why not 'From:'?

Non-Greedy Matching
Not all regular expression repeat codes are greedy! If you add a ? character, the + and * chill out a bit...

    >>> import re
    >>> x = 'From: Using the : character'
    >>> y = re.findall('^F.+?:', x)
    >>> print y
    ['From:']

    ^F.+?:
    F     First character in the match is an F
    .+?   One or more characters but not greedy
    :     Last character in the match is a :
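Running the greedy and non-greedy patterns side by side on the same string makes the difference concrete. This sketch reuses the string from the slide; the variable names are illustrative only.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST colon.
greedy = re.findall(r'^F.+:', x)

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST colon.
lazy = re.findall(r'^F.+?:', x)

print(greedy)   # ['From: Using the :']
print(lazy)     # ['From:']
```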

Fine-Tuning String Extraction
You can refine the match for re.findall() and separately determine which portion of the match is to be extracted by using parentheses

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

    >>> import re
    >>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
    >>> y = re.findall('\S+@\S+',x)
    >>> print y
    ['stephen.marquard@uct.ac.za']

    \S+@\S+
    \S+   At least one non-whitespace character

Fine-Tuning String Extraction
Parentheses are not part of the match - but they tell where to start and stop what string to extract

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

    >>> y = re.findall('\S+@\S+',x)
    >>> print y
    ['stephen.marquard@uct.ac.za']
    >>> y = re.findall('^From (\S+@\S+)',x)
    >>> print y
    ['stephen.marquard@uct.ac.za']

    ^From (\S+@\S+)
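To see what the parentheses change, the sketch below runs the same pattern with and without them. The last call, with two pairs of parentheses, is an addition not shown on the slide: in that case re.findall() returns a list of tuples, one item per pair.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: the whole match is returned, including 'From '.
whole = re.findall(r'^From \S+@\S+', x)

# Parentheses: 'From ' must still match, but only the group is returned.
address = re.findall(r'^From (\S+@\S+)', x)

# Two groups: each result is a tuple with one item per group.
parts = re.findall(r'^From (\S+)@(\S+)', x)

print(whole)     # ['From stephen.marquard@uct.ac.za']
print(address)   # ['stephen.marquard@uct.ac.za']
print(parts)     # [('stephen.marquard', 'uct.ac.za')]
```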

Extracting a host name - using find and string slicing

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
    (the at sign is at position 21, the following space at position 31)

    >>> data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
    >>> atpos = data.find('@')
    >>> print atpos
    21
    >>> sppos = data.find(' ',atpos)
    >>> print sppos
    31
    >>> host = data[atpos+1 : sppos]
    >>> print host
    uct.ac.za

The Double Split Pattern
Sometimes we split a line one way, and then grab one of the pieces of the line and split that piece again

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

    line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
    words = line.split()
    email = words[1]              # stephen.marquard@uct.ac.za
    pieces = email.split('@')     # ['stephen.marquard', 'uct.ac.za']
    print pieces[1]               # uct.ac.za

The Regex Version

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

    import re
    lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
    y = re.findall('@([^ ]*)',lin)
    print y

    ['uct.ac.za']

    '@([^ ]*)'
    @       Look through the string until you find an at sign
    [^ ]    Match non-blank character
    *       Match many of them
    ( )     Extract the non-blank characters

Even Cooler Regex Version

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

    import re
    lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
    y = re.findall('^From .*@([^ ]*)',lin)
    print y

    ['uct.ac.za']

    '^From .*@([^ ]*)'
    ^From   Starting at the beginning of the line, look for the string 'From '
    .*@     Skip a bunch of characters, looking for an at sign
    (       Start extracting
    [^ ]    Match non-blank character
    *       Match many of them
    )       Stop extracting
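The three approaches to pulling out the host name (find plus slicing, the double split, and the regular expression) can be wrapped in functions and compared on the same line. This is a sketch only; the function names and the second test line are hypothetical and do not come from the slides.

```python
import re

def host_by_find(line):
    """find() the at sign and the next space, then slice between them."""
    atpos = line.find('@')
    sppos = line.find(' ', atpos)
    return line[atpos + 1:sppos]

def host_by_split(line):
    """Split on whitespace, take the address, then split it on '@'."""
    return line.split()[1].split('@')[1]

def host_by_regex(line):
    """Only lines starting with 'From ' yield a host; others give None."""
    found = re.findall(r'^From .*@([^ ]*)', line)
    return found[0] if found else None

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
results = [host_by_find(data), host_by_split(data), host_by_regex(data)]
print(results)   # ['uct.ac.za', 'uct.ac.za', 'uct.ac.za']

# A line that does not start with 'From ' is rejected by the regex version.
other = 'Subject: write to someone@example.com today'
print(host_by_regex(other))   # None
```

One design point in favour of the regular expression is that the check for the 'From ' prefix and the extraction happen in a single step.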

Spam Confidence

    import re
    hand = open('mbox-short.txt')
    numlist = list()
    for line in hand:
        line = line.rstrip()
        # Extract the number that follows the header name
        stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
        if len(stuff) != 1 : continue
        num = float(stuff[0])
        numlist.append(num)
    print 'Maximum:', max(numlist)

    X-DSPAM-Confidence: 0.8475

    python ds.py
    Maximum: 0.9907
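The same loop can be tried without the mbox-short.txt file by passing in a list of lines. The sketch below is a variation on the slide, not the original program: the function name and the sample lines are hypothetical, and it returns None instead of failing when no confidence header is present (max() raises an error on an empty list).

```python
import re

def max_confidence(lines):
    """Return the largest X-DSPAM-Confidence value in lines, or None if absent."""
    values = []
    for line in lines:
        found = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line.rstrip())
        if len(found) != 1:
            continue
        values.append(float(found[0]))
    return max(values) if values else None

# Hypothetical lines in the style of the mail headers shown earlier.
sample = [
    'X-DSPAM-Result: Innocent\n',
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.9907\n',
]

print(max_confidence(sample))   # 0.9907
print(max_confidence([]))       # None
```

An open file handle can be passed in place of the list, since the function only iterates over lines.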

Escape Character
If you want a special regular expression character to just behave normally (most of the time) you prefix it with '\'

    >>> import re
    >>> x = 'We just received $10.00 for cookies.'
    >>> y = re.findall('\$[0-9.]+',x)
    >>> print y
    ['$10.00']

    \$[0-9.]+
    \$       A real dollar sign
    [0-9.]   A digit or period
    +        One or more
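Leaving out the backslash shows why the escape matters: an unescaped $ still means "end of the line", so nothing is found. The sketch also uses re.escape(), a standard-library helper that the slides do not cover, to build the escaped form of a literal string.

```python
import re

x = 'We just received $10.00 for cookies.'

# \$ is a literal dollar sign.
escaped = re.findall(r'\$[0-9.]+', x)

# An unescaped $ anchors at the end of the line, so no digits can follow it.
unescaped = re.findall(r'$[0-9.]+', x)

# re.escape() adds the backslash for us.
built = re.findall(re.escape('$') + r'[0-9.]+', x)

print(escaped)     # ['$10.00']
print(unescaped)   # []
print(built)       # ['$10.00']
```

Inside square brackets the period needs no backslash, which is why [0-9.] matches a digit or a literal period.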

Summary
Regular expressions are a cryptic but powerful language for matching strings and extracting elements from those strings
Regular expressions have special characters that indicate intent

Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and open.umich.edu and made available under a Creative Commons Attribution 4.0 License. Please maintain this last slide in all copies of the document to comply with the attribution requirements of the license. If you make a change, feel free to add your name and organization to the list of contributors on this page as you republish the materials.
Initial Development: Charles Severance, University of Michigan School of Information
Insert new Contributors and Translations here