IT 3203 Introduction to Web Development
Regular Expressions
October 12
Notice: This session is being recorded.
Copyright 2007 by Bob Brown

University Convocation
Tuesday, October 13, 11:00 AM to 12:15 PM, Student Center Theatre
Convocation Speaker: Dr. John Palfrey, speaking on "Born Digital in a Network Society"
Professor at Harvard Law School; Vice Dean for Library and Information Resources
Co-author of Born Digital: Understanding the First Generation of Digital Natives and of Access Denied: The Practice and Politics of Internet Filtering

Pattern Matching
Pattern matching in JavaScript is based on regular expressions.
Regular expressions are patterns that are compared with strings or substrings.
In reality, regular expressions are a small formal language.
Two approaches in JavaScript:
    the RegExp object
    methods of the String object

Why Match Patterns?
Most data validation that can be done on the client side consists of testing data for conformance to a pattern.
    Telephone numbers
    Email addresses
    Dates
    Money amounts
    What else?

The Search Method
search is a method of the String object.
    var my_string = "Abernathy";
    var my_pos = my_string.search(/er/);
my_pos becomes 2.
/er/ is a pattern.
The search method searches for the pattern in the string and returns the position of the first match.
It returns -1 if there is no match.

The Replace Method
    var bobs = "Bob, Bobbie";
    bobs = bobs.replace(/Bob/g, "Bill");
The string bobs now contains "Bill, Billbie".
replace returns a new string, so the result must be assigned back to bobs.
/Bob/ is a pattern, but "Bill" is just a string.
Patterns are case-sensitive: /bob/ would not match "Bob".
The g means global: replace every match, not just the first.
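The search and replace slides can be tried out with a short sketch like the one below. The variable names no_pos, all_bills, and first_bill are illustrative additions, not from the slides, and the sketch assumes the replace pattern is /Bob/ as the slide's explanation states.

```javascript
// search() returns the index of the first match, or -1 if there is none.
var my_string = "Abernathy";
var my_pos = my_string.search(/er/);
var no_pos = my_string.search(/xyz/);

// replace() returns a new string; the original string is left unchanged.
var bobs = "Bob, Bobbie";
var all_bills = bobs.replace(/Bob/g, "Bill");   // g: replace every match
var first_bill = bobs.replace(/Bob/, "Bill");   // no g: first match only

console.log(my_pos, no_pos);   // 2 -1
console.log(all_bills);        // Bill, Billbie
console.log(first_bill);       // Bill, Bobbie
console.log(bobs);             // Bob, Bobbie
```

Because bobs itself is untouched, the result has to be stored somewhere if it is needed later.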
The Match Method
match is the most general of the methods.
    var fruit = "4 apples 3 oranges";
    var my_nbrs = fruit.match(/\d/g);
my_nbrs contains ["4", "3"] (it's an array of strings).
    g: all matches
    no g: first match, plus parenthesized subpatterns
\d matches digits (and \D matches non-digits).

Forming Regular Expressions
/ / enclose patterns.
Normal characters match themselves (e.g. rabbit).
Metacharacters have special meanings:
    \ | ( ) [ ] { } ^ $ * + ? .
Metacharacters can be included in patterns by escaping them with a backslash, like \$ for a real dollar sign.

Wildcard Matching
. (period) matches any character except newline.
/snow./ matches snows and snowy, and matches snowi in snowing.

Classes
[ ] (brackets) define classes, e.g. [abc]
/[abc]/ matches a or b or c
/[a-h]/ matches lower-case a through h
^ (circumflex) inverts a class
/[^aeiou]/ matches any character except a, e, i, o, u

Predefined Classes
Written as a backslash and a class abbreviation letter.
See your textbook or a JavaScript reference.
\d matches a digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
/\d+\.\d*/
    One or more digits
    a period
    zero or more digits

Word and Space Characters
Word characters: [A-Za-z0-9_]        \w
Non-word characters: [^A-Za-z0-9_]   \W
Space characters (space, tab, new line):  \s
Non-space characters:                \S
Capitalization reverses the sense of the predefined class names.
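A minimal sketch of match with and without the g modifier, along with the class and wildcard examples from the slides. The pattern /(\d) (\w+)/ and the variable names other than fruit are illustrative assumptions, not from the slides.

```javascript
var fruit = "4 apples 3 oranges";

// With g: an array of every match, as strings.
var all_digits = fruit.match(/\d/g);

// Without g: the first match, followed by the parenthesized subpatterns.
var first = fruit.match(/(\d) (\w+)/);

// A character class and its inverse.
var vowels = "snowing".match(/[aeiou]/g);
var non_vowels = "snowing".match(/[^aeiou]/g);

// The wildcard matches any single character except a newline.
var wild = "snowing".match(/snow./);

console.log(all_digits);                             // [ '4', '3' ]
console.log(first[0] + " | " + first[1] + " | " + first[2]);  // 4 apples | 4 | apples
console.log(vowels.join(""), non_vowels.join(""));   // oi snwng
console.log(wild[0]);                                // snowi
```

Note that the digits come back as strings; convert with Number() before doing arithmetic on them.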
Boundary Matches
\b matches the boundary between a word character and a non-word character.
In "Foo baz" the boundaries fall at the start and end of each word; each is a zero-length match.
This allows a whole-words-only search.
/Fred\b/ matches "Fred is" but not "Frederick is"
/Fred\B/ matches "Frederick is" but not "Fred is"
/\bis\b/ matches only the standalone "is" in: This island is beautiful

Repetition
*    zero or more
+    one or more
?    one or none
{ }  a count
(Each applies to the pattern character on its left.)
/xy{4}z/ == /xyyyyz/
/X*y+z?/

Repetition Examples
* zero or more, + one or more, ? one or none
/\d*\.\d+/
/\d*\.?\d*/

Repetition Exercise
/\d*\.\d+/
1. 0.0
2. .25
3. 137
4. 137.
5. 4.5678
6. xyz.123

Can We Fix The Pattern?
Assume we are trying to match valid numbers in various combinations with a decimal point.
Is this any better? (Not much!)
/\d+\.?\d*/
1. 0.0
2. .25
3. 137
4. 137.
5. 4.5678
6. xyz.123

Repetition Exercise: Case 2
/\d+\.?\d*/
2. .25
This expression does match test case 2, at position 1, the digit 2.
But the leading decimal point is skipped:
    \d+ matches 25
    \.? makes (another) decimal point optional
    \d* matches nothing
It also matches within .25.67! Why?
What about .25.67.89?
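One way to work through the repetition exercise is to run both patterns against the six test cases and look at what each one actually matched. This is only a sketch: first_match is a hypothetical helper introduced here, and the boundary checks at the end reuse the examples from the Boundary Matches slide.

```javascript
// first_match is a hypothetical helper: the matched text, or null if no match.
function first_match(pattern, text) {
  var m = text.match(pattern);
  return m === null ? null : m[0];
}

var cases = ["0.0", ".25", "137", "137.", "4.5678", "xyz.123"];

// Original pattern: optional digits, a required period, one or more digits.
var original = cases.map(function (c) { return first_match(/\d*\.\d+/, c); });

// Attempted fix: one or more digits, an optional period, optional digits.
var attempt = cases.map(function (c) { return first_match(/\d+\.?\d*/, c); });

console.log(original);  // [ '0.0', '.25', null, null, '4.5678', '.123' ]
console.log(attempt);   // [ '0.0', '25', '137', '137.', '4.5678', '123' ]

// Word boundaries: \b succeeds only where a word starts or ends.
var fred_word = "Fred is".search(/Fred\b/);
var fred_prefix = "Frederick is".search(/Fred\b/);
var is_pos = "This island is beautiful".search(/\bis\b/);

console.log(fred_word, fred_prefix, is_pos);  // 0 -1 12
```

Neither pattern is anchored, so both happily match a fragment of xyz.123; that is the weakness the later slides address.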
Repetition Exercise: Case 6
/\d+\.?\d*/
6. xyz.123
This expression does match test case 6, at position 4, the digit 1.
But the leading decimal point is skipped:
    \d+ matches 123
    \.? makes (another) decimal point optional
    \d* matches nothing

Another Repetition Exercise
/X*y+z?/
1. Xyyyz
2. Xzzy
3. yyyyz
4. yyyy
5. wxyzz
6. zzzxyzz

Anchors
Anchors specify where a match must start or end.
/^pearl/  The match starts at the beginning of the string.
    Matches "pearls are..." but not "my pearls..."
^ is the same character used for class inversion, but in a different context it has a different meaning.
/gold$/  Anchors to the end of the string.
    Matches "I like gold" but not "sunset is golden"

Grouping and Alternatives
Parentheses group items.
The pipe or vertical bar (|) matches one of two or more alternatives.
/abc(def|xyz)/ matches abcdef or abcxyz

Now We Can Fix The Pattern (Almost!)
We are trying to match either a digit or a decimal point:
    If a decimal point, then one or more digits.
    Otherwise, an optional decimal point followed by zero or more digits.
/^\d*(|\.\d*)?$/
Problem: This matches a decimal point all by itself.
To fix, we need conditional expressions, which are beyond the scope of the course because conditionals are not supported in JavaScript.

A Closer Look
/^\d*(|\.\d*)?$/
    Anchored at the beginning of the string
    Zero or more digits
    A group containing either nothing, or a decimal point and zero or more digits,
    repeated zero or one times
    Anchored at the end of the string
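The anchor, alternation, and "almost fixed" number pattern can be checked with a sketch along these lines. It assumes the group in the number pattern reads (|\.\d*), an empty alternative or a decimal point followed by digits, as the A Closer Look description says; the helper name is_number and the variable names are illustrative.

```javascript
// Anchors: ^ ties the match to the start of the string, $ to the end.
var pearl_start = "pearls are".search(/^pearl/);
var pearl_later = "my pearls".search(/^pearl/);
var gold_end = "I like gold".search(/gold$/);
var gold_mid = "sunset is golden".search(/gold$/);

// Alternation inside a group: abc followed by def or xyz.
var alt = "abcxyz".search(/abc(def|xyz)/);

// is_number is a hypothetical helper wrapping the anchored pattern.
// An anchored pattern can only match at position 0, so search() is 0 or -1.
function is_number(text) {
  return text.search(/^\d*(|\.\d*)?$/) === 0;
}

var cases = ["0.0", ".25", "137", "137.", "4.5678", "xyz.123", "."];
var results = cases.map(is_number);

console.log(pearl_start, pearl_later, gold_end, gold_mid, alt);  // 0 -1 7 -1 0
console.log(results);  // [ true, true, true, true, true, false, true ]
```

The last result is the problem the slide points out: a lone decimal point is accepted. The anchors do fix the earlier weakness, since xyz.123 is now rejected.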
Did That Work?
/^\d*(|\.\d*)?$/
1. 0.0
2. .25
3. 137
4. 137.
5. 4.5678
6. xyz.123
7. .

Modifiers
Modifiers follow the pattern:
    g  global
    i  case-insensitive
/buffalo/i matches Buffalo and buffalo

The Split Method
Splits a string into substrings.
Returns an array of substrings.
    var my_str = "grapes:apples:oranges";
    var fruit = my_str.split(":");
fruit is ["grapes", "apples", "oranges"]
split can take a regular expression as a delimiter.

Split with a Regular Expression
Splitting a comma-delimited string:
    var my_nbrs = "12,34,56";
    var nbr_array = my_nbrs.split(",");
What about this?
    var my_nbrs = "12, 3,4, 56";
    nbr_array = my_nbrs.split(/\s*,\s*/);
The pattern treats a comma plus any surrounding white space as the delimiter.

A 7-Digit Phone Number
How does this work?
    var ok = phnum.search(/\d{3}-\d{4}/);
What does the search method return for this?
    555-1212
What about this?
    444555-12123456
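The split examples and the two phone number questions above can be answered by running them. A brief sketch, with variable names such as plain_split and messy_phone introduced only for illustration:

```javascript
var my_str = "grapes:apples:oranges";
var fruit = my_str.split(":");

// A plain-string delimiter leaves the stray spaces in the pieces.
var my_nbrs = "12, 3,4, 56";
var plain_split = my_nbrs.split(",");

// A regular expression delimiter absorbs white space around each comma.
var nbr_array = my_nbrs.split(/\s*,\s*/);

// The unanchored 7-digit pattern finds a phone number anywhere in the string.
var good_phone = "555-1212".search(/\d{3}-\d{4}/);
var messy_phone = "444555-12123456".search(/\d{3}-\d{4}/);

// The i modifier makes the match case-insensitive.
var buffalo = "Buffalo".search(/buffalo/i);

console.log(fruit);        // [ 'grapes', 'apples', 'oranges' ]
console.log(plain_split);  // [ '12', ' 3', '4', ' 56' ]
console.log(nbr_array);    // [ '12', '3', '4', '56' ]
console.log(good_phone, messy_phone, buffalo);  // 0 3 0
```

The second search returns 3 because 555-1212 is found in the middle of the longer string, so a result other than -1 does not by itself mean the whole input is a valid phone number.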
A 7-Digit Phone Number
The unanchored pattern /\d{3}-\d{4}/ finds a match inside 444555-12123456.
Anchoring the beginning and end gives an expression that works:
    var ok = phnum.search(/^\d{3}-\d{4}$/);
No match here!

10-Digit Phone Number
Can it be extended for Atlanta-style phone numbers?
    var ok = phnum.search(/^\d{3}-\d{3}-\d{4}$/);

10-Digit Phone Number
Can the format be made less rigid? (Yes!)
/^\(?\d{3}\D*\d{3}\D*\d{4}$/
    Anchored at the beginning of the string
    Optional left parenthesis
    Three digits
    Optional non-digits
    Three digits
    Optional non-digits
    Four digits
    Anchored at the end of the string

Accepting Free-Form Phone Numbers
Parentheses act as grouping and storage operators.
    var ok = datum.search(/^\(?\d{3}\D*\d{3}\D*\d{4}$/);
    if (ok == 0) {
        var parts = datum.match(/^\(?(\d{3})\D*(\d{3})\D*(\d{4})$/);
        output.value = '(' + parts[1] + ') ' + parts[2] + '-' + parts[3];
    }
Accepts: 404-555-1234, 4045551234, (404) 555-1234, etc.
Returns: (404) 555-1234

Regular Expressions as NFAs
NFA: Nondeterministic Finite Automaton
Nondeterministic is not the same as random.
Each part of a regular expression will match as much as it can.
.* matches to the end of the string (or to the next newline)!
The regular expression engine backtracks when necessary, i.e. when a match would otherwise fail.

Regular Expressions are Greedy
A regular expression will match as much of the target string as possible.
/2.*2/
19202122232425252627282930313233
The match runs from the first 2 through the last 2: 2021222324252526272829303132
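Two points from the slides above, sketched in code: normalizing a free-form phone number with stored groups, and the greedy match of /2.*2/. The function name format_phone is hypothetical, and the sketch assumes the "optional non-digits" steps are written \D* as the step-by-step description indicates. It returns a string instead of writing to a form field so it can run outside a browser.

```javascript
// format_phone is a hypothetical helper: returns "(AAA) PPP-NNNN" or null.
function format_phone(datum) {
  // Each parenthesized group stores the digits it matched.
  var parts = datum.match(/^\(?(\d{3})\D*(\d{3})\D*(\d{4})$/);
  if (parts === null) {
    return null;   // not a recognizable 10-digit number
  }
  return "(" + parts[1] + ") " + parts[2] + "-" + parts[3];
}

var inputs = ["404-555-1234", "4045551234", "(404) 555-1234", "555-1234"];
var formatted = inputs.map(format_phone);
console.log(formatted);
// [ '(404) 555-1234', '(404) 555-1234', '(404) 555-1234', null ]

// Greedy matching: .* grabs as much as it can, then backtracks to the last 2.
var digits = "19202122232425252627282930313233";
var greedy = digits.match(/2.*2/)[0];
console.log(greedy);   // 2021222324252526272829303132
```

Calling match once and testing for null does the work of the separate search and match calls on the slide; either approach accepts the same inputs.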
Regular Expressions are Greedy
Consider parsing HTML with a regular expression.
Stars by the <b>billions</b> and <b>billions</b>.
/<b>.*<\/b>/
The greedy pattern matches from the first <b> through the last </b>.
The ? is also the lazy modifier:
/<b>.*?<\/b>/
The lazy pattern stops at the first </b>.
Friedl, J. Mastering Regular Expressions

Questions?

IP Addresses
4.56.123.156
/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/
    var octets = ip.match( );   (with the pattern above as the argument)
Then check that each octet is no greater than 255.
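The greedy versus lazy behavior and the IP address check can be sketched as follows. The function name valid_ip and the variable names are assumptions made for illustration, and the sketch reads the slide's octet check as "no greater than 255".

```javascript
var html = "Stars by the <b>billions</b> and <b>billions</b>.";

// Greedy: .* runs to the end, then backs up to the last </b>.
var greedy = html.match(/<b>.*<\/b>/)[0];

// Lazy: .*? stops at the first </b> it can reach.
var lazy = html.match(/<b>.*?<\/b>/)[0];

console.log(greedy);  // <b>billions</b> and <b>billions</b>
console.log(lazy);    // <b>billions</b>

// valid_ip is a hypothetical helper: four dot-separated numbers, each <= 255.
function valid_ip(ip) {
  var octets = ip.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (octets === null) {
    return false;                 // wrong overall shape
  }
  for (var i = 1; i <= 4; i++) {  // octets[0] is the whole match
    if (Number(octets[i]) > 255) {
      return false;
    }
  }
  return true;
}

console.log(valid_ip("4.56.123.156"));  // true
console.log(valid_ip("256.1.1.1"));     // false
console.log(valid_ip("1.2.3"));         // false
```

The pattern only checks the shape of the address; the numeric range test is easier to do in ordinary code than inside the regular expression.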