Regular Expressions in Java
2010-09-22
Birgit Grohe

Content of this lecture

A very small Java program
Regular expressions in Java
Metacharacters
Character classes and boundaries
Quantifiers
Backreferences
Flag Expressions and Modifiers
Summary

Programming in Java

Java is an object-oriented programming language.
In some languages (e.g. Perl), the first step is to write small programs from scratch.
Learning Java is about learning how to use objects, classes and packages, often before you write your own.
A Java program is first compiled into a .class file; then you can run the program (remember lab 1!).
This differs from Perl, where an interpreter takes care of both compilation and execution.

Hello, world! in Java

    public class Hello {                           // class definition
        public static void main(String[] args) {   // method
            // Printing to a terminal window (this line is a comment)
            System.out.println("Hello, world!");
        }
    }

    > javac Hello.java
    > java Hello
    Hello, world!
Regular Expressions in Java

The package java.util.regex consists of the classes Pattern, Matcher and PatternSyntaxException.
A Pattern object is a compiled representation of a regular expression.
A Matcher object is the engine that interprets the pattern and performs match operations against an input string.
A PatternSyntaxException is thrown when the regular expression contains a syntax error.

Example

The code below is a Java class for regular expression processing. It reads a regular expression and an input string from the user. The output is the list of matches, if any.
The class is taken from a Java regular expression tutorial:
http://download.oracle.com/javase/tutorial/essential/regex/index.html
The class will be used in lab 5!

    import java.io.Console;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTestHarness {
        public static void main(String[] args) {
            Console console = System.console();
            if (console == null) {
                System.err.println("No console.");
                System.exit(1);
            }
            while (true) {
                // Compile the regular expression typed by the user.
                Pattern pattern = Pattern.compile(console.readLine(
                        "%nEnter your regex: "));
                // Create a matcher for the input string typed by the user.
                Matcher matcher = pattern.matcher(console.readLine(
                        "Enter input string to search: "));
                boolean found = false;
                // Each call to find() moves on to the next match.
                while (matcher.find()) {
                    console.format("I found the text \"%s\" starting at "
                            + "index %d and ending at index %d.%n",
                            matcher.group(), matcher.start(), matcher.end());
                    found = true;
                }
                if (!found) {
                    console.format("No match found.%n");
                }
            }
        }
    }

From a Java regexp tutorial, see references.

Format specifiers used in the code:
    %n   newline
    %s   string
    %d   number (decimal integer)

    Enter your regex: foo
    Enter input string to search: foo
    I found the text "foo" starting at index 0 and ending at index 3.

    Enter your regex: cat.
    Enter input string to search: cats
    I found the text "cats" starting at index 0 and ending at index 4.

The dot in cat. is a metacharacter: it matches any single character, here the "s".
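The test harness above needs an interactive console. As a rough, non-interactive sketch of the same Pattern/Matcher loop, the class below collects the matches in a list. The class name FindAllDemo, the method findAll and the "text@start-end" output format are made up for this illustration and are not part of the lecture or the tutorial.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class for illustration only.
final class FindAllDemo {

    // Returns one entry per match, formatted as "text@start-end".
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // throws PatternSyntaxException on bad syntax
        Matcher matcher = pattern.matcher(input);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("foo", "foo"));    // [foo@0-3]
        System.out.println(findAll("cat.", "cats"));  // [cats@0-4]
        try {
            findAll("(cat", "cats");                  // unbalanced parenthesis
        } catch (PatternSyntaxException e) {
            System.out.println("Bad regex: " + e.getDescription());
        }
    }
}
```

Note that the end index is exclusive: "foo" occupies indices 0, 1 and 2, and matcher.end() returns 3.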
Metacharacters

There are characters with a special meaning within regular expressions in Java:
    * ? + [ ] ( ) { } ^ $ \ -
To use their literal meanings, use the escape symbol \ or the escape sequence \Q<text>\E.

Character Classes

    Simple character class:   [abc]
    Negation:                 [^abc]
    Ranges:                   [a-d]  [a-dm-p]
    Union:                    [a-d[m-p]]
    Intersection:             [a-z&&[def]]     d, e or f
    Subtraction:              [a-z&&[^bc]]     same as [ad-z]

Predefined Character Classes

    Digit:                  [0-9] or \d
    Non-digit:              [^0-9] or \D
    Whitespace character:   [ \t\n\x0B\f\r] or \s
    Word character:         [a-zA-Z_0-9] or \w
    Other negations:        \S  \W

Boundary Matchers

    The beginning of a line:          ^
    The end of a line:                $
    Word boundary:                    \b
    The beginning of the input:       \A
    The end of the previous match:    \G
    The end of the input:             \z

For more matchers see literature!
These are interesting because quantifiers in Java work slightly differently compared to Perl.
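When a regular expression is written as a string literal in Java source code (rather than typed into the test harness), every backslash has to be doubled, so \d becomes "\\d". The following sketch tries a few of the constructs listed above; the class name CharClassDemo and the helper count are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
final class CharClassDemo {

    // Counts the non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Subtraction: a to z without b and c.
        System.out.println(Pattern.matches("[a-z&&[^bc]]", "a"));   // true
        System.out.println(Pattern.matches("[a-z&&[^bc]]", "b"));   // false
        // Union: a to d, or m to p.
        System.out.println(Pattern.matches("[a-d[m-p]]", "n"));     // true
        // Predefined class for digits, with the backslash doubled.
        System.out.println(count("\\d+", "lab 5, room 12"));        // 2
        // Word boundaries: only the stand-alone word "cat" matches.
        System.out.println(count("\\bcat\\b", "cat concat cats"));  // 1
        // Quoting: between the Q and E escapes, + is an ordinary character.
        System.out.println(Pattern.matches("\\Q1+1\\E", "1+1"));    // true
        System.out.println(Pattern.matches("1+1", "1+1"));          // false
    }
}
```

Pattern.matches succeeds only if the whole input matches the expression, whereas the find() loop in count looks for matches anywhere in the input.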
Quantifiers

    Greedy    Reluctant    Possessive    Meaning
    X?        X??          X?+           X, once or not at all
    X*        X*?          X*+           X, zero or more times
    X+        X+?          X++           X, one or more times
    X{n}      X{n}?        X{n}+         X, exactly n times

More alternatives: X{n,} and X{n,m}

Greedy Quantifiers

Multiple matches!

    Enter your regex: a?
    Enter input string to search: aaaa
    I found the text "a" starting at index 0 and ending at index 1.
    I found the text "a" starting at index 1 and ending at index 2.
    I found the text "a" starting at index 2 and ending at index 3.
    I found the text "a" starting at index 3 and ending at index 4.
    I found the text "" starting at index 4 and ending at index 4.

Greedy!

    Enter your regex: a*
    Enter input string to search: aaaa
    I found the text "aaaa" starting at index 0 and ending at index 4.
    I found the text "" starting at index 4 and ending at index 4.

    Enter your regex: a+
    Enter input string to search: aaaa
    I found the text "aaaa" starting at index 0 and ending at index 4.

Unlike +, the quantifiers ? and * also match the empty string, hence the zero-length match at index 4 in the first two examples.

Greedy Quantifiers

Grouping strings for quantifiers with ( ):

    Enter your regex: (cat){3}
    Enter input string to search: catcatcatcatcatcat
    I found the text "catcatcat" starting at index 0 and ending at index 9.
    I found the text "catcatcat" starting at index 9 and ending at index 18.

Without the parentheses, the quantifier applies only to the t:

    Enter your regex: cat{3}
    Enter input string to search: catcatcatcatcatcat
    No match found.

Greedy!

    Enter your regex: a{3,5}
    Enter input string to search: aaaaaaaa
    I found the text "aaaaa" starting at index 0 and ending at index 5.
    I found the text "aaa" starting at index 5 and ending at index 8.

Reluctant and Possessive Quantifiers

    Enter your regex: .*foo     // greedy quantifier
    Enter input string to search: xfooxxxxxxfoo
    I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Tries to finish as early as possible:

    Enter your regex: .*?foo    // reluctant quantifier
    Enter input string to search: xfooxxxxxxfoo
    I found the text "xfoo" starting at index 0 and ending at index 4.
    I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Tries only once!

    Enter your regex: .*+foo    // possessive quantifier
    Enter input string to search: xfooxxxxxxfoo
    No match found.
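The three quantifier types can also be compared in a small program instead of typing each expression into the harness. This is only a sketch; the class QuantifierDemo and the method matches are names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
final class QuantifierDemo {

    // Returns the text of every match of regex in input, in order.
    static List<String> matches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(matches(".*foo", input));   // greedy:     [xfooxxxxxxfoo]
        System.out.println(matches(".*?foo", input));  // reluctant:  [xfoo, xxxxxxfoo]
        System.out.println(matches(".*+foo", input));  // possessive: []

        // a? also produces a zero-length match at the end of the input.
        System.out.println(matches("a?", "aaaa").size());            // 5
        // Grouping decides what the quantifier applies to.
        System.out.println(matches("(cat){3}", "catcatcatcatcatcat").size());  // 2
        System.out.println(matches("cat{3}", "catcatcatcatcatcat").size());    // 0
    }
}
```

The possessive case returns an empty list because .*+ takes the whole input and never gives any of it back, so the trailing foo has nothing left to match.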
Summary Quantifiers

The greedy quantifier first matches as much as it can, up to the end of the input. If the rest of the pattern then fails, it gives back one character at a time and tries again, until a match is found or there is nothing left to give back (= no match).
The reluctant quantifier first matches as little as possible, then takes one more character at a time until a match is found or the end of the input string is reached (= no match).
The possessive quantifier consumes as much as it can once, and if the rest of the pattern does not succeed, it just stops without looking back. Fast performance!

Backreferences

Backreferences work approximately the same as in Perl, i.e. the text matched by those parts of the regular expression that are placed in ( ) can be accessed with \1, \2, ...

Modifiers

Java has features similar to the modifiers in Perl. There are two ways to use them:
    Embedded flag expressions (the flag is given inside the regular expression)
    Flags and methods from the Pattern class (extra code and function calls required)
More modifiers can be found in the Java Regexp tutorial.

Embedded Flag Expressions

Example: case insensitivity

    Enter your regex: (?i)foo
    Enter input string to search: FOOfooFoO
    I found the text "FOO" starting at index 0 and ending at index 3.
    I found the text "foo" starting at index 3 and ending at index 6.
    I found the text "FoO" starting at index 6 and ending at index 9.
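As a small illustration of backreferences and of the embedded flag (?i), the sketch below looks for a word that is repeated twice in a row and counts case-insensitive matches. The class BackrefDemo and its methods are hypothetical names used only here; matcher.group(1) returns the text captured by the first pair of parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
final class BackrefDemo {

    // Returns the first word that occurs twice in a row, or null if there is none.
    static String firstRepeatedWord(String input) {
        // (\w+) captures a word; \1 requires the same text again after a space.
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Counts the non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test"));  // is
        System.out.println(firstRepeatedWord("no repeats here"));    // null
        System.out.println(count("(?i)foo", "FOOfooFoO"));           // 3
        System.out.println(count("foo", "FOOfooFoO"));               // 1
    }
}
```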
Methods from the Pattern Class

Example: case insensitivity

Modify the code of the test harness so that the flag is passed to Pattern.compile:

    Pattern pattern = Pattern.compile(
            console.readLine("%nEnter your regex: "),
            Pattern.CASE_INSENSITIVE);

    Enter your regex: dog
    Enter input string to search: DoGDOg
    I found the text "DoG" starting at index 0 and ending at index 3.
    I found the text "DOg" starting at index 3 and ending at index 6.

Other Modifiers and Flags

The Pattern and Matcher classes support features similar to those in Perl, e.g. split, several different substitution methods (called replacement in Java), comments, line versus file mode, etc.
Please read the Java Regexp tutorial for more details!

Summary

Java provides a package for regular expressions: java.util.regex
The syntax and usage of regular expressions in Perl and Java are similar.
There are minor differences in the regular expression engine, e.g. in how the quantifiers are implemented.
Java and Perl provide similar features, e.g. classes and functions. You will explore some differences in lab 5.
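A brief sketch of three of the features mentioned under Other Modifiers and Flags: split, replacement, and line mode (Pattern.MULTILINE). The class OtherFeaturesDemo and its method names are invented for this illustration; the library calls used are Pattern.split, String.replaceAll and the Pattern.MULTILINE flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
final class OtherFeaturesDemo {

    // split: cuts the input at commas, ignoring whitespace around them.
    static String[] splitOnCommas(String input) {
        return Pattern.compile("\\s*,\\s*").split(input);
    }

    // Replacement: $1, $2, $3 refer to the captured groups.
    static String reorderDate(String isoDate) {
        return isoDate.replaceAll("(\\d+)-(\\d+)-(\\d+)", "$3/$2/$1");
    }

    // Line mode: with MULTILINE, ^ matches at the start of every line.
    static List<String> firstWords(String text) {
        Matcher m = Pattern.compile("^\\w+", Pattern.MULTILINE).matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("a , b,c")));  // [a, b, c]
        System.out.println(reorderDate("2010-09-22"));                  // 22/09/2010
        System.out.println(firstWords("one two\nthree four"));          // [one, three]
    }
}
```

In the replacement string, groups are written as $1 rather than \1; the backslash form is used only inside the regular expression itself.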