Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy




Kim S. Larsen
Odense University

Abstract

For many years, regular expressions with back referencing have been used in a variety of software products in common use, including operating systems, editors, and programming languages. In these products, regular expressions have been extended with a naming construct. If a subexpression is named by some variable, whenever this subexpression matches some string, that string is assigned to the variable. Later occurrences of the same variable will then match the string assigned to it. This construction greatly increases the power of regular expressions and is useful in text searching as well as in text substitution in large documents. We study the nested usage of this operator, and prove that the power of the expressions increases with the number of nested levels that are allowed.

Keywords: Formal languages; regular expressions; back referencing.

1 Introduction

For many years, regular expressions with back referencing have been used in a variety of software products in common use, including operating systems, editors, and programming languages. However, it seems that a theoretical study of this language construction has not been made.

In [2], a vaguely related class of expressions is studied. In the terminology of our paper, these are regular expressions without alternation and with back referencing being restricted to using one variable name only.

Using the semantics of regular expressions with back referencing as outlined in [1], we define nested levels of back referencing and prove that the number of languages that can be expressed increases with the number of nested levels that are allowed.

Supported in part by the ESPRIT Long Term Research Programme of the EU under project number 20244 (ALCOM-IT).
Department of Mathematics and Computer Science, Odense University, Campusvej 55, DK-5230 Odense M, Denmark. Email: kslarsen@imada.ou.dk.

2 Regular expressions with back referencing

We choose a standard syntax for regular expressions with and without back referencing [1]. Let $\Sigma$, the alphabet, be an infinite set of symbols (see footnote 1), and let $\Psi$ be an infinite set of names. The following grammar defines the syntax of regular expressions with back referencing:

$$E \;::=\; a \;\mid\; E \cdot E \;\mid\; E\,|\,E \;\mid\; E^{*} \;\mid\; E\%\alpha \;\mid\; \alpha \;\mid\; (E)$$

where $a \in \Sigma$ and $\alpha \in \Psi$.

The precedence of the operators from highest to lowest is as follows: naming ($E\%\alpha$), Kleene star ($E^{*}$), concatenation ($E \cdot E$), and alternation ($E\,|\,E$). As usual, we often write $EE$ instead of $E \cdot E$, i.e., leaving out the dot.

If $E$ is a regular expression with back referencing, we use $L(E)$ to denote the language it defines. A complete definition of the semantics of regular expressions with back referencing can be found in [1]. The semantics of the operators concatenation, alternation, and star are as usual for regular expressions. Informally, the semantics of naming and use of variables is best explained from an operational point of view. If we try to parse a string, from left to right, according to a regular expression with back referencing, whenever a substring is matched with a named subexpression, that substring is assigned to the variable in question. A later occurrence of this variable matches the string assigned to it.

Before continuing, let us give a few examples (partly from [1]) of the semantics of these expressions:

- $(a|b)^{*}\%\alpha\;\alpha\;\alpha$ is the language $\{www \mid w \in (a|b)^{*}\}$.
- $(a|b|c)^{*}\;(a|b|c)\%\alpha\;(a|b|c)^{*}\;\alpha\;(a|b|c)^{*}$ is the language of strings over $\{a,b,c\}$ with at least one repeated occurrence of a symbol.
- $(a|b)^{*}\%\alpha\;(a|b)^{*}\%\beta\;(\alpha|\beta)$ is the language $\bigcup_{u,v \in (a|b)^{*}} \{uuv, uvv\}$.
- $((a|b)^{*}\%\alpha\;\alpha)^{*}$ is the set of all strings over $\{a,b\}$ of the form $s_1 s_1 s_2 s_2 \cdots s_k s_k$ for some $k$, where the $s_i$'s can be any strings of $a$'s and $b$'s.

In the last example, notice that in each iteration of the outermost star expression, a new value is bound to the variable $\alpha$.

For convenience, we use $e^{i}$ as an abbreviation for $e \cdot e \cdots e$ ($i$ times), and we assume that names are not reused, i.e., in any expression $e$, for any $\alpha \in \Psi$, there is at most one subexpression of the form $e'\%\alpha$. This does not restrict the expressive power and could be avoided at the cost of more cumbersome definitions and proofs.

Footnote 1: Usually, the alphabet for regular expressions is finite. Here, our goal is to establish an infinite hierarchy, and we need infinitely many symbols. However, each expression that we consider is finite and uses only a finite part of the alphabet. For purists, we want to remark that we could instead have used an infinite family of finite alphabets.
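The examples above can be tried out in a present-day regex engine. The following is a minimal sketch, assuming that a capturing group in Python's `re` module behaves like the naming construct $E\%\alpha$ and that the backreference `\1` behaves like a later use of $\alpha$. This correspondence is an illustration only, not the formal semantics of [1], and the helper name `accepts` is introduced here for convenience.

```python
import re

# Paper notation -> Python `re` (an informal correspondence, assumed here):
#   E%alpha  ~ a capturing group (E);  a later alpha ~ the backreference \1

# (a|b)*%alpha alpha alpha : strings of the form www
www = re.compile(r"((?:a|b)*)\1\1")

# (a|b|c)* (a|b|c)%alpha (a|b|c)* alpha (a|b|c)* : some symbol occurs twice
repeated = re.compile(r"(?:a|b|c)*(a|b|c)(?:a|b|c)*\1(?:a|b|c)*")

# ((a|b)*%alpha alpha)* : s1 s1 s2 s2 ... sk sk
# The group is rebound in every iteration of the outer star.
squares = re.compile(r"(?:((?:a|b)*)\1)*")


def accepts(pattern, text):
    """True if the whole of `text` is matched by `pattern`."""
    return pattern.fullmatch(text) is not None
```

For instance, `accepts(squares, "aabbabab")` binds the group to `a`, then `b`, then `ab` in three successive iterations, which mirrors the remark that a new value is bound to $\alpha$ in each iteration of the outermost star.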

There are the following additional requirements, which are necessary in order to ensure that the use of variables is well-defined:

- If $\alpha$ appears in an expression $e$, then there must exist a subexpression $e'$ of $e$ of the form $e' = e_1 \cdot e_2$, where the $\alpha$ under discussion is a subexpression of $e_2$, and an expression of the form $e''\%\alpha$ is a subexpression of $e_1$.
- Assume that $e'\%\alpha$ is a subexpression of $e$. If there exists a subexpression $e'' = e_1\,|\,e_2$ of $e$ such that $e'$ is a subexpression of $e_i$, where $i = 1$ or $i = 2$, then the name $\alpha$ is only allowed to appear in $e_i$.
- Assume that $e'\%\alpha$ is a subexpression of $e$. If there exists a subexpression $e'' = e_1^{*}$ of $e$ such that $e'$ is a subexpression of $e_1$, then $\alpha$ is only allowed to appear in $e_1$.

The first requirement ensures that a name is defined before it is used, and the last two requirements ensure that a string has been assigned to it.

The nested level of a regular expression with back referencing is defined recursively in the structure of the expression. Regular expressions have nested level zero. The nested level of $E\%\alpha$ is one more than the nested level of $E$. The nested level of a variable $\alpha$ is the nested level of its defining expression $E\%\alpha$. For all other operators, the nested level of the expression is the maximum of the nested levels of the arguments.

In the following sections, we use contexts [3]. These are merely expressions with a hole instead of a symbol. As an example, $C[\,] = (a\,|\,[\,])$ is a context. If $C[\,]$ is a context, $C[E]$ is the expression $C[\,]$ with the hole replaced by $E$. In the example, $C[b|c]$ is the expression $(a|(b|c))$.

We use $\sigma(E)$ (the signature of $E$) to denote the set of symbols from $\Sigma$ occurring in an expression $E$.

3 The hierarchy

Let $\mathcal{L}_i$ be the set of languages which can be expressed by a regular expression with back referencing of at most $i$ nested levels. Thus, $\mathcal{L}_0$ equals the set of regular languages. Clearly, $\mathcal{L}_i \subseteq \mathcal{L}_j$ if $i \le j$. In this section, we prove that $\mathcal{L}_i \subsetneq \mathcal{L}_j$ if $i < j$.
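The recursive definition of nested level from Section 2 can be sketched in code. The following is a minimal illustration, assuming a small syntax-tree representation; the class names `Sym`, `Cat`, `Alt`, `Star`, `Name`, and `Var` and the helper `definitions` are hypothetical and do not come from the paper. It relies on the assumption that names are not reused and that expressions satisfy the well-definedness requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:            # a single alphabet symbol a
    name: str

@dataclass(frozen=True)
class Cat:            # concatenation E . E
    left: object
    right: object

@dataclass(frozen=True)
class Alt:            # alternation E | E
    left: object
    right: object

@dataclass(frozen=True)
class Star:           # Kleene star E*
    body: object

@dataclass(frozen=True)
class Name:           # naming E%alpha
    body: object
    var: str

@dataclass(frozen=True)
class Var:            # use of a variable alpha
    var: str


def definitions(e, found=None):
    """Map each variable to its (unique) naming subexpression E%alpha."""
    found = {} if found is None else found
    if isinstance(e, Name):
        found[e.var] = e
        definitions(e.body, found)
    elif isinstance(e, (Cat, Alt)):
        definitions(e.left, found)
        definitions(e.right, found)
    elif isinstance(e, Star):
        definitions(e.body, found)
    return found


def nested_level(e, defs=None):
    """Nested level of e; call it on the root so all definitions are visible."""
    defs = definitions(e) if defs is None else defs
    if isinstance(e, Sym):
        return 0
    if isinstance(e, Name):                 # one more than the named body
        return 1 + nested_level(e.body, defs)
    if isinstance(e, Var):                  # level of the defining E%alpha
        return nested_level(defs[e.var], defs)
    if isinstance(e, Star):
        return nested_level(e.body, defs)
    # Cat and Alt: maximum over the two arguments
    return max(nested_level(e.left, defs), nested_level(e.right, defs))
```

With this representation, $(a|b)^{*}\%\alpha\;\alpha\;\alpha$ has nested level 1, and an expression in which a naming contains another naming, such as $(a\%\alpha\;\alpha)\%\beta\;\beta$, has nested level 2.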
Assume that $\Sigma = \{a^l_0, a^m_0, a^r_0, a^l_1, a^m_1, a^r_1, a^l_2, a^m_2, a^r_2, \ldots\}$ and $\Psi = \{\alpha_0, \alpha_1, \alpha_2, \ldots\}$. The superscripts $l$, $m$, and $r$ stand for left, middle, and right. Define the sequence of expressions $\{x_i\}_{i \ge 0}$ as follows:

$$x_0 = (a^l_0\, a^m_0\, a^r_0)^{*}$$

and for $i \ge 1$,

$$x_i = (a^l_i\;\, x_{i-1}\%\alpha_{i-1}\;\, a^m_i\;\, \alpha_{i-1}\;\, a^r_i)^{*}.$$

Clearly, $L(x_i) \in \mathcal{L}_i$. In this section, we show that for all $i > 0$, $L(x_i) \notin \mathcal{L}_{i-1}$. Thus, the sequence of sets of languages forms a hierarchy.

If a regular expression with back referencing $R$ defines the language $L(x_k)$ for some $k$, then we can prove that $R$ must have a certain form. Basically, the following lemma breaks up the expression $R$. Looking at $R$ as a syntax tree, it says that there must exist a path from the root to a leaf on which there are $k+1$ stars with the property that the symbols $a^d_i$ are introduced gradually going up this path (there could be more than $k+1$ stars in total; the statement is merely that there exist $k+1$ stars with this property). More precisely, only symbols $a^d_j$ with $j < i$ are present in the subexpression under the $i$th of these stars going from the leaf and up.

Lemma 1. Let $R$ be a regular expression with back referencing such that $L(R) = L(x_k)$. Then there exist expressions $R_i$ and contexts $C_i$, $1 \le i \le k+1$, such that $R$ can be written as follows:

$$R = C_{k+1}[R_k]$$
$$R_i = C_i[R_{i-1}], \quad i \in \{1,\ldots,k\}$$
$$R_0 = C_0[a^m_0]$$

where $\sigma(R_0) = \{a^l_0, a^m_0, a^r_0\}$ and $\sigma(R_i) = \sigma(R_{i-1}) \cup \{a^l_i, a^m_i, a^r_i\}$, $i \in \{1,\ldots,k\}$.

Proof. Let $y_j$ be as $x_k$, except that all occurrences of stars in $x_k$ are replaced by $j$'s, i.e., any subexpression $e^{*}$ is replaced by $e^{j}$. Let $M = \bigcup_j L(y_j)$. Clearly, $M \subseteq L(x_k)$ and $M$ is infinite.

We consider all possible $R_i$'s and $C_i$'s and show that if they are not of the right form, then either strings not in $L(x_k)$ can be produced (which is a contradiction), or this part of the expression $R$ can only contribute finitely many of the strings in $M$. This also leads to a contradiction since there are only a finite number of possibilities for choosing the $R_i$'s and $C_i$'s (which is the same as saying that there are only a finite number of different paths from the root to a leaf in the syntax tree of a finite expression).

For convenience, define $R_{k+1} = R$. Now, assume for the sake of contradiction that the lemma does not hold. Let $p \in \{0,\ldots,k\}$ be the largest integer for which we can find expressions and contexts of the right form, i.e., for $1 \le i \le p$, we have $R_i = C_i[R_{i-1}]$ and $R_0 = C_0[a^m_0]$, where $\sigma(R_0) = \{a^l_0, a^m_0, a^r_0\}$ and $\sigma(R_i) = \sigma(R_{i-1}) \cup \{a^l_i, a^m_i, a^r_i\}$.

We consider the different possibilities for why this sequence cannot be extended to one fulfilling the lemma. Consider the smallest context $C[\,]$ such that $R = C'[(C[R_p])^{*}]$ and $\sigma(C[\,]) \setminus \sigma(R_p) \ne \emptyset$ (the context $C'[\,]$ is just the rest of $R$ after $C[\,]$ has been defined).

If no such context exists, either there is no star containing the subexpression $R_p$, in which case this path can only produce a bounded number of $a^d_{p+1}$'s, or no symbols with indices greater than $p$ are introduced at all. In any case, this path's contribution to $M$ is finite.

If such a context $C[\,]$ does exist, then, by the choice of $p$, a symbol with index greater than or equal to $p+2$ is introduced, i.e., belongs to $\sigma(C[\,]) \setminus \sigma(R_p)$. But then the strings will not have the correct form unless symbols with index $p+1$ are introduced at the same time. However, since $C[\,]$ is chosen to be the smallest context fulfilling the requirements listed above, it is not possible to produce an unbounded number of symbols with index $p+1$ in between symbols with index $p+2$, so again only finitely many strings from $M$ can originate from this path.
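To make the expressions $x_i$ and the strings of $M$ used in the proof more concrete, here is a minimal sketch that builds a Python `re` pattern approximating $x_i$ and the string obtained by unrolling every star the same number of times. The token encoding `<l0>`, `<m0>`, ... for the symbols $a^d_i$, the group names `a0`, `a1`, ... for the variables $\alpha_i$, and the function names are all assumptions made for this illustration; a backtracking matcher is only a convenient stand-in for the paper's semantics.

```python
import re


def sym(d, i):
    # Hypothetical encoding of the symbol a^d_i as the token "<d i>", e.g. "<l0>"
    return f"<{d}{i}>"


def x_pattern(i):
    """Python pattern approximating x_i; group a{i-1} plays the role of alpha_{i-1}."""
    l, m, r = (re.escape(sym(d, i)) for d in "lmr")
    if i == 0:
        return f"(?:{l}{m}{r})*"
    inner = x_pattern(i - 1)
    # a^l_i  x_{i-1}%alpha_{i-1}  a^m_i  alpha_{i-1}  a^r_i, all under a star
    return f"(?:{l}(?P<a{i-1}>{inner}){m}(?P=a{i-1}){r})*"


def word(i, reps):
    """The string obtained from x_i by unrolling every star exactly `reps` times."""
    if i == 0:
        return (sym("l", 0) + sym("m", 0) + sym("r", 0)) * reps
    w = word(i - 1, reps)
    return (sym("l", i) + w + sym("m", i) + w + sym("r", i)) * reps
```

For example, `re.fullmatch(x_pattern(2), word(2, 2))` succeeds, while a string in which the two copies between `<l1>`, `<m1>`, and `<r1>` differ is rejected by `x_pattern(1)`.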

Theorem 2. Let $R$ be a regular expression with back referencing, and assume that $L(R) = L(x_k)$. Then $R$ has nested level $k$.

Proof. Let $R$ be written as described in Lemma 1. We prove that if some $R_i$, $i \in \{1,\ldots,k\}$, is not of the form $R_i = C^A_i[(C^B_i[R_{i-1}])\%\alpha]$, for some $\alpha$, then we get a contradiction. Clearly, this implies that all of these $k$ $R_i$'s are of this form, so the nested level is $k$.

Now, assume to the contrary that $R_i$ is not of the form described above. We make some transformations on the expression $R_i$ (in the remaining part of this proof, we ignore the rest of $R$). These transformations have the property that if an expression $F$ is transformed into $F'$, then $L(F') \subseteq L(F)$.

First, we perform the following transformations on $R_{i-1}$:

- Replace an alternation with one of its arguments; if $a^d_{i-1}$, for some $d \in \{l,m,r\}$, is in the signature of one argument and not the other, choose the former. Otherwise, decide arbitrarily.
- Replace a star with its argument.

The transformations are carried out repeatedly until none of them can be applied any more. After this, let $E$ refer to the expression that $R_{i-1}$ has been transformed into.

Now the remaining parts of $R_i$ are transformed:

- Replace an alternation with one of its arguments; if $E$ is a subexpression of one of the arguments, that argument is chosen. Otherwise, decide arbitrarily.
- Replace a star with its argument whenever $E$ is not a subexpression of that argument.

The transformations are carried out repeatedly until none of them can be applied any more. The resulting expression has no alternations, and only stars having $E$ as a subexpression. Since, by assumption, there were no namings between $R_{i-1}$ and $R_i$, this means that in the transformed expression, subexpressions of namings only contain concatenation as an operator.

Now replace all names with the subexpression which has been given that name, and remove the naming. Since the named subexpressions only contain concatenation, each subexpression is in fact simply a string, so this transformation preserves the semantic meaning of the expression. After this, let $F$ refer to the transformed expression. Thus, $F$ is of the form $(C[E])$.

Since $L(F) \subseteq L(R_i)$ and since $R_i$ is a subexpression of $R$, the strings in $L(F)$ must be of the form

$$x\;\, a^l_i\, y_1\, a^m_i\, y_1\, a^r_i\;\, a^l_i\, y_2\, a^m_i\, y_2\, a^r_i\;\, a^l_i\, y_3\, a^m_i\, y_3\, a^r_i \cdots z$$

where $a^l_i \notin \sigma(x)$, $a^r_i \notin \sigma(z)$ and $x$, $z$, and the $y_i$'s are some strings.

Since stars are only applied to expressions containing $E$ as a subexpression, there must exist a constant $p$ such that $L(F)$ intersected with

$$(\Sigma \setminus \{a^l_i\})^{*}\;\, \underbrace{a^l_i\, T^{*} a^m_i\, T^{*} a^r_i\;\, a^l_i\, T^{*} a^m_i\, T^{*} a^r_i \cdots a^l_i\, T^{*} a^m_i\, T^{*} a^r_i}_{p \text{ times}}\;\, (\Sigma \setminus \{a^r_i\})^{*},$$

where $T = \{a^l_0, a^m_0, a^r_0, \ldots, a^l_{i-1}, a^m_{i-1}, a^r_{i-1}\}$, is still infinite. This is seen as follows: Let $E_s$ be the smallest expression containing $E$ as a subexpression such that $R_i$ can be written $R_i = C_s[E_s]$ and $\{a^l_i, a^m_i, a^r_i\} \subseteq \sigma(E_s)$. Since $E_s$ is the smallest expression with this property, there cannot be any stars in $E_s$ applied to a subexpression containing any of the $a^d_i$'s. If there were, strings would not have the correct form (there would be occurrences of one of the $a^d_i$'s without both of the other $a^d_i$'s appearing in between). Removing all stars in $C_s[\,]$ (in other words, changing all stars to a 1 for "one iteration") would give us a regular expression with a bounded number, $p$, of $a^d_i$'s. Since $E_s$ contains a star and $\sigma(E) \ne \emptyset$, the language is still infinite.

We now apply a sequential transducer [4] to the result of this intersection. The transducer outputs $\varepsilon$ until it meets the first $a^l_i$, after which it outputs whatever it reads. In this way, we can get rid of the $x$'s and still have a regular language. Since the reverse of a regular language is also regular, we can with a similar transducer remove the $z$'s. Again using a sequential transducer, we change every symbol with a subscript different from $i$ to a new symbol $b$. The strings in this regular language are of the form

$$a^l_i\, y_1\, a^m_i\, y_1\, a^r_i\;\, a^l_i\, y_2\, a^m_i\, y_2\, a^r_i \cdots a^l_i\, y_p\, a^m_i\, y_p\, a^r_i$$

where the $y_i$'s only contain $b$'s. Applying the pumping lemma [5] to this infinite language shows that it is not regular, and we have obtained a contradiction.
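The three transducer steps at the end of the proof act on individual strings in a simple way, which the following minimal sketch illustrates on token lists. It only shows the effect on a single string, under the assumption that symbols are represented as Python strings such as `"l1"`; the function names and the token names are hypothetical, and the sketch says nothing about the automaton construction that preserves regularity.

```python
def drop_prefix(tokens, marker):
    """Output nothing until the first `marker`, then copy the input through."""
    out, copying = [], False
    for t in tokens:
        copying = copying or t == marker
        if copying:
            out.append(t)
    return out


def drop_suffix(tokens, marker):
    """The same idea on the reversed string: cut everything after the last `marker`."""
    return drop_prefix(tokens[::-1], marker)[::-1]


def relabel(tokens, keep, fresh="b"):
    """Map every symbol outside `keep` (the a^d_i) to the single new symbol b."""
    return [t if t in keep else fresh for t in tokens]


def normalise(tokens, left, middle, right):
    """Remove x and z, then replace all symbols with another subscript by b."""
    core = drop_suffix(drop_prefix(tokens, left), right)
    return relabel(core, {left, middle, right})
```

Applied to a string of the form $x\, a^l_1\, y\, a^m_1\, y\, a^r_1\, z$ with $y = a^l_0 a^m_0 a^r_0$, `normalise` returns $a^l_1\, bbb\, a^m_1\, bbb\, a^r_1$, which is the shape to which the pumping lemma is then applied.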
Corollary 3. If $\mathcal{L}_k$ is the set of languages which can be expressed by a regular expression with back referencing of at most $k$ nested levels, then

$$\forall i,j \in \mathbb{N}:\; i < j \;\Rightarrow\; \mathcal{L}_i \subseteq \mathcal{L}_j \,\wedge\, \mathcal{L}_i \ne \mathcal{L}_j.$$

Acknowledgment

The author would like to thank Peter Høyer for many interesting discussions on this and related issues.

References

[1] A. V. Aho, Algorithms for Finding Patterns in Strings, in Handbook of Theoretical Computer Science, Vol. A, J. van Leeuwen, ed., Chap. 5, 1073-1156: Algorithms and Complexity (Elsevier Science Publishers, Amsterdam, 1990).

[2] D. Angluin, Finding Patterns Common to a Set of Strings, Proceedings of the 11th Annual ACM Symposium on Theory of Computing (1979) 130-141.

[3] H. P. Barendregt, The Lambda Calculus: Its Syntax and Semantics (North-Holland, Amsterdam, 1981).

[4] M. A. Harrison, Introduction to Formal Language Theory (Addison-Wesley, Reading, Massachusetts, 1978).

[5] J. C. Martin, Introduction to Languages and the Theory of Computation (McGraw-Hill, New York, 1991).