Fall 1998 Formal Language Theory Dr. R. Boyer

Week Five: Regular Languages; Pumping Lemma

1. There are algorithms to answer the following questions:

(1) Given a DFA $M$ and a string $w$, is $w \in L(M)$? The algorithm is linear in the length of the string.
(2) Given a DFA $M$, is $L(M) = \emptyset$?
(3) Given a DFA $M$, is $L(M) = \Sigma^*$?
(4) Given two DFAs $M_1$ and $M_2$, is $L(M_1) \subseteq L(M_2)$?
(5) Given two DFAs $M_1$ and $M_2$, is $L(M_1) = L(M_2)$? Note: this question can be solved either by using (4) in both directions or by using the uniqueness of the minimal-state equivalent DFA.
(6) Given an NFA $M$ and a string $w$, is $w \in L(M)$? There is an algorithm which is polynomial in the length of the string.

2. Proposition. If $L = L(M)$, where $M$ is an NFA, then $L$ is a regular language; that is, there is a regular expression $r$ such that $L = L(r)$.

We shall study two constructions for this correspondence. The first is a graph-based algorithm that builds up the regular expression as nodes are deleted from the state diagram. The second interprets the regular language as the solution of a system of equations.

3. To present the graph-oriented algorithm, we need to introduce an even more general notion of NFA. It has the expanded property that its edges may be labeled by regular expressions, not simply by a symbol $a \in \Sigma$ or by $e$. We denote this new class of automata by GNFA.

Method of Acceptance: in a usual NFA, the machine matches an input symbol with an edge label in order to make a move. A GNFA may consume more than one symbol in order to make the next move: it consumes a substring that belongs to the regular language denoted by the edge label.
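Questions (1), (2) and (6) of item 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: automata are stored as dictionaries keyed by `(state, symbol)`, the NFA has no $e$-moves, and the two small machines `EVEN_A` and `ENDS_AB` are hypothetical examples that do not come from these notes.

```python
from collections import deque

def dfa_accepts(delta, start, accepting, w):
    """Question (1): one table lookup per input symbol, so linear in |w|."""
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in accepting

def dfa_is_empty(delta, start, accepting, alphabet):
    """Question (2): L(M) is empty iff no accepting state is reachable."""
    seen, queue = {start}, deque([start])
    while queue:
        q = queue.popleft()
        if q in accepting:
            return False
        for a in alphabet:
            r = delta[(q, a)]
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return True

def nfa_accepts(delta, start, accepting, w):
    """Question (6): track the set of states reachable after each symbol."""
    current = {start}
    for a in w:
        current = {r for q in current for r in delta.get((q, a), ())}
    return bool(current & accepting)

# Hypothetical DFA: strings over {a, b} with an even number of a's.
EVEN_A = {("even", "a"): "odd", ("odd", "a"): "even",
          ("even", "b"): "even", ("odd", "b"): "odd"}

# Hypothetical NFA (no e-moves): strings over {a, b} that end in "ab".
ENDS_AB = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

print(dfa_accepts(EVEN_A, "even", {"even"}, "abba"))  # two a's
print(nfa_accepts(ENDS_AB, 0, {2}, "aab"))
```

The NFA routine does at most $|K|^2$ work per input symbol for an NFA with state set $K$, which is one way to obtain the polynomial bound mentioned in (6).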

A detailed description of this mode of acceptance follows. Let $M$ be a GNFA, and let $w \in \Sigma^*$. Then $w \in L(M)$ exactly when $w$ can be written as $w = w_1 w_2 \cdots w_k$ and there is a sequence of states $q_0 = s, q_1, \ldots, q_k = f$ such that $w_i \in L(R_i)$, where $R_i = \delta(q_{i-1}, q_i)$ is the regular expression labeling the edge from $q_{i-1}$ to $q_i$.

Normalization Condition: we shall assume that the GNFA $M$ has a start state $s$ that has NO edge coming into it and that $M$ has a unique final state $f$ with NO edges leaving it. Further, assume $f \neq s$. Also, it will be convenient to write the transitions in the form $\delta(q, q') = R$, where $q$ and $q'$ are states and $R$ is a regular expression.

4. Graph-Theoretical Algorithm for converting an NFA into an equivalent regular expression.

Step 1: Convert the given NFA $M$ into a GNFA $M'$ by introducing a new start state, a new final state, and the necessary transitions. Suppose $M'$ has $k$ states.

Step 2: If $k = 2$, then $M'$ has just the start state and the unique final state. It is clear that $L(M') = L(R)$, where $R = \delta'(s, f)$.

Step 3 ("Node Reduction Step"): If $k > 2$, we remove a node to produce an equivalent automaton $M''$. In particular, select a node $q'' \neq s, f$. Then define $M'' = (Q \setminus \{q''\}, \Sigma, \delta'', s, \{f\})$, where $\delta''(q_i, q_j) = \delta'(q_i, q_j)$ if either $\delta'(q_i, q'') = \emptyset$ or $\delta'(q'', q_j) = \emptyset$; otherwise,

$\delta''(q_i, q_j) = R_1 (R_2)^* R_3 \cup R_4$,

where $R_1 = \delta'(q_i, q'')$, $R_2 = \delta'(q'', q'')$, $R_3 = \delta'(q'', q_j)$, and $R_4 = \delta'(q_i, q_j)$.

Step 4: Repeat Step 3 until $M''$ has two states: $s$ and $f$.

We need to verify that Step 3 does indeed produce an equivalent automaton. We can argue by visualizing paths through the state diagram of the GNFA. In particular, it is sufficient to observe that if a GNFA $M_3$, with transition function $\delta_3$, had just three states $q_1$, $q_2$ and $q_3$, then there is an equivalent GNFA $M_2$, with transition function $\delta_2$, with two states, where:

$\delta_2(q_1, q_2) = R_{1,3} (R_{3,3})^* R_{3,2} \cup R_{1,2}$.
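Steps 2 to 4 can be tried out on a small machine. The following Python sketch is illustrative only: edge labels are stored as Python `re` pattern strings, `None` stands for $\emptyset$, the empty pattern `""` stands for $e$, and the normalized GNFA `GNFA` (accepting strings with an odd number of b's) is a hypothetical example, not one taken from these notes. The helper names `union`, `concat`, `star` and `eliminate` are likewise assumptions of the sketch.

```python
import re
from itertools import product

def union(r, s):
    """R ∪ S, where None is the empty set."""
    if r is None:
        return s
    if s is None or r == s:
        return r
    return f"({r}|{s})"

def concat(r, s):
    """RS; concatenating with the empty set gives the empty set."""
    return None if r is None or s is None else r + s

def star(r):
    """R*; both the empty set and e have star equal to e."""
    return "" if r in (None, "") else f"({r})*"

def eliminate(interior, delta, start="s", final="f"):
    """Remove the interior states one at a time (Step 3); return delta(s, f)."""
    delta = dict(delta)
    remaining = list(interior)
    for qk in interior:
        remaining.remove(qk)
        loop = star(delta.get((qk, qk)))
        nodes = [start, *remaining, final]
        new = {}
        for p, q in product(nodes, repeat=2):
            # R1 (R2)* R3 ∪ R4, with missing edges treated as the empty set
            via = concat(concat(delta.get((p, qk)), loop), delta.get((qk, q)))
            label = union(delta.get((p, q)), via)
            if label is not None:
                new[(p, q)] = label
        delta = new
    return delta.get((start, final))

# Hypothetical normalized GNFA: odd number of b's over {a, b}.
GNFA = {("s", 0): "", (0, 0): "a", (0, 1): "b",
        (1, 1): "a", (1, 0): "b", (1, "f"): ""}
regex = eliminate([0, 1], GNFA)
print(regex)

# Brute-force comparison with the defining property on all strings of length <= 6.
agrees = all(
    bool(re.fullmatch(regex, "".join(w))) == (w.count("b") % 2 == 1)
    for n in range(7) for w in product("ab", repeat=n)
)
print(agrees)
```

The brute-force check at the end is not a proof of equivalence, only a sanity test on short strings.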

In the three-state formula, $R_{i,j}$ denotes $\delta_3(q_i, q_j)$.

We can describe this method as an algorithm as follows. We let $G$ be a GNFA with start state $s$ and unique accepting state $f$, with the remaining states given as $q_1, q_2, \ldots, q_n$. The algorithm given below successively removes the states $q_1$, $q_2$, and so on, one at a time, producing an equivalent GNFA at each stage. When the loop terminates, the resulting GNFA has only the two states $s$ and $f$. The language that it accepts is denoted by the regular expression given by the label $\delta(s, f)$.

    for k = 1..n do
        for i, j = k+1..n do
            new(q_i, q_j) := δ(q_i, q_j) ∪ δ(q_i, q_k) δ(q_k, q_k)* δ(q_k, q_j)
        od;
        for i = k+1..n do
            new(s, q_i) := δ(s, q_i) ∪ δ(s, q_k) δ(q_k, q_k)* δ(q_k, q_i);
            new(q_i, f) := δ(q_i, f) ∪ δ(q_i, q_k) δ(q_k, q_k)* δ(q_k, f)
        od;
        new(s, f) := δ(s, f) ∪ δ(s, q_k) δ(q_k, q_k)* δ(q_k, f);
        δ := new;
    od;

We next compare the above graph algorithm with the approach taken in the textbook. That approach is a dynamic-programming construction over the allowed intermediate states; it is instructive to compare it with the Floyd-Warshall algorithm for the "all-pairs shortest path" problem, which you may have seen in an algorithms course.

As usual, the language $L$ is accepted by the DFA $M = (\{q_1, \ldots, q_n\}, \Sigma, \delta, q_1, F)$. We let $R^k_{ij}$ denote the set of all strings that take the automaton $M$ from state $q_i$ to state $q_j$ without going through any state numbered $k$ or larger; that is, only strings are allowed that start at $q_i$, end at $q_j$, and use only the states $q_1, \ldots, q_{k-1}$ as intermediate states. Such sets of strings satisfy an important inductive identity:

$R^{k+1}_{ij} = R^k_{ij} \cup R^k_{ik} (R^k_{kk})^* R^k_{kj}$.

To understand this identity, consider the following. Any string $w$ contained in $R^{k+1}_{ij}$ either uses state $q_k$ as an intermediate state or it does not. If $w$ does not use state $q_k$, then it must lie in $R^k_{ij}$. So we must examine how state $q_k$ is used by the string $w$. In that case, $w$ follows a path from state $q_i$ to $q_k$ and then from $q_k$ to $q_j$, such that only the states $q_1, \ldots, q_{k-1}$ are used as intermediate states on each piece. So it seems that we must add the set $R^k_{ik} R^k_{kj}$. In fact, there is a further possibility: we can also use paths that cycle through state $q_k$ any number of times. So the set of strings that use state $q_k$ is:

$R^k_{ik} (R^k_{kk})^* R^k_{kj}$.

One detail to consider is that $(R^k_{kk})^*$ is in general larger than $R^k_{kk}$ itself, because $q_k$ is not allowed as an intermediate state for strings from $R^k_{kk}$.

We can easily describe the sets of strings $R^1_{ij}$. For $i \neq j$, $R^1_{ij} = \{a : \delta(q_i, a) = q_j\}$, while for $i = j$, $R^1_{ii} = \{a : \delta(q_i, a) = q_i\} \cup \{e\}$.

We show by induction that $R^k_{ij}$ is denoted by a regular expression $r^k_{ij}$. For $k = 1$, we let, for $i \neq j$, $r^1_{ij} = a_1 \cup \cdots \cup a_p$, and for $i = j$, $r^1_{ii} = e \cup a_1 \cup \cdots \cup a_p$, where $a_1, \ldots, a_p$ are the symbols with $\delta(q_i, a_\ell) = q_j$, $1 \le \ell \le p$. If this set of symbols is empty, then $r^1_{ij} = \emptyset$ for $i \neq j$, while for $i = j$, $r^1_{ii} = e$. We conclude that $R^1_{ij} = L(r^1_{ij})$.

We now assume that the result holds for the value $k$. That is, for any set $R^k_{\ell m}$, there exists a regular expression $r^k_{\ell m}$ such that $R^k_{\ell m} = L(r^k_{\ell m})$. We need to show that any set $R^{k+1}_{ij}$ is given by a regular expression. By the inductive property of the set $R^{k+1}_{ij}$, we know that

$R^{k+1}_{ij} = R^k_{ij} \cup R^k_{ik} (R^k_{kk})^* R^k_{kj}$.

By the induction hypothesis, we have:

$R^k_{ij} \cup R^k_{ik} (R^k_{kk})^* R^k_{kj} = L(r^k_{ij}) \cup L(r^k_{ik}) \, L((r^k_{kk})^*) \, L(r^k_{kj}) = L(r^k_{ij} \cup r^k_{ik} (r^k_{kk})^* r^k_{kj})$.

This finishes the induction. Finally, we observe that

$L(M) = \bigcup \{ R^{n+1}_{1j} : q_j \in F \}$,

so $L(M)$ is denoted by the union of the regular expressions $r^{n+1}_{1j}$ with $q_j \in F$.

Algorithm for the computation of the regular expression $r^{k+1}_{ij}$. Note: $L(r^{k+1}_{ij}) = R^{k+1}_{ij}$. We shall write $r(i, j, k+1)$ for $r^{k+1}_{ij}$. In the base case, $\delta(q_i, q_j)$ stands for the union of all symbols $a$ with $\delta(q_i, a) = q_j$ (or $\emptyset$ if there are none).

    function r(i, j, k+1)
        if k = 0 then
            case i = j : RETURN( δ(q_i, q_j) ∪ {e} )
            case i ≠ j : RETURN( δ(q_i, q_j) )
        else
            r(i, j, k+1) := r(i, j, k) ∪ r(i, k, k) r(k, k, k)* r(k, j, k);
    end

Note: the length of the regular expression can be exponential in the number of states of the DFA.

Example:

    State   a   b
      1     2   3
      2     1   3
      3     2   2

The accepting states are $\{q_2, q_3\}$ and the initial state is $q_1$.
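The recursive function can be sketched in Python on the three-state example. Several assumptions are built in: the transition table is taken to be the one that matches the base cases $r^1_{12} = a$ and $r^1_{13} = b$ used in the sample calculation, regular expressions are stored as Python `re` pattern strings with `None` for $\emptyset$ and `""` for $e$, and memoization via `lru_cache` is added so that shared subexpressions are computed once. The helper names are illustrative.

```python
import re
from functools import lru_cache
from itertools import product

# Example DFA (assumed table): delta(q1,a)=q2, delta(q1,b)=q3, and so on.
DELTA = {(1, "a"): 2, (1, "b"): 3, (2, "a"): 1,
         (2, "b"): 3, (3, "a"): 2, (3, "b"): 2}
N, START, ACCEPTING = 3, 1, {2, 3}

def union(r, s):
    if r is None:
        return s
    if s is None or r == s:
        return r
    if r == "":              # keep the empty alternative last
        r, s = s, r
    return f"({r}|{s})"

def concat(r, s):
    return None if r is None or s is None else r + s

def star(r):
    return "" if r in (None, "") else f"({r})*"

@lru_cache(maxsize=None)
def r(i, j, k):
    """Pattern for R^k_ij: only q_1..q_{k-1} may be intermediate states."""
    if k == 1:
        label = "" if i == j else None
        for a in "ab":
            if DELTA[(i, a)] == j:
                label = union(label, a)
        return label
    m = k - 1                # the newly allowed intermediate state q_m
    through = concat(concat(r(i, m, m), star(r(m, m, m))), r(m, j, m))
    return union(r(i, j, m), through)

def run(w):
    q = START
    for a in w:
        q = DELTA[(q, a)]
    return q in ACCEPTING

# L(M) is the union of r^{n+1}_{1j} over the accepting states q_j.
language = None
for j in sorted(ACCEPTING):
    language = union(language, r(START, j, N + 1))

# Sanity check against the DFA on all strings of length <= 5.
agrees = all(bool(re.fullmatch(language, "".join(w))) == run(w)
             for n in range(6) for w in product("ab", repeat=n))
print(r(3, 2, 1), agrees)
```

Even for three states the final pattern is long, which illustrates the note about exponential growth; no simplification of the expressions is attempted here.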

Sample Calculation for $r^4_{13}$: We first find that

$r^4_{13} = r^3_{13} \cup r^3_{13} (r^3_{33})^* r^3_{33}$.

Next, we compute

$r^3_{13} = r^2_{13} \cup r^2_{12} (r^2_{22})^* r^2_{23}$ and $r^3_{33} = r^2_{33} \cup r^2_{32} (r^2_{22})^* r^2_{23}$.

We must now expand the regular expressions $r^2_{12}$, $r^2_{13}$, $r^2_{22}$, $r^2_{23}$, $r^2_{33}$, and $r^2_{32}$. We find that:

$r^2_{12} = r^1_{12} \cup r^1_{11} (r^1_{11})^* r^1_{12}$,
$r^2_{13} = r^1_{13} \cup r^1_{11} (r^1_{11})^* r^1_{13}$,
$r^2_{22} = r^1_{22} \cup r^1_{21} (r^1_{11})^* r^1_{12}$,
$r^2_{23} = r^1_{23} \cup r^1_{21} (r^1_{11})^* r^1_{13}$,
$r^2_{33} = r^1_{33} \cup r^1_{31} (r^1_{11})^* r^1_{13}$,
$r^2_{32} = r^1_{32} \cup r^1_{31} (r^1_{11})^* r^1_{12}$.

The base cases, which are read off the state diagram of the DFA, are given as follows:

$r^1_{11} = e$, $r^1_{12} = a$, $r^1_{13} = b$,
$r^1_{21} = a$, $r^1_{22} = e$, $r^1_{23} = b$,
$r^1_{31} = \emptyset$, $r^1_{32} = a \cup b$, $r^1_{33} = e$.

It is more efficient to arrange the calculation of $r^{k+1}_{ij}$ in the form of a table.

    Reg. Expr.    k = 1     k = 2     k = 3
    r^k_11        e
    r^k_12        a
    r^k_13        b
    r^k_21        a
    r^k_22        e
    r^k_23        b
    r^k_31        ∅
    r^k_32        a ∪ b
    r^k_33        e

5. Another method of finding a regular expression equivalent to a finite automaton treats the problem as one of solving a system of equations for the language, where concatenation plays the role of multiplication and union the role of addition. Let

$X_q = \{x \in \Sigma^* : \delta(q, x) \in F\}$.

Then we find that:

$X_q = \sum_{a \in \Sigma} a X_{\delta(q,a)}$ if $q \notin F$, while $X_q = \sum_{a \in \Sigma} a X_{\delta(q,a)} + e$ if $q \in F$.

This is the linear system for the sets $X_q$ mentioned above; the language accepted by the automaton is the set $X_q$ for the start state $q$. To solve such a system for regular languages, we need what is known as Arden's Lemma.

Arden's Lemma: Let $A, B \subseteq \Sigma^*$ with $e \notin A$. Then the equation $X = AX \cup B$ has the unique solution $X = A^*B$.
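As an illustration of how the system of equations and Arden's Lemma work together, consider a hypothetical two-state DFA over $\{a, b\}$ (not one from these notes) with start state $q_0$, accepting state $q_1$, $\delta(q_0, a) = q_0$, $\delta(q_0, b) = q_1$, $\delta(q_1, a) = q_1$ and $\delta(q_1, b) = q_0$; it accepts the strings with an odd number of b's. Writing $X_0$ and $X_1$ for $X_{q_0}$ and $X_{q_1}$, the system can be solved by elimination as follows.

```latex
\begin{align*}
X_0 &= aX_0 \cup bX_1, \qquad X_1 = aX_1 \cup bX_0 \cup e \\
X_1 &= a^*(bX_0 \cup e)
    && \text{(Arden with } A = a,\ B = bX_0 \cup e\text{)} \\
X_0 &= aX_0 \cup b\,a^*(bX_0 \cup e) = (a \cup b\,a^*b)\,X_0 \cup b\,a^*
    && \text{(substitute for } X_1\text{)} \\
X_0 &= (a \cup b\,a^*b)^*\, b\,a^*
    && \text{(Arden with } A = a \cup b\,a^*b,\ B = b\,a^*\text{)}
\end{align*}
```

In both applications $e \notin A$, so the lemma applies, and the resulting expression $(a \cup ba^*b)^* ba^*$ does describe the strings with an odd number of b's.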

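Proof of Arden's Lemma.

Step 1: $A^*B$ is a solution, and if $X$ is any solution, then $A^*B \subseteq X$. To see the first claim, note that

$A^*B = (A^+ \cup e)B = A^+B \cup B = A(A^*B) \cup B$.

For the second claim, $X = AX \cup B$ gives $B \subseteq X$ and $AX \subseteq X$, so $A^nB \subseteq X$ for every $n \ge 0$ by induction, and hence $A^*B \subseteq X$.

Step 2: If $X$ is a solution, then $X \subseteq A^*B$. By Step 1, $A^*B \subseteq X$, so we may write $X = A^*B \cup C$ with $C \cap A^*B = \emptyset$. We want to show that $C = \emptyset$. Now $X = AX \cup B$, so

$A^*B \cup C = A(A^*B \cup C) \cup B = A^+B \cup AC \cup B = A^+B \cup B \cup AC = (A^+ \cup e)B \cup AC = A^*B \cup AC$.

Next, intersect both sides with $C$: $(A^*B \cup C) \cap C = (A^*B \cup AC) \cap C$. Since $C \cap A^*B = \emptyset$, this gives $C = AC \cap C$, so $C \subseteq AC$. Since $e \notin A$, if $C$ were non-empty then the shortest string in $AC$ would be strictly longer than the shortest string in $C$, which contradicts $C \subseteq AC$. Hence $C = \emptyset$. We conclude: $A^*B$ is the unique solution.

Note: If $e \in A$, then the solution $A^*B$ is no longer unique, but it is the smallest solution.

6. We now present a useful theoretical result which states that regular languages must obey a certain type of "periodicity" property. It is used to show that certain simple languages cannot be regular.

Pumping Lemma. Let $M = (K, \Sigma, \delta, s, F)$ be a DFA with $L = L(M)$, and let $m = |K|$. Let $w \in L(M)$ with $|w| \ge m$. Then there are strings $x$, $y$, and $z$ such that $w = xyz$, $|xy| \le m$, $y \neq e$, and $xy^kz \in L$ for all $k \ge 0$. We call $m$ the pumping constant.

Idea of the Proof. While reading the first $m$ symbols of $w$, the machine passes through $m + 1$ states, so some state must repeat: the run of any accepted string whose length is at least the number of states of the machine must contain a loop. It is precisely this loop that can be iterated.

The pumping lemma is most often used in its contrapositive form, to show that a language is NOT regular.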
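The loop in the idea of the proof can be located mechanically. The Python sketch below is an illustration under assumptions: the DFA is stored as a dictionary keyed by `(state, symbol)`, the function name `pump_decomposition` is made up for this example, and the two-state machine `EVEN_A` (even number of a's) is hypothetical rather than taken from these notes.

```python
def accepts(delta, start, accepting, w):
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in accepting

def pump_decomposition(delta, start, w):
    """Split w = x y z at the first state that the run of the DFA repeats."""
    first_visit = {start: 0}   # state -> number of symbols read on first visit
    q = start
    for pos, a in enumerate(w, start=1):
        q = delta[(q, a)]
        if q in first_visit:
            i = first_visit[q]
            # y labels the loop from q back to q; it is non-empty since i < pos
            return w[:i], w[i:pos], w[pos:]
        first_visit[q] = pos
    return None                # no repeated state: w is shorter than |K|

# Hypothetical DFA: strings over {a, b} with an even number of a's.
EVEN_A = {("even", "a"): "odd", ("odd", "a"): "even",
          ("even", "b"): "even", ("odd", "b"): "odd"}

x, y, z = pump_decomposition(EVEN_A, "even", "abab")
pumped_ok = all(accepts(EVEN_A, "even", {"even"}, x + y * k + z)
                for k in range(6))
print((x, y, z), pumped_ok)
```

Because a repeat must occur within the first $m$ symbols read, the decomposition found this way satisfies $|xy| \le m$.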

Contrapositive form. Let $L$ be a language. Suppose that for every integer $m \ge 1$ there exists a string $w \in L$ with $|w| \ge m$ such that, for every factorization $w = xyz$ with $|xy| \le m$ and $y \neq e$, we have $xy^kz \notin L$ for some integer $k \ge 0$. Then $L$ cannot be a regular language.

So, to show a language is NOT regular, think of playing the following sort of game: given the pumping constant $m$, find a string $w \in L$ with $|w| \ge m$ so that for any admissible non-empty substring $y$ of $w$ (that is, $w = xyz$ with $|xy| \le m$), there exists some pumped form $xy^kz$ of $w$ with $xy^kz \notin L$.

Examples.

(1) $L_1 = \{a^n b^n : n \ge 1\}$ is not regular. Suppose the language were regular. Choose $n$ greater than the pumping constant $m$. Then $w = a^n b^n$ can be factored as $xyz$ with $|xy| \le m$, $y \neq e$, and $xy^kz \in L_1$ for all $k \ge 0$. Since $|xy| \le m < n$, the string $y$ consists only of a's. Choose $k = 0$, so $xz \in L_1$; but $xz = a^{n-|y|} b^n \notin L_1$. Contradiction. We say that we pumped "down" in this example.

(2) $L_2 = \{a^{n^2} : n \ge 1\}$ is not regular. Suppose $L_2$ were regular. Choose $n$ greater than the pumping constant $m$. Then $a^{n^2} = xyz$, where $y \neq e$ and $|y| \le m < n$, and $xy^kz \in L_2$ for all $k \ge 0$. Choose $k = 2$. Then $xy^2z \in L_2$ implies that $|xy^2z|$ is a perfect square. But $|xy^2z| = n^2 + |y|$, so $n^2 < |xy^2z| < n^2 + n < (n+1)^2$. Contradiction. In this example, we say that we pumped "up."

(3) $L_3 = \{ww^R : w \in \Sigma^*\}$ is not regular if $\Sigma = \{a, b\}$. Suppose $L_3$ were regular. Choose the string $w$ so that $|w| = m + 1$, where $m$ is the pumping constant. Further, we may choose $w$ to have the special form $w = a^m b$, so $ww^R = a^m b b a^m$. By the Pumping Lemma, $ww^R = xyz$ with $|xy| \le m$, $y \neq e$, and $xy^kz \in L_3$ for all $k \ge 0$. Since $|xy| \le m$, the string $y$ lies inside the leading block of a's. Take $k = 0$. Then $xz = a^{m-|y|} b b a^m \in L_3$; but this string is not a palindrome, so $xz \notin L_3$. Contradiction.

7. Problem: Given a DFA $M$, find an equivalent DFA with a minimum number of states. We present two solutions to this problem. The first one is algorithmic. The second one is more conceptual and proves that the equivalent minimum-state DFA is unique, up to the labeling of its states.

8. First Method: Merging of Equivalent States

Let $M = (K, \Sigma, \delta, q_0, F)$ be a DFA. Given two states $q$ and $q'$ from $K$, we define an equivalence relation $\equiv$ on the states of $M$ by:

$q \equiv q'$ means $\delta(q, w) \in F \iff \delta(q', w) \in F$, for all $w \in \Sigma^*$.

The $\equiv$-equivalence classes are computed through a sequence of other equivalence relations $\equiv_n$ by successive refinements. Let $q$ and $q'$ be two states of $M$. Then:

$q \equiv_0 q'$ means $q \in F \iff q' \in F$.

That is, $\equiv_0$ has two equivalence classes: the set of accepting states $F$ and the set of rejecting states $K \setminus F$. For $n \ge 0$, define $\equiv_{n+1}$ by:

$q \equiv_{n+1} q'$ means $q \equiv_n q'$ and $\delta(q, a) \equiv_n \delta(q', a)$, for all $a \in \Sigma$.

That is, $q \equiv_n q'$ means $\delta(q, w) \in F \iff \delta(q', w) \in F$ for all strings $w$ whose length is less than or equal to $n$. The equivalence classes of $\equiv_n$ stabilize for some $n$ less than or equal to the number of states of the automaton $M$. Further, $q \equiv q'$ if and only if $q \equiv_n q'$ for all $n$.

Now, if all the states of $M$ are reachable from the start state $q_0$ and if equivalent states are merged, then the resulting automaton has a minimum number of states. These observations give rise to an effective algorithm to find the minimum-state automaton, by successively computing the $\equiv_n$-equivalence classes for $n = 0, 1, 2, \ldots$. The process terminates when the equivalence classes for two successive values of $n$ agree.

Algorithm for Merging Equivalent States: We first make a table of unordered pairs of distinct states. No pair is marked.

(1) First, mark all pairs of inequivalent states relative to strings of length 0; so mark the pair $\{p, q\}$ if $p \in F$, $q \in K \setminus F$ or $p \in K \setminus F$, $q \in F$.

(2) Next, we mark all pairs of inequivalent states relative to strings of length $k = 1, 2, \ldots, n$, where $n$ is the total number of states of the original DFA.

    for k = 1..n do
        for each unmarked pair {p, q} do
            if {δ(p, σ), δ(q, σ)} is marked for some σ ∈ Σ then mark the pair {p, q}
        od
    od;

(3) When the loop terminates, all inequivalent pairs are marked; so the unmarked pairs are pairs of equivalent states. Merge these pairs together.

Example.

    State   a   b
      0     1   2
      1     3   4
      2     4   3
      3     5   5
      4     5   5
      5     5   5

The accepting states are $\{1, 2, 5\}$. The result of the algorithm is that states 1 and 2 should be merged, and states 3 and 4 should be merged as well.
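The marking procedure can be checked on this example with a short Python sketch. It is an illustration rather than a transcription of the loop in step (2): instead of running a fixed number of rounds, it repeats the marking pass until no new pair is marked, which reaches the same final table. The function name `equivalent_pairs` and the dictionary encoding of the DFA are assumptions of the sketch.

```python
from itertools import combinations

def equivalent_pairs(states, alphabet, delta, accepting):
    """Return the pairs of distinct states left unmarked by the table algorithm."""
    # Step (1): mark pairs distinguished by the empty string.
    marked = {frozenset(pair) for pair in combinations(states, 2)
              if (pair[0] in accepting) != (pair[1] in accepting)}
    # Step (2): mark {p, q} whenever some symbol leads to a marked pair.
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            if frozenset((p, q)) in marked:
                continue
            for a in alphabet:
                successor = frozenset((delta[(p, a)], delta[(q, a)]))
                if len(successor) == 2 and successor in marked:
                    marked.add(frozenset((p, q)))
                    changed = True
                    break
    # Step (3): the unmarked pairs are the equivalent states.
    return [(p, q) for p, q in combinations(states, 2)
            if frozenset((p, q)) not in marked]

# The example DFA from the table above; accepting states {1, 2, 5}.
TABLE = {0: (1, 2), 1: (3, 4), 2: (4, 3), 3: (5, 5), 4: (5, 5), 5: (5, 5)}
DELTA = {(q, s): t for q, row in TABLE.items() for s, t in zip("ab", row)}

merged = equivalent_pairs(range(6), "ab", DELTA, {1, 2, 5})
print(merged)  # [(1, 2), (3, 4)]
```

The output agrees with the result stated for the example: states 1 and 2 are merged, and states 3 and 4 are merged.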

For the example above, the transitions on the pairs of states that remain unmarked after step (1) are:

    {0,3} --a--> {1,5}     {0,3} --b--> {2,5}
    {0,4} --a--> {1,5}     {0,4} --b--> {2,5}
    {1,2} --a--> {3,4}     {1,2} --b--> {3,4}
    {1,5} --a--> {3,5}     {1,5} --b--> {4,5}
    {3,4} --a--> {5,5}     {3,4} --b--> {5,5}
    {2,5} --a--> {4,5}     {2,5} --b--> {3,5}

The pairs $\{1,5\}$ and $\{2,5\}$ lead to pairs marked in step (1), so they are marked; then $\{0,3\}$ and $\{0,4\}$, which lead to $\{1,5\}$ and $\{2,5\}$, are marked as well. The pairs $\{1,2\}$ and $\{3,4\}$ are never marked.

9. Second Method: Construction of the Minimum State DFA directly from the Language $L$

Let $M = (K, \Sigma, \delta, q_0, F)$ be a deterministic finite automaton such that all its states are reachable from its start state. Let $L = L(M)$ be the language it accepts. We associate with $M$ a special equivalence relation $R_M$ on $\Sigma^*$, where

$x \, R_M \, y \iff \delta(q_0, x) = \delta(q_0, y)$, for $x, y \in \Sigma^*$;

that is, two strings $x$ and $y$ are equivalent if they terminate at the same state. Hence, we can identify the $R_M$-equivalence classes $[x]_M$ with the states of $M$. The language $L(M)$ is the union of those $R_M$-equivalence classes $[x]_M$ for which $\delta(q_0, x) \in F$. We may call $R_M$ machine equivalence.

We call an equivalence relation $R$ on $\Sigma^*$ right-invariant if $x \, R \, y \Rightarrow xz \, R \, yz$ for all strings $z \in \Sigma^*$. Note: $R_M$ is right-invariant.

Let $L$ be any language over the alphabet $\Sigma$; that is, $L \subseteq \Sigma^*$. We can associate an equivalence relation $R_L$ on $\Sigma^*$ directly with $L$, without using a finite automaton.

Given any two strings $x, y \in \Sigma^*$, we say

$x \, R_L \, y \iff$ ($xz \in L$ exactly when $yz \in L$, for all $z \in \Sigma^*$).

Note: the equivalence relation $R_L$ is right-invariant, and $R_M$ is a refinement of $R_L$ if $L = L(M)$ for a DFA $M$; that is, $x \, R_M \, y$ implies $x \, R_L \, y$.

10. We can construct a deterministic finite automaton $M_L$ directly from the equivalence relation $R_L$, if $R_L$ has FINITE index; that is, if the number of $R_L$-equivalence classes is finite.

We set $M_L = (K_L, \Sigma, \delta_L, s_L, F_L)$. Let $K_L$, the set of states of the machine $M_L$, be the collection of all $R_L$-equivalence classes; write them as $[x]_L$, for a string $x$. The transition function $\delta_L : K_L \times \Sigma \to K_L$ is given as:

$\delta_L([x]_L, a) = [xa]_L$.

Note: $\delta_L$ is well defined, because $R_L$ is right-invariant. Set $s_L = [e]_L$ and $F_L = \{[x]_L : x \in L\}$.

The minimum-state automaton accepting $L$ is given by $M_L$; further, any other minimum-state automaton that accepts $L$ can be identified with $M_L$ by a re-labeling of its states.