Regular Sets and Expressions

Finite automata are important in science, mathematics, and engineering. Engineers like them because they are superb models for circuits. (And, since the advent of VLSI systems, sometimes finite automata are circuits!) Computer scientists adore them because they adapt very nicely to algorithm design, for example the lexical analysis portion of compiling and translation. Mathematicians are intrigued by them too, due to the fact that there are several nifty mathematical characterizations of the sets they accept. This is partially what this section is about.

We shall build expressions from the symbols 0, 1, +, and & using the operations of union, concatenation, and Kleene closure. Several intuitive examples of our notation are:

   a) 01 means zero followed by one (concatenation)
   b) 0+1 means either zero or one (union)
   c) 0* means ε + 0 + 00 + 000 + ... (Kleene closure)

With parentheses we can build larger expressions. And we can associate meanings with our expressions. Here's how:

   Expression      Set Represented
   (0+1)*          all strings over {0,1}
   0*10*10*        strings containing exactly two ones
   (0+1)*11        strings which end with two ones

That is the intuitive approach to these new expressions or formulas. Now for a precise, formal view. Several definitions should do the job.

Definition. 0, 1, ε, and ∅ are regular expressions.

Definition. If α and β are regular expressions, then so are (αβ), (α + β), and (α)*.

OK, fine. Regular expressions are strings put together with zeros, ones, epsilons, stars, plusses, and matched parentheses in certain ways. But why did we do it? And what do they mean? We shall answer this with a list of what various general regular expressions represent. First, let us define what some specific regular expressions represent.

   a) 0 represents the set {0}
   b) 1 represents the set {1}
   c) ε represents the set {ε} (the empty string)
   d) ∅ represents the empty set

Now for some general cases. If α and β are regular expressions representing the sets A and B, then:

   a) (αβ) represents the set AB
   b) (α + β) represents the set A ∪ B
   c) (α)* represents the set A*

The sets which can be represented by regular expressions are called regular sets. When writing down regular expressions to represent regular sets we shall often drop parentheses around concatenations. Some examples are 11(0 + 1)* (the set of strings beginning with two ones), 0*1* (all strings which contain a possibly empty sequence of zeros followed by a possibly empty sequence of ones), and the examples mentioned earlier. We also should note that {0,1} is not the only alphabet for regular sets. Any finite alphabet may be used.

From our precise definitions of the regular expressions and the sets they represent we can derive the following nice characterization of the regular sets. Then, very quickly, we shall relate them to finite automata.

Theorem 1. The class of regular sets is the smallest class containing the sets {0}, {1}, {ε}, and ∅ which is closed under union, concatenation, and Kleene closure.

See why the above characterization theorem is true? And why we left out the proof? Anyway, that is all rather neat, but what exactly does it have to do with finite automata?

Theorem 2. Every regular set can be accepted by a finite automaton.

Proof. The sets {0}, {1}, {ε}, and ∅ can all be accepted by finite automata. The fact that the class of sets accepted by finite automata is closed under union, concatenation, and Kleene closure completes the proof.

Just from closure properties we know that we can build finite automata to accept all of the regular sets. And this is indeed done using the constructions from the theorems. For example, to build a machine accepting (a + b)a*b, we design: M_a which accepts {a}, M_b which accepts {b}, M_(a+b) which accepts {a, b} (from M_a and M_b), M_(a*) which accepts a*, and so forth until the desired machine has been built. This is easily done automatically, and is not too bad after the final machine is reduced.

But it would be nice to have some algorithm for converting regular expressions directly to automata. The following algorithm for this will be presented in intuitive terms, in language reminiscent of language parsing and translation.

Initially, we shall take a regular expression and break it into subexpressions. For example, the regular expression (aa + b)*ab(bb)* can be broken into the three subexpressions: (aa + b)*, ab, and (bb)*. (These can be broken down later on in the same manner if necessary.) Then we number the symbols in the expression so that we can distinguish between them later. Our three subexpressions now are: (a1a2 + b1)*, a3b2, and (b3b4)*.

Symbols which lead an expression are important, as are those which end the expression. We group these in sets named FIRST and LAST. These sets for our subexpressions are:

   Expression        FIRST      LAST
   (a1a2 + b1)*      a1, b1     a2, b1
   a3b2              a3         b2
   (b3b4)*           b3         b4

Note that since the first subexpression contained a union there were two symbols in its FIRST set. The FIRST set for the entire expression is {a1, a3, b1}. The reason that a3 is in this set is that, since the first subexpression is starred, it can be skipped, and thus the first symbol of the next subexpression can be the first symbol for the entire expression. For similar reasons, the LAST set for the whole expression is {b2, b4}.

Formal, precise rules do govern the construction of the FIRST and LAST sets. We know that FIRST(a) = {a} and that we always build FIRST and LAST sets from the bottom up. Here are the remaining rules for FIRST sets.

Definition. If α and β are regular expressions then:

   a) FIRST(α + β) = FIRST(α) ∪ FIRST(β)
   b) FIRST(α*) = FIRST(α) ∪ {ε}
   c) FIRST(αβ) = FIRST(α)               if ε ∉ FIRST(α)
                  FIRST(α) ∪ FIRST(β)    otherwise

Examining these rules with care reveals that the above chart was not quite what the rules call for, since empty strings were omitted. The correct, complete chart is:

   Expression        FIRST         LAST
   (a1a2 + b1)*      a1, b1, ε     a2, b1, ε
   a3b2              a3            b2
   (b3b4)*           b3, ε         b4, ε

Rules for the LAST sets are much the same in spirit and their formulation will be left as an exercise.

One more notion is needed: the set of symbols which might follow each symbol in any string generated from the expression. We shall first provide an example and explain in a moment.

   Symbol    a1    a2            a3    b1            b2    b3    b4
   FOLLOW    a2    a1, a3, b1    b2    a1, a3, b1    b3    b4    b3

Now, how did we do this? It is almost obvious if given a little thought. The FOLLOW set for a symbol is all of the symbols which could come next. The algorithm goes as follows. To find FOLLOW(a), we keep breaking the expression into subexpressions until the symbol a is in the LAST set of a subexpression. Then FOLLOW(a) is the FIRST set of the next subexpression. Here is an example. Suppose that we have αβ as our expression and know that a ∈ LAST(α). Then FOLLOW(a) = FIRST(β). In most cases, this is the way we compute FOLLOW sets.
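Returning to the FIRST rules a) to c) above, they can be turned into a short recursive function. The sketch below is only an illustration: the nested-tuple encoding of numbered expressions and the names `first` and `EPS` are assumptions made here, not notation from the text. It also departs from rule c) in one respect, flagged in the comments: when ε is in FIRST(α), that ε is removed before the union with FIRST(β), so the result contains ε only if β can be empty too. With this reading the function reproduces the set {a1, a3, b1} quoted earlier for the whole expression.

```python
EPS = "ε"  # marker for the empty string inside FIRST sets

def first(e):
    """FIRST set of a numbered regular expression.

    Expressions are nested tuples (a hypothetical encoding):
    ("sym", "a1"), ("alt", x, y), ("cat", x, y), ("star", x).
    """
    kind = e[0]
    if kind == "sym":                      # FIRST(a) = {a}
        return {e[1]}
    if kind == "alt":                      # rule a)
        return first(e[1]) | first(e[2])
    if kind == "star":                     # rule b)
        return first(e[1]) | {EPS}
    left, right = first(e[1]), first(e[2])  # rule c), kind == "cat"
    if EPS not in left:
        return left
    # Assumption beyond the text: the ε of the left part is dropped, so ε
    # stays in the result only when the right part can be empty as well.
    return (left - {EPS}) | right

a1, a2, a3 = ("sym", "a1"), ("sym", "a2"), ("sym", "a3")
b1, b2, b3, b4 = ("sym", "b1"), ("sym", "b2"), ("sym", "b3"), ("sym", "b4")

part1 = ("star", ("alt", ("cat", a1, a2), b1))   # (a1a2 + b1)*
part2 = ("cat", a3, b2)                          # a3b2
part3 = ("star", ("cat", b3, b4))                # (b3b4)*
whole = ("cat", part1, ("cat", part2, part3))

print(sorted(first(part1)))   # ['a1', 'b1', 'ε']
print(sorted(first(part3)))   # ['b3', 'ε']
print(sorted(first(whole)))   # ['a1', 'a3', 'b1']
```

A LAST function would mirror this one with the roles of the left and right parts of a concatenation exchanged; it is omitted here because the text leaves those rules as an exercise.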

There are, however, three exceptions to the basic FOLLOW rule that must be noted.

   1) If an expression of the form aγ* is in α then we must also include the FIRST set of this starred subexpression γ.
   2) If α is of the form β* then FOLLOW(a) also contains α's FIRST set.
   3) If the subexpression to the right of α has an ε in its FIRST set, then we keep on to the right, unioning FIRST sets until we no longer find an ε in one.

Another example. Let's find the FOLLOW set for b1 in the regular expression (a1 + b1a2*)*b2*(a3 + b3). First we break it down into subexpressions until b1 is in a LAST set. These are: (a1 + b1a2*)*, b2*, and (a3 + b3). Their FIRST and LAST sets are:

   Expression        FIRST         LAST
   (a1 + b1a2*)*     a1, b1, ε     a1, b1, a2, ε
   b2*               b2, ε         b2, ε
   (a3 + b3)         a3, b3        a3, b3

Since b1 is in the LAST set of a subexpression which is starred, we place that subexpression's FIRST set {a1, b1} into FOLLOW(b1). Since a2* came after b1 and was starred, we must include a2 also. We also place the FIRST set of the next subexpression (b2*) in the FOLLOW set. Since that set contained an ε, we must put the next FIRST set in also. Thus in this example, all of the FIRST sets are combined and we have:

   FOLLOW(b1) = {a1, b1, a2, b2, a3, b3}

Several other FOLLOW sets are:

   FOLLOW(a1) = {a1, b1, b2, a3, b3}
   FOLLOW(b2) = {b2, a3, b3}

After computing all of these sets it is not hard to set up a finite automaton for any regular expression. Begin with a state named s_0. Connect it to states denoting the FIRST sets of the expression. (By sets we mean: split the FIRST set into two parts, one for each type of symbol.) Our first example (a1a2 + b1)*a3b2(b3b4)* provides:

   [Figure: first step of the construction, showing the states a1,3 and b1]

Next, connect the states just generated to states denoting the FOLLOW sets of all their symbols. Again, we have:

   [Figure: partial automaton with the states a1,3, a2, b1, and b2]

Continue on until everything is connected. Any edges missing at this point should be connected to a rejecting state named s_r. The states containing symbols in the expression's LAST set are the accepting states. The complete construction for our example (aa + b)*ab(bb)* is:

   [Figure: complete automaton with the states a1,3, a2, b1, b2, b3, b4, and s_r]
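The whole procedure, from numbered expression to automaton, can be sketched in a few dozen lines. This is a hedged illustration and not the author's own program: the tuple encoding and the names `analyse`, `build_automaton`, `accepts`, and `cat` are hypothetical, and ε is tracked with a separate `nullable` flag instead of being stored inside the FIRST and LAST sets. Three further assumptions are made: every symbol carries a distinct number, each alphabet letter is the single first character of a numbered symbol such as a1, and the start state is accepting exactly when the whole expression can generate the empty string, a case the description above does not address. FOLLOW is filled in bottom-up at every concatenation and star, which covers the three exceptions without treating them separately.

```python
from functools import reduce

def analyse(e, follow):
    """Return (nullable, FIRST, LAST) for e and fill in the FOLLOW table.

    Expressions are nested tuples (a hypothetical encoding):
    ("sym", "a1"), ("alt", x, y), ("cat", x, y), ("star", x).
    """
    kind = e[0]
    if kind == "sym":
        follow.setdefault(e[1], set())
        return False, {e[1]}, {e[1]}
    if kind == "star":
        _, f, l = analyse(e[1], follow)
        for p in l:                      # a starred part may be repeated
            follow[p] |= f
        return True, f, l
    n1, f1, l1 = analyse(e[1], follow)
    n2, f2, l2 = analyse(e[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for p in l1:                         # concatenation: right part follows left
        follow[p] |= f2
    return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

def build_automaton(e, alphabet):
    """Return (start, delta, accepting) for the numbered expression e."""
    follow = {}
    nullable, first, last = analyse(e, follow)
    start, reject = "s0", "sr"
    delta = {(reject, c): reject for c in alphabet}
    accepting = {start} if nullable else set()   # assumption, see lead-in
    pending = [(start, first)]           # (state, symbols that may come next)
    seen = {start}
    while pending:
        state, nxt = pending.pop()
        for c in alphabet:
            # One target state per letter: the next symbols spelled with c.
            target = frozenset(p for p in nxt if p[0] == c)
            if not target:
                delta[state, c] = reject  # missing edge goes to the rejecting state
                continue
            delta[state, c] = target
            if target not in seen:
                seen.add(target)
                if target & last:
                    accepting.add(target)
                pending.append((target, set().union(*(follow[p] for p in target))))
    return start, delta, accepting

def accepts(automaton, word):
    """Run the automaton on word and report whether it ends in an accepting state."""
    start, delta, accepting = automaton
    state = start
    for c in word:
        state = delta[state, c]
    return state in accepting

def cat(*parts):
    """Concatenate any number of expressions (helper for the example)."""
    return reduce(lambda x, y: ("cat", x, y), parts)

a1, a2, a3 = (("sym", "a" + str(i)) for i in (1, 2, 3))
b1, b2, b3, b4 = (("sym", "b" + str(i)) for i in (1, 2, 3, 4))

# (a1a2 + b1)* a3b2 (b3b4)*
expr = cat(("star", ("alt", cat(a1, a2), b1)), a3, b2, ("star", cat(b3, b4)))
machine = build_automaton(expr, "ab")
print(accepts(machine, "aabab"))  # True
print(accepts(machine, "abb"))    # False
```

For this example the sketch produces eight states: s0, sr, and six states holding the symbol sets {a1, a3}, {a2}, {b1}, {b2}, {b3}, and {b4}. The FOLLOW table it computes agrees with the one listed earlier for this expression.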

This construction did indeed produce an equivalent finite automaton, and in a not too inefficient manner. Though if we note that b2 and b4 are basically the same, and that b1 and a2 are similar, we can easily streamline the automaton to:

   [Figure: streamlined automaton with the states a1,3, b2,4, a2 b1, b3, and s_r]

For our final example, (a1 + b1a2*)*b2*(a3 + b3), our construction method provides:

   [Figure: automaton with the states s_0, a1,3, a1,2,3, and b1,2,3]

There is a very simple equivalent machine. Try to find it!

We now close this section with the equivalence theorem concerning finite automata and regular sets. Half of it was proven earlier in the section, but the translation of finite automata into regular expressions remains. This is not included for two reasons. First, it is very tedious, and second, nobody ever actually does that translation for any practical reason! (It is an interesting demonstration of a correctness proof which involves several levels of iteration, and should be looked up by the interested reader.)

Theorem 3. The regular sets are exactly those sets accepted by finite automata.
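The sets that regular expressions represent, as defined earlier in this section, can also be computed directly for short strings. As a hedged sketch, the function below enumerates every string of length at most n in the set represented by an expression, one case per clause of the definitions. The tuple encoding and the names `lang` and `cat` are assumptions introduced here and do not come from the text.

```python
from functools import reduce

def lang(e, n):
    """Strings of length at most n in the set represented by e.

    Expressions are nested tuples (a hypothetical encoding):
    ("empty",), ("eps",), ("sym", "0"), ("alt", x, y), ("cat", x, y), ("star", x).
    """
    kind = e[0]
    if kind == "empty":                    # ∅ represents the empty set
        return set()
    if kind == "eps":                      # ε represents {ε}
        return {""}
    if kind == "sym":                      # 0 represents {0}, 1 represents {1}
        return {e[1]} if n >= 1 else set()
    if kind == "alt":                      # (α + β) represents A ∪ B
        return lang(e[1], n) | lang(e[2], n)
    if kind == "cat":                      # (αβ) represents AB
        return {x + y for x in lang(e[1], n) for y in lang(e[2], n)
                if len(x) + len(y) <= n}
    # kind == "star": (α)* represents A*, built by repeated concatenation
    base = lang(e[1], n) - {""}
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in base
                    if len(x) + len(y) <= n} - result
        result = result | frontier
    return result

def cat(*parts):
    """Concatenate any number of expressions (helper for the examples)."""
    return reduce(lambda x, y: ("cat", x, y), parts)

zero, one = ("sym", "0"), ("sym", "1")
anything = ("star", ("alt", zero, one))                                   # (0+1)*
two_ones = cat(("star", zero), one, ("star", zero), one, ("star", zero))  # 0*10*10*
ends_11 = cat(anything, one, one)                                         # (0+1)*11

print(sorted(lang(two_ones, 3)))  # ['011', '101', '11', '110']
print(sorted(lang(ends_11, 3)))   # ['011', '11', '111']
```

The length bound is what keeps the Kleene closure finite here; the loop stops once no new string of length at most n can be formed.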