Genetic programming with regular expressions




Genetic programming with regular expressions Børge Svingen Chief Technology Officer, Open AdExchange bsvingen@openadex.com 2009-03-23

Pattern discovery Pattern discovery: Recognizing patterns that characterize features in data. Type of data and example feature: Meteorological data - bad weather. DNA - predisposition for disease. Seismic data - presence of oil. Financial data - changes in stock prices.

Purpose of this lecture Three things: A practical how-to on pattern discovery. An example of using formal methods to solve a practical problem. A demonstration of a promising topic for future work.

Pattern discovery in sequences We focus on finding patterns in sequences: Biological sequences (DNA, RNA, amino acids etc.) Time series (temperature, stock prices, etc.) Mathematical sequences (arithmetic, geometric etc.)

What do sequences have in common? What do sequences that share a feature have in common? What do genetic sequences that give a predisposition for a disease have in common? What do stock price time series that lead to a crash have in common? What do geometric sequences have in common?

Training sets Training sets: Input to the pattern discovery algorithm Positive training set: Contains sequences that have the feature Negative training set: Contains sequences that do not have the feature Negative training set not always present One solution: Use random sequences as negative training set

Representing sequences - languages Formal definitions: Alphabet: a set of characters. String: a finite sequence of characters from an alphabet. Language: a set of strings. We want to represent languages, i.e., the sets of strings given by the training sets.

Representing sequences - types of languages Types of languages: Regular languages. Can be decided by a finite automaton. Context-free languages. Can be decided by a push-down automaton. Context-sensitive languages. Can be decided by a linear bounded automaton (a Turing machine whose tape is limited to the length of the input). Recursive languages. Can be decided by a Turing machine. Recursively enumerable languages. Can be enumerated by a Turing machine. We will focus on regular languages.

Deterministic finite automata (DFA) [Figure: a DFA with states s0, s1, s2, s3.] Represents the language containing the strings 01, 001, 0001, 00001, ... and 10, 110, 1110, 11110, ...

DFA definition A deterministic finite automaton is a 5-tuple (Q, Σ, δ, q0, F) where: Q is a finite set of states; Σ is an alphabet; δ : Q × Σ → Q is the transition function; q0 ∈ Q is the start state; F ⊆ Q is the set of accept states.

DFA example [Figure: the DFA from the previous slide.] Q = {s0, s1, s2, s3}, Σ = {0, 1}, q0 = s0, F = {s3}. The transition function δ:
δ(s0, 0) = s1, δ(s0, 1) = s2
δ(s1, 0) = s1, δ(s1, 1) = s3
δ(s2, 0) = s3, δ(s2, 1) = s2
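As a sketch, the example DFA can be simulated directly from its transition table. This Python snippet is illustrative and not part of the original slides; the state names follow the example, and a missing transition (e.g. out of s3) is treated as rejection:

```python
# Transition table of the example DFA; missing entries act as a dead state.
DELTA = {
    ("s0", "0"): "s1", ("s0", "1"): "s2",
    ("s1", "0"): "s1", ("s1", "1"): "s3",
    ("s2", "0"): "s3", ("s2", "1"): "s2",
}
ACCEPT = {"s3"}

def dfa_accepts(string, delta=DELTA, start="s0", accept=ACCEPT):
    """Run the DFA over the string; reject on a missing transition."""
    state = start
    for symbol in string:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in accept
```

For example, `dfa_accepts("0001")` and `dfa_accepts("1110")` hold, matching the strings listed on the previous slide.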

Nondeterministic finite automata (NFA) [Figure: a three-state NFA with transitions on a, b and ε.] NFAs can have multiple choices for moving between states. All options must be evaluated; the automaton is in multiple states at once.

NFA definition A nondeterministic finite automaton is a 5-tuple (Q, Σ, δ, q0, F) where: Q is a finite set of states; Σ is an alphabet; δ : Q × Σε → P(Q) is the transition function, where Σε = Σ ∪ {ε} and P(Q) is the power set of Q; q0 ∈ Q is the start state; F ⊆ Q is the set of accept states.
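A hedged sketch of NFA simulation: track the set of current states and take ε-closures as symbols are read. The dictionary-of-sets transition format is this sketch's own convention, not anything prescribed by the slides:

```python
EPS = ""  # epsilon transitions are keyed by the empty string in this sketch

def eps_closure(states, delta):
    """All states reachable from `states` via epsilon transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        for nxt in delta.get((stack.pop(), EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_accepts(string, delta, start, accept):
    """Simulate the NFA by keeping the full set of possible states."""
    current = eps_closure({start}, delta)
    for symbol in string:
        moved = set()
        for state in current:
            moved |= delta.get((state, symbol), set())
        current = eps_closure(moved, delta)
    return bool(current & accept)
```

With a toy relation such as `{(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}` (strings ending in "ab"), `nfa_accepts("aab", ..., 0, {2})` holds.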

Evolutionary algorithms Using evolution for solving problems: A population of solutions. Selection based on fitness (how well the solution solves the problem). Reproduction with mutation. Repeat for a number of generations. [Figure: the loop: initial generation → evaluation → good enough? If yes, done; if no, selection, then reproduction, then evaluation again.]
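The loop on this slide can be sketched in Python. The selection scheme below (keep the top half, fill up with mutants of random parents) is a deliberate simplification of fitness-based selection, and the function names are this sketch's own:

```python
import random

def evolve(random_individual, fitness, mutate, pop_size=50, generations=30):
    """Schematic evolutionary loop: evaluate, test, select, reproduce.

    Assumes fitness is scaled so that 1.0 means a perfect solution.
    """
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        if fitness(scored[0]) >= 1.0:          # "good enough?" test
            break
        parents = scored[: pop_size // 2]      # truncation selection
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return max(population, key=fitness)
```

Because the parents are carried over unchanged, the best fitness never decreases from one generation to the next.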

Types of evolutionary algorithms Evolutionary algorithms include: evolutionary programming, genetic programming, genetic algorithms, evolution strategies, and learning classifier systems.

Genetic programming - evolving programs In GP the individuals of the population are programs. The programs are in the form of trees (can be seen as parse trees). Fitness is evaluated by running the program. [Figure: a program tree for if (x > 3) then x else 4.]

Examples of GP applications Designing electric circuits. Optimization problems. Robot control. Pattern discovery. Symbolic regression. [Figure: a program tree for x + 7 * x.]

Fitness Fitness tells us how good a program is at solving the problem. Fitness is calculated by a fitness function. The fitness of a program decides the probability of being selected for the next generation. The goal of genetic programming is to optimize the fitness function. Important: The fitness function needs to allow for gradual improvements.

The fitness function Different types of fitness: Raw fitness. Application-specific; f_r(i, t) gives the raw fitness of individual i in generation t. Standardized fitness. Standardized fitness f_s(i, t) is raw fitness adjusted so that lower values are better and 0 is best. Adjusted fitness. Adjusted fitness is standardized fitness rescaled so that all fitness values fall between 0 and 1, with 1 being the best: f_a(i, t) = 1 / (1 + f_s(i, t)). Normalized fitness. Normalized fitness is adjusted fitness normalized so that the sum of fitness over the whole population is 1: f_n(i, t) = f_a(i, t) / Σ_{k=1}^{M} f_a(k, t), where M is the population size.
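The adjusted and normalized fitness formulas translate directly into code; this is a minimal illustration of the two definitions above, taking standardized fitness values as input:

```python
def adjusted(f_s):
    """Adjusted fitness: f_a = 1 / (1 + f_s), so 0 (best) maps to 1."""
    return 1.0 / (1.0 + f_s)

def normalized(standardized_fitnesses):
    """Normalized fitness: adjusted fitness divided by the population total."""
    adj = [adjusted(f) for f in standardized_fitnesses]
    total = sum(adj)
    return [a / total for a in adj]
```

For instance, standardized fitnesses [0, 1] give adjusted fitnesses [1, 0.5] and normalized fitnesses [2/3, 1/3], which sum to 1 as required.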

Program primitives Programs are built from a function set and a terminal set. An important property is closure: all functions should accept all values returned by other functions or terminals. In this example, F = {if, >} and T = {x, N}. [Figure: the program tree for if (x > 3) then x else 4.]

The function set The function set contains the internal nodes of the program tree. Each function has one or more children providing its input. Functions can be purely functional or have side effects. [Figure: the program tree for if (x > 3) then x else 4.]

Terminal set The terminal set contains the leaf nodes of the program tree. Terminals can have side effects. Ephemeral terminals are a special case, typically used for constants. [Figure: the program tree for if (x > 3) then x else 4.]

Growing trees The initial population consists of random trees, with functions and terminals randomly selected. Two main ways of building random trees of a given depth: The full method: all leaves have the same depth. The grow method: randomly choose between functions and terminals, creating leaves at different depths. The ramped half-and-half method: tree depths are equally distributed over a range of depths, and for each tree of a given depth we randomly choose between the full and grow methods. This creates tree-shape diversity.
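A minimal sketch of the three initialization methods, assuming a toy function set with fixed arities and trees represented as nested tuples. The particular function names, terminals, and the 0.3 terminal probability in grow are illustrative choices, not values from the slides:

```python
import random

FUNCTIONS = {"+": 2, "*": 2, "-": 2}   # name -> arity (toy example)
TERMINALS = ["x", "y", "7"]

def full(depth):
    """Full method: every leaf sits at exactly the given depth."""
    if depth == 0:
        return random.choice(TERMINALS)
    fn = random.choice(list(FUNCTIONS))
    return (fn,) + tuple(full(depth - 1) for _ in range(FUNCTIONS[fn]))

def grow(depth):
    """Grow method: mix functions and terminals, so leaf depths vary."""
    if depth == 0 or random.random() < 0.3:   # 0.3 is an arbitrary choice
        return random.choice(TERMINALS)
    fn = random.choice(list(FUNCTIONS))
    return (fn,) + tuple(grow(depth - 1) for _ in range(FUNCTIONS[fn]))

def ramped_half_and_half(pop_size, max_depth):
    """Alternate full and grow over a range of depths for shape diversity."""
    depths = list(range(2, max_depth + 1))
    return [(full if i % 2 else grow)(random.choice(depths))
            for i in range(pop_size)]
```

Trees built by full(3) always have depth exactly 3, while grow(3) can return anything from a bare terminal up to a depth-3 tree.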

Growing trees - full [Figure: an example tree built with the full method over the functions +, /, - and terminals such as x, y, 2, 3, 7; every leaf is at the same depth.]

Growing trees - grow [Figure: an example tree built with the grow method over the same primitives; leaves occur at different depths.]

Reproduction - crossover [Figure: two parent trees, with the crossover points marked.]

Reproduction - crossover results [Figure: the two offspring produced by swapping the selected subtrees.]

Crossover maintains building blocks The crossover point is selected randomly, and whole subtrees are exchanged between programs. A subtree represents a separate piece of functionality. This causes building blocks of good solutions to survive into future generations and to recombine. [Figure: an exchanged subtree.]
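Subtree crossover can be sketched as follows, reusing a nested-tuple tree representation (strings as terminals, tuples as function nodes). Picking crossover points uniformly over all nodes is one common choice, not necessarily the exact scheme the slides assume:

```python
import random

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(tree, path, subtree):
    """Return a copy of the tree with the node at `path` replaced."""
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], subtree),) + tree[i + 1:]

def crossover(parent_a, parent_b):
    """Swap two randomly chosen subtrees between the parents."""
    path_a, sub_a = random.choice(list(nodes(parent_a)))
    path_b, sub_b = random.choice(list(nodes(parent_b)))
    return replace(parent_a, path_a, sub_b), replace(parent_b, path_b, sub_a)
```

A useful sanity check: since whole subtrees are exchanged, the total node count of the two offspring always equals that of the two parents.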

Genetic programming with search We want to find patterns. Solution: genetic programming where the programs are queries, so the patterns are represented by the queries. [Figure: a query tree combining the terms yes, no and maybe with OR and AND.]

Evolving queries Every member of the population is a query. We evaluate each query by searching the training sets. The fitness function is given by how closely the queries match the training sets. Trivial fitness: count the number of incorrect classifications.
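The trivial fitness described here, counting incorrect classifications, can be sketched with Python's re module standing in for whatever matcher the GP system actually uses; lower values are better (a standardized fitness in the slides' terminology):

```python
import re

def misclassifications(pattern, positives, negatives):
    """Count errors: positives the pattern misses plus negatives it matches."""
    matcher = re.compile(pattern)
    errors = sum(1 for s in positives if not matcher.fullmatch(s))
    errors += sum(1 for s in negatives if matcher.fullmatch(s))
    return errors
```

For example, the pattern `1*` scores 0 against positives {"", "1", "111"} and negatives {"0", "10"}, i.e. a perfect classification.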

Genetic programming with search - an example An example of genetic programming with search ([3, 2, 5]): Genetic programming done on the genetic programming mailing list. Simple single-word-based search. Trying to classify articles about GP selection methods. GP done on positive and negative training sets. Results tested on a separate test set. The evolved program:
ADF1: (IF (OR P0 (PRESENT candidate)) (IF (+ (PRESENT tournament) (PRESENT demes)) 1 P0) (IF (PRESENT tournaments) 8607 (IF (PRESENT tournament) 1 (PRESENT (- (PRESENT scant) 1)))))
ADF2: (+ 3980 (NOT P0))
ADF3: (IF (PRESENT tournament) 1 (- (ADF1 P0)))
RPB0: (IF (ADF2 1 1) (- (- (PRESENT deme)) (ADF3 (PRESENT pet))) (ADF3 0))
RPB1: (IF (PRESENT galapagos) 5976 (PRESENT deme))
RPB2: 1

Picking a query language There are a number of query languages available (SQL, XQuery, SPARQL, ...). For sequences: regular expressions. Advantage with regard to GP: regular expressions can be seen as trees. [Figure: the regular expression ab*c drawn as a tree.]

Regular expressions Regular expressions can be defined by the following grammar:
R → a, for some a ∈ Σ (1)
R → ε (2)
R → ∅ (3)
R → (R1 ∪ R2) (4)
R → (R1 ∘ R2) (5)
R → (R1*) (6)
Σ is here the alphabet used, (R1 ∪ R2) matches either R1 or R2, and (R1 ∘ R2) matches R1 followed by R2. (R1*) matches any number of occurrences of R1.
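Because regular expressions are trees, the grammar above maps naturally onto a recursive data structure. This sketch renders such a tree in conventional notation (| for union, juxtaposition for concatenation); the operator names "union", "concat" and "star" are this sketch's own, and rules (2) and (3) are omitted:

```python
def to_string(r):
    """Render a regular-expression tree (nested tuples) as a string."""
    if isinstance(r, str):                  # rule (1): a symbol from Σ
        return r
    op = r[0]
    if op == "union":                       # rule (4)
        return "(" + to_string(r[1]) + "|" + to_string(r[2]) + ")"
    if op == "concat":                      # rule (5)
        return to_string(r[1]) + to_string(r[2])
    if op == "star":                        # rule (6)
        inner = to_string(r[1])
        return ("(" + inner + ")" if isinstance(r[1], tuple) else inner) + "*"
    raise ValueError("unknown operator: %r" % op)
```

The tree ("concat", ("concat", "a", ("star", "b")), "c") renders as ab*c, the example used on the surrounding slides.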

Why regular languages are called regular Regular expressions represent regular languages. Important consequence of this: regular expressions and DFAs are equivalent. [Figure: the DFA equivalent to ab*c: s0 goes to s1 on a, s1 loops on b, and s1 goes to the accept state s2 on c.]

Equivalence proof for DFA and regular expressions Proof outline: DFAs and NFAs are equivalent: DFA → NFA is trivial, since every DFA is an NFA; NFA → DFA: create a DFA whose states are sets of NFA states. Regular expression → DFA: 1. Build an NFA recursively from the regular expression. 2. Convert the NFA to a DFA. DFA → regular expression: more complex. The main idea is to use GNFAs (NFAs whose edges may be labeled with regular expressions) and to convert the GNFA step by step into a single regular expression.
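The NFA → DFA step of the proof, the subset construction, can be sketched as follows. Each DFA state is a frozenset of NFA states; ε-transitions are omitted for brevity, and the dictionary-of-sets transition format is an assumption of this sketch:

```python
def nfa_to_dfa(delta, start, accept, alphabet):
    """Subset construction: build a DFA whose states are sets of NFA states."""
    start_set = frozenset({start})
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of all NFA moves from the states in `current`.
            nxt = frozenset(s for q in current
                            for s in delta.get((q, symbol), set()))
            dfa_delta[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    # A set-state accepts iff it contains at least one NFA accept state.
    dfa_accept = {s for s in seen if s & accept}
    return dfa_delta, start_set, dfa_accept
```

The result is a complete DFA over the given alphabet (the empty frozenset plays the role of the dead state), which can be run with an ordinary table lookup per symbol.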

Pattern evolution An algorithm for evolving DFAs: 1. Use GP to find regular expressions. 2. Convert the regular expressions to DFAs.

A practical example [4] Used the Tomita benchmark languages, a set of seven regular languages. For each language, used positive and negative training sets of 500 strings, the latter randomly created. Each GP individual was a regular expression tree. Each regular expression tree was evaluated on the training sets by creating a DFA. Population size of 10000 over 100 generations.

The Tomita benchmark languages
TL1: 1*
TL2: (10)*
TL3: no odd-length run of 0's after an odd-length run of 1's
TL4: no 000 substrings
TL5: an even number of 01's and of 10's
TL6: the number of 1's minus the number of 0's is a multiple of 3
TL7: 0*1*0*1*

Function set
+ (arity 2): builds an automaton that accepts any string accepted by either of the two argument automata.
. (arity 2): builds an automaton that accepts any string that is the concatenation of two strings accepted by the two argument automata, respectively.
* (arity 1): builds an automaton that accepts any string that is the concatenation of any number of strings, each of which is accepted by the argument automaton.

Terminal set
0: returns an automaton accepting the single character 0.
1: returns an automaton accepting the single character 1.

Results 1-4
TL1: (* (* 1)); simplified: 1*
TL2: (* (* (. 1 0))); simplified: (10)*
TL3: (. (* (+ (. 1 (+ (+ 1 1) (. 1 0))) (* 0))) (. (* (+ (. 1 (+ (+ 1 (. 0 0)) (. 1 0))) (* (. 0 0)))) (* 1))); simplified: (11 110 0)*(11 100 110 00)*1*
TL4: (+ 0 (. (+ (* (+ 1 (. 0 (+ 1 (. 0 1))))) 1) (+ (+ (. (. 0 0) (* 1)) 0) (* (+ 1 (. (+ (. 1 0) 1) 0)))))); simplified: ((1 01 001)* 001* 0) (1 100 10)*

Results 5-7
TL5: (+ (* (+ 0 (. (. 0 (* (* (. (* 0) 1)))) 0))) (* (. (. 1 (* (. (* 0) 1))) (* 1)))); simplified: (0 0(0*1)*0)* (1(0*1)*1*)*
TL6: (* (+ (* (+ (. 1 (. (* (. 1 0)) 0)) (. (. 0 (* (* (. 0 1)))) 1))) (+ (* (+ (. 1 (. 1 1)) (. (. (. 1 1) (* (. 0 1))) 1))) (. (. (. 0 (* (* (. 0 1)))) 0) 0)))); simplified: (1(10)*0 0(01)*1 11(01)*1 0(01)*00)*
TL7: (. (. (. (* (* 0)) (* 1)) (* 0)) (* (+ 1 1))); simplified: 0*1*0*1*

Pattern Matching Chip (PMC)

The end.

Bibliography
[1] Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, and Olaf Birkeland. A recursive MISD architecture for pattern matching. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(7):727-734, July 2004.
[2] Børge Svingen. GP++ - an introduction. In John R. Koza, editor, Late Breaking Papers at the 1997 Genetic Programming Conference, pages 231-239, Stanford University, CA, USA, 13-16 July 1997. Stanford Bookstore.
[3] Børge Svingen. Using genetic programming for document classification. In John R. Koza, editor, Late Breaking Papers at the 1997 Genetic Programming Conference, pages 240-245, Stanford University, CA, USA, 13-16 July 1997. Stanford Bookstore.
[4] Børge Svingen. Learning regular languages using genetic programming. In John R. Koza, Wolfgang Banzhaf, Kumar Chellapilla, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max H. Garzon, David E. Goldberg, Hitoshi Iba, and Rick Riolo, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 374-376, University of Wisconsin, Madison, Wisconsin, USA, 22-25 July 1998. Morgan Kaufmann.
[5] Børge Svingen. Using genetic programming for document classification. In Diane J. Cook, editor, Proceedings of the Eleventh International Florida Artificial Intelligence Research Symposium Conference. AAAI Press, 1998.
[6] Michael Sipser. Introduction to the Theory of Computation. PWS Publishing Company, 1997.
[7] M. Tomita. Dynamic construction of finite-state automata from examples using hill climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105-108, Ann Arbor, MI, 1982.