Index support for regular expression search. Alexander Korotkov PGCon 2012, Ottawa



Similar documents
Unique column combinations

Lecture Notes on Database Normalization

HOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14

6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, Class 4 Nancy Lynch

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

Data Mining: Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar

Boolean Algebra (cont d) UNIT 3 BOOLEAN ALGEBRA (CONT D) Guidelines for Multiplying Out and Factoring. Objectives. Iris Hui-Ru Jiang Spring 2010

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Unit 3 Boolean Algebra (Continued)

Honors Class (Foundations of) Informatics. Tom Verhoeff. Department of Mathematics & Computer Science Software Engineering & Technology

Finite Automata. Reading: Chapter 2

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Die Welt Multimedia-Reichweite

Introduction to Finite Automata

How to bet using different NairaBet Bet Combinations (Combo)

Finite Automata. Reading: Chapter 2

10CS35: Data Structures Using C

The safer, easier way to help you pass any IT exams. SAP Certified Application Associate - SAP HANA 1.0. Title : Version : Demo 1 / 5

United States Naval Academy Electrical and Computer Engineering Department. EC262 Exam 1

KNOWLEDGE IS POWER The BEST first-timers guide to betting on and winning at the races you ll EVER encounter

CH3 Boolean Algebra (cont d)


Compiler Construction

Data Mining Apriori Algorithm

Automata and Formal Languages

Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

Python Loops and String Manipulation

Converting Finite Automata to Regular Expressions

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

PES Institute of Technology-BSC QUESTION BANK

Domain Name System. DNS is an example of a large scale client-server application. Copyright 2014 Jim Martin

Association Analysis: Basic Concepts and Algorithms

Genetic programming with regular expressions

ASSIGNMENT ONE SOLUTIONS MATH 4805 / COMP 4805 / MATH 5605

Frequent item set mining

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions

Lexical Analysis and Scanning. Honors Compilers Feb 5 th 2001 Robert Dewar

Large-Scale Data Cleaning Using Hadoop. UC Irvine

Soccer Bet Types Content

Database Design and Normalization

Association Rule Mining

CIRCLE THEOREMS. Edexcel GCSE Mathematics (Linear) 1MA0

Boolean Algebra Part 1

The University of the State of New York REGENTS HIGH SCHOOL EXAMINATION GEOMETRY. Thursday, January 24, :15 a.m. to 12:15 p.m.

Wan Accelerators: Optimizing Network Traffic with Compression. Bartosz Agas, Marvin Germar & Christopher Tran

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile

Class Notes CS Creating and Using a Huffman Code. Ref: Weiss, page 433

How To Find Out What A Key Is In A Database Engine

Lecture 18 Regular Expressions

1. Domain Name System

澳 門 彩 票 有 限 公 司 SLOT Sociedade de Lotarias e Apostas Mútuas de Macau, Lda. Soccer Bet Types

Mining Association Rules. Mining Association Rules. What Is Association Rule Mining? What Is Association Rule Mining? What is Association rule mining

Why is Internal Audit so Hard?

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Lecture 24: Saccheri Quadrilaterals

Informatique Fondamentale IMA S8

Being Regular with Regular Expressions. John Garmany Session

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

COLLEGE ALGEBRA. Paul Dawkins

Hint 1. Answer (b) first. Make the set as simple as possible and try to generalize the phenomena it exhibits. [Caution: the next hint is an answer to

CS103B Handout 17 Winter 2007 February 26, 2007 Languages and Regular Expressions

Binary Coded Web Access Pattern Tree in Education Domain

Benchmark Databases for Testing Big-Data Analytics In Cloud Environments

Finite Automata and Regular Languages

Listeners. Formats. Free Form. Formatted

28 Simply Confirming Onsite

Today s Agenda. Automata and Logic. Quiz 4 Temporal Logic. Introduction Buchi Automata Linear Time Logic Summary

DATA STRUCTURES USING C

Introduction to SQL for Data Scientists

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

Business Process Modeling

International Journal of Advanced Research in Computer Science and Software Engineering

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

03 - Lexical Analysis

AREAS OF PARALLELOGRAMS AND TRIANGLES

Course Manual Automata & Complexity 2015

SPAMMING BOTNETS: SIGNATURES AND CHARACTERISTICS

Managing large sound databases using Mpeg7

Pushdown automata. Informatics 2A: Lecture 9. Alex Simpson. 3 October, School of Informatics University of Edinburgh als@inf.ed.ac.

University Convocation. IT 3203 Introduction to Web Development. Pattern Matching. Why Match Patterns? The Search Method. The Replace Method

Using Logic to Design Computer Components

Deterministic Finite Automata

Computational Geometry Lab: FEM BASIS FUNCTIONS FOR A TETRAHEDRON

Topic 11. Statistics 514: Design of Experiments. Topic Overview. This topic will cover. 2 k Factorial Design. Blocking/Confounding

Lecture 1. Basic Concepts of Set Theory, Functions and Relations

Regular Languages and Finite State Machines

: HP HP0-M28. : Implementing HP Asset Manager Software. Version : R6.1

CAD Algorithms. P and NP

ALTIRIS CMDB Solution 6.5 Product Guide

Lecture 2: Regular Languages [Fa 14]

Geometry Module 4 Unit 2 Practice Exam

Transcription:

Index support for regular expression search Alexander Korotkov PGCon 2012, Ottawa

Introduction

What is regular expressions? Regular expressions are: powerful tool for text processing based on formal language theory expressing same class of languages as finite automata

So automata Regular expression can be transformed into automaton. Moreover, such transformation is really used by regex engines

So automata Automaton is a graph which vertices are states and which arcs are labeled by characters. Automaton reads string if you can type that string by a traversal from initial state to final state

Example /a(bc)*d/ becomes

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz

Example xyzabcbcdxyz Finish! Match!

Okay, now all of us know Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Regex based search PostgreSQL can regex based search :) It s only a sequential search for a while :(

Inverted indexes on q-grams

Q-grams Q-gram is substring of length q which can be used as a signature of original string Widely used in various string processing tasks

Inverted index on q-grams Maintain association between q- gram and all the strings where it mentioned. pg_trgm has an implementation for q = 3

pg_trgm 1. "regular expressions", 2. "expressive speach", 3. "regular speaker" => ' e': {1,2} 'egu': {1,3} 'pre': {1,2} ' r': {1,3} 'er ': {3} 'reg': {1,3} ' s': {2,3} 'ess': {1,2} 'res': {1,2} ' ex': {1,2} 'exp': {1,2} 'sio': {1} ' re': {1,3} 'gul': {1,3} 'siv': {2} ' sp': {2,3} 'ion': {1} 'spe': {2,3} 'ach': {2} 'ive': {2} 'ssi': {1,2} 'ake': {3} 'ker': {3} 'ula': {1,3} 'ar ': {1,3} 'lar': {1,3} 've ': {2} 'ch ': {2} 'ns ': {1} 'xpr': {1,2} 'eac': {2} 'ons': {1} 'eak': {3} 'pea': {2,3} Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Q-grams frequencies From 2.5M of DBLP paper titles 360K contain trigram the Only 1 contains trigram zzz "Zzzzzzzzzzzzzzzzzzzzzzzzzz" chapter of the book "Formal Specification and Development in Z and B" by David Everett Not all q-grams of fixed q are equally useful.

V-grams or multigrams Each q-gram can have specific q Selectivity of all q-grams are similar (and low enough) More effective index search!

V-grams or multigrams Problems: Hard to maintain online

Patent trololo! Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

How to use it for regex search?

General idea /[ab]cde/ => (acd OR bcd) AND cde

General idea /[ab]cde/ => (acd OR bcd) AND cde How to do this in general case?

Existing approaches for q-gram extraction

Scholar paper Junghoo Ch and Sridhar Rajagopalan, A fast regular expression indexing engine, Proceedings 18th International Conference on Data Engineering, 2002 Still widely referenced as state of art work about indexing for regular expressions.

FREE method Extract tree of continuous string fraction from regex. Transform those continuous fractions to multigrams (q-grams with variable q). Use inverted index on multigrams for query evaluation

FREE method: example Tree for /(abcd efgh)(ijklm x*)/

Scholar paper: example Replace * nodes with NULL

Scholar paper: example NULL eats parent OR node

Scholar paper: example AND node eats child NULL

Scholar paper: example Simplify a bit

Scholar paper: example Expand continuous string fractions into trigrams

Google code search Was launched in 2006. Supports regex search. Google guys are smart. It can t be a sequential scan. It also seemed to be something better than previous technique. We don t know what... :(

We didn t know what until Google code search was closed in 2011 :( Russ Cox has published description of indexing technique in January 2012 http://swtch.com/~rsc/regex p/regexp4.html More than 5 years of intrigue!

Google code search method Get 5 characteristics about each part of regex: emptyable, exact, prefix, suffix, match. Recursively union them (with possible simplification) Use inverted index of trigrams for query evaluation (similar to pg_trgm)

Google code search method Original regex: /a(bc)+d/ a: {exact: a} bc: {exact: bc} d: {exact: d} (bc)+: {prefix:bc, suffix: bc} a(bc)+: {prefix:abc, suffix:bc} a(bc)+d: {prefix:abc, suffix:bcd}

Google code search method /a(bc)+d/ {prefix:abc, suffix:bcd} abc AND bcd

Proposed method

Proposed method Transform automaton into automaton like graph on trigrams Simplify that graph Use pg_trgm indexes

Transformation example /a(b+ c+)d/

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Transformation example Result could be simplified

abd abb bbd Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa acd acc ccd Transformation example Implemented simplification technique: collect following matrix. 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 Means: abd OR (abb AND bbd) OR acd OR (acc AND ccd)

Comparison on examples Regex: /(abc cba)def/ FREE: (abc OR cba) AND def GSC: def AND ((abc AND bcd AND cde) OR (ade AND bad AND cba)) My: (abc AND bcd AND cde AND def) OR (ade AND bad AND cba AND def) Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Comparison on examples Regex: /abc+de/ FREE: nothing GSC: abc AND cde My: (abc AND cde AND bcd) OR (abc AND cde AND bcc AND ccd)

Comparison on examples Regex: /(abc*)+de/ FREE: nothing GSC: nothing My: (abd AND bde) OR (abc AND bcd AND cde) OR (abc AND bcc AND ccd AND cde)

Comparison on examples Regex: /ab(cd)*ef/ FREE: nothing GSC: nothing My: (abe AND bef) OR (abc AND bde AND cde AND def)

Performance results 2.5 M DBLP paper titles of 47 avg. length Regex Index scan Seq scan /database.*(sql query)/ 773 ms 18653 ms /postgres(ql)?/ 268 ms 17574 ms /plan+er/ 253 ms 12885 ms /(nucl anino).*acid/ 200 ms 20085 ms /[aei](bc)+a/ 2 ms 13195 ms

WIP patch for pg_trgm WIP patch was posted to mailing list: http://archives.postgresql.org/pgsqlhackers/2011-11/msg01297.php Any feedback is welcome. Index support for regular expression search, Alexander Korotkov, PGCon 2012, Ottawa

Problems Possible large transformed graph Possible large simplified presentation and it s high computational complexity Usage of trigrams rather than v-grams or multigrams

What did we miss? Difference between deterministic and nondeterministic automata Grouping characters into colors Handling of start/end of string/line

Help needed Regular expressions and string collections from reallife tasks for proving effectiveness of proposed method.

Thank you for attention!