Fuzzy Matching in Audit Analytics
Grant Brodie, President, Arbutus Software
Outline
- What Is Fuzzy?
- Causes
- Effective Implementation
- Demonstration
- Application to Specific Products
- Q&A

Why Is Fuzzy Important?
- Big data
- Too many transactions
- User-entered data (web sites)
- E-commerce
- Less manual oversight

What Is Fuzzy?
- A subset of duplicates testing
- Finding specific keywords in text (FCPA, PCard)
- Close, but not the same
- Two reasonable definitions:
  - Proximity
  - Looks similar

Proximity
- Sorts close together
- Characters: Albert vs. Albertson
- Numbers: 123,456.78 vs. 123,792.16
- Dates: Jan 19, 2014 vs. Jan 20, 2014

Looks Similar
- Characters: Microsoft vs. Wicrosoft
- Numbers: 127,894.63 vs. 12,894.63
- Dates: Jan 13, 2014 vs. Jan 31, 2014

Traditional Approach to "Close"
- Pronunciation based: Soundex, NYSIIS
- Designed for names
- Many false positives
- Not useful for numbers or dates

Fuzzy Today
- Based on physical string matching:
  - Levenshtein (ACL)
  - Damerau-Levenshtein (Arbutus)
  - N-Gram
  - Jaro-Winkler
  - And many more
- Differences expressed as a distance or percentage
Quick Lesson: Damerau-Levenshtein
- Minimum number of changes to turn one string into another
- Allowed changes: insert, delete, replace, transpose
- 123 Main Street vs. 123 Main St = 4
- 34567 vs. 34576 = 1 (Levenshtein: 2)
- Rob vs. Robert = 3
- Gary vs. Mary = 1
- Gary vs. gary = 1
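The distance used in the examples above can be sketched in a few lines. This is a generic optimal-string-alignment variant of Damerau-Levenshtein, not the implementation inside any of the tools named in this deck:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                               # delete everything
    for j in range(len(b) + 1):
        d[0][j] = j                               # insert everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j],
                              d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("34567", "34576"))  # 1 (a transposition)
print(damerau_levenshtein("Gary", "gary"))    # 1 (case-sensitive)
```

Note that the transposition rule is what makes 34567 vs. 34576 a distance of 1 rather than plain Levenshtein's 2.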
Problems with String Matching
- Very literal: doesn't apply any context
- John Smith vs. John Smith (1)
- Smith John vs. Smith, John (1)
- John Smith vs. john smith (2)
- México vs. Mexico (1)
- John Smith vs. john smith scores the same (2) as John Smith vs. John Hmitz

What Do You Use?
- Whatever your tool offers
- Almost impossible to implement manually
- VERY compute intensive

Causes
- Accidental errors
- Carelessness/mistyping
- Transpositions
- Blurry source
- Punctuation
- Extra blanks
- 1 vs. I, 0 vs. O (particularly with OCR)

Errors vs. Fraud
- All of the causes above were likely errors
- Fraud uses intentional "errors" to mask activity:
  - Obscure duplicates
  - Obscure relationships
  - Trick through similarity
- Disparate systems make comparison even harder

Practical Issues
- Generally hard to target fuzzy tests; forced to use broad tests
- Most findings will be errors
- Even so, the finding is still valuable
- Need a process to address the errors found

"Our System Catches Duplicates"
- Exact matches only
- Strict application (e.g., company, vendor, invoice)
- May only warn
- Not all duplicates are payments
- Most systems only test document numbers

Types of Duplicates
- Names: personal, corporate
- Addresses
- Document numbers (e.g., invoice)
- Contact information: phone numbers, emails

Issues
- Very compute intensive (wait times)
  - Quadratic relationship: 1,000x the data = 1,000,000x the work
- False positives
- Ease of use

False Positives
- Easily the most challenging aspect
- Any time spent on a false positive is wasted
- Can easily outnumber true positives by 10, 100, or 1,000 to 1
- If too many, can remove any cost effectiveness
- How does this happen?
  - Only one way to get an exact match
  - Virtually unlimited ways to get close
False Positive Examples
- Matching to 12345 with a single difference:
  - Missing character (1245): 5 ways; transposition (12435): 4 ways
  - Incorrect character (12745): minimum 45 ways (175 if alphanumeric, 1,000+ if any character)
  - Extra character (123345): minimum 60 ways (200+ if alphanumeric, 1,000+ if any character)
- Hundreds or thousands of ways to differ by just 1
- Not just errors: all close values match
- Exponentially more at a distance of 2
- A bad actor tries to rely on being this needle in a haystack
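A brute-force count makes the arithmetic above concrete. The hypothetical sketch below enumerates every distinct string within one edit of 12345, using a digits-only alphabet:

```python
DIGITS = "0123456789"

def neighbors(s: str, alphabet: str = DIGITS) -> set:
    """Every distinct string one insert, delete, substitute,
    or adjacent transposition away from s."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])                        # deletion
        for c in alphabet:
            if c != s[i]:
                out.add(s[:i] + c + s[i + 1:])            # substitution
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])                    # insertion
    for i in range(len(s) - 1):
        if s[i] != s[i + 1]:
            out.add(s[:i] + s[i + 1] + s[i] + s[i + 2:])  # transposition
    out.discard(s)
    return out

close = neighbors("12345")
print(len(close))  # over 100 candidates, even digits-only
```

Every example from the slide (1245, 12435, 12745, 123345) lands in this set; widening the alphabet to letters or arbitrary characters multiplies the substitution and insertion counts, which is where the 1,000+ figures come from.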
How to Address the Issues
- Data preparation
- Utilize context
- Use tight specifications
- Choose software that meets your needs
- Rank your results

Choose Your Software
- Has the capabilities you need
- Can process your data volumes
- Easy to implement
- Easy to automate
- Options: ACL, Arbutus, IDEA, fraud-specific tools, non-audit tools

Data Preparation
- Remove immaterial differences first (i.e., normalization)
- Text manipulation:
  - Upper case
  - Punctuation
  - Extra blanks
  - Foreign characters (México vs. Mexico, Québec vs. Quebec)

Data Preparation (Cont.)
- Eliminate noise words
- Noise words differ by type of data:
  - Address: Suite, Unit
  - Corporate name: Company, Co, Inc
  - Personal name: Mr, Ms, Dr, Prof

Data Preparation (Cont.)
- Common misspellings/typos
- Common vocabulary (chair vs. silla)
- Differs by data type:
  - Avenue: Av, Ave, Aven, Avenu
  - First vs. 1st
  - West vs. W
  - Richard: Rick, Dick, Ricky, Rich

Data Preparation (Cont.)
- Word order: 123 W Main St. vs. 123 Main St. W
Data Preparation: Result
- Well-implemented data prep minimizes the need for fuzzy matching
- Consider the two addresses:
  - #200-1234 Main Street West
  - 1234 W MAIN ST, Suite 200
- Their Levenshtein distance is 20
- Applying data prep can make both strings identical: W ST MAIN 200 1234
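The whole pipeline (case, punctuation, blanks, accents, noise words, vocabulary, word order) can be sketched as below. This is hypothetical Python, not any product's implementation, and the noise/vocabulary tables are illustrative stubs; real lists run to hundreds of entries:

```python
import re
import unicodedata

NOISE = {"SUITE", "UNIT"}                      # stub noise-word list
VOCAB = {"STREET": "ST", "WEST": "W"}          # stub vocabulary map

def normalize(addr: str) -> str:
    """Reduce an address to a canonical comparison key."""
    # Fold accents (México -> Mexico), then upper-case.
    s = unicodedata.normalize("NFKD", addr)
    s = "".join(ch for ch in s if not unicodedata.combining(ch)).upper()
    # Punctuation becomes blanks; split() collapses extra blanks.
    s = re.sub(r"[^0-9A-Z ]", " ", s)
    words = []
    for w in s.split():
        if w in NOISE:
            continue                           # drop noise words
        words.append(VOCAB.get(w, w))          # standardize vocabulary
    # Sorting removes word-order differences entirely.
    return " ".join(sorted(words, reverse=True))

print(normalize("#200-1234 Main Street West"))   # W ST MAIN 200 1234
print(normalize("1234 W MAIN ST, Suite 200"))    # W ST MAIN 200 1234
```

With this preparation the two example addresses, 20 edits apart as raw text, become byte-identical, so an exact duplicates test finds the match with no fuzzy tolerance at all.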
False Positive Reduction: Utilize Context
- Data elements always have a context:
  - Names or addresses: location (e.g., city, state, ZIP, country)
  - Documents: vendor, employee, etc.
- Reference the similarities to minimize the ambiguity:
  - Same state and city, similar address: 123 Main St., Springfield, IL/MA
  - Same vendor, date, and amount, similar invoice number

False Positive Reduction: Use Tight Specs
- Levenshtein distance 1, or 2 at most
- Looser specifications = more false positives
- Avoid Soundex and similar approaches
- There is no substitute for good data prep

False Positives: Rank Your Results
- Order based on exposure:
  - Size of item
  - Degree of inherent risk (e.g., cash)
- Order based on degree of similarity:
  - Distance (1 vs. 2)
  - Number of matching elements

Continuous Monitoring
- Mostly errors
- Test vs. control
- Ownership of the process may relate to frequency
- Detective vs. preventative:
  - This entire presentation is detective
  - Opportunity to run against documents before committing
  - Preventative is almost certainly a control

Demonstration: Fuzzy Testing in Action
Text Manipulation: ACL
Create a computed field:
- Upper case: Upper(field)
  - FUZZYDUP ignores case, but data prep is simpler
- Punctuation: Include(field, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ "), but extra blanks remain
- Extra blanks (replace two with one): Replace(Replace(field, "  ", " "), "  ", " ")
- Foreign characters: Replace(Replace(field, "É", "E"), "Á", "A")
- Combined:
  Replace(Replace(Replace(Replace(Include(Upper(field), "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ "), "  ", " "), "  ", " "), "  ", " "), "É", "E")
- In practice, many more Replace calls
- May break up into multiple fields for clarity

Text Manipulation: Arbutus
Create a computed field:
- Upper case: Upper(field)
- Punctuation: Include(field, "0~9A~Z "), but extra blanks remain
- Extra blanks: Compact(field)
- Foreign characters: Replace(field, "É", "E", "Á", "A", ...)
- Combined: Replace(Compact(Include(Upper(field), "0~9A~Z ")), "É", "E")
- May break up into multiple fields for clarity
- Only needed for unusual situations (use the Normalize function instead)

Eliminate Noise Words: ACL
- Use whole words: Omit(field+" ", "INCORPORATED,INC,LIMITED,LTD", F)
- Don't use substrings: Omit(field, "INC") turns CINCH INDUSTRIES into CH INDUSTRIES
- Problem: there are many noise words to eliminate; two solutions:
  - One long list: Omit(field+" ", "INCORPORATED,INC,LIMITED,LTD,CORPORATION,CORP,...")
  - Sequential Omits of a variable in a group:
    v_field = Omit(field+" ", "INCORPORATED,INC")
    v_field = Omit(v_field+" ", "LIMITED,LTD")
Common Vocabulary: ACL
- Similar to noise words, only Replace instead of Omit
- Use whole words: Replace(field+" ", "ROAD ", "RD ")
  - Otherwise BROADWAY becomes BRDWAY
- Don't Omit, as Peachtree Lane is not the same as Peachtree Court
- Problem: MANY vocabulary words to potentially normalize
  - USPS: 400 street terms; 500+ male names, 700+ female names
- Same two solutions:
  - Nested functions (with Replace instead of Omit)
  - Sequential Replaces of a variable in a group
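The whole-word requirement is easy to get wrong with plain substring operations. A hypothetical Python equivalent of the Omit/Replace pattern, using word boundaries:

```python
import re

def omit_words(text: str, noise: list) -> str:
    """Remove noise words on a whole-word basis only."""
    pattern = r"\b(?:" + "|".join(map(re.escape, noise)) + r")\b"
    return re.sub(r"\s+", " ", re.sub(pattern, "", text)).strip()

def replace_words(text: str, vocab: dict) -> str:
    """Standardize vocabulary on a whole-word basis only."""
    return re.sub(r"\b\w+\b",
                  lambda m: vocab.get(m.group(0), m.group(0)), text)

# Substring removal mangles unrelated words; whole-word does not:
print("CINCH INDUSTRIES".replace("INC", ""))           # CH INDUSTRIES
print(omit_words("CINCH INDUSTRIES INC", ["INC"]))     # CINCH INDUSTRIES
print(replace_words("BROADWAY ROAD", {"ROAD": "RD"}))  # BROADWAY RD
```

The first print reproduces the slide's CINCH INDUSTRIES failure mode; the boundary-anchored versions leave BROADWAY and CINCH intact while still catching the standalone words.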
Word Order: ACL
- No practical way to address this

Noise Words and Common Vocabulary: Arbutus
- If you choose, the ACL syntax all works
- Instead, use Normalize() or SortNormalize()
- Automatically implements ALL of the data prep described (upper case, punctuation, blanks, foreign characters, noise words, vocabulary)
- Normalize(address, "addr.txt")
  - Normalize("Suite 200-1234 Main Street West", "addr.txt") = 200 1234 MAIN ST W
  - SortNormalize has the same syntax, but = W ST MAIN 200 1234
- Normalize can use a separate vocabulary file (addr.txt):
  - Replaces or omits any word, on a whole-word basis
  - User configurable and selectable, by data type

Noise Words and Common Vocabulary: Arbutus (Cont.)
Substitution file (addr.txt, for example):
  FIRST    1ST
  SEVENTH  7TH
  AV       AVE
  AVENU    AVE
  AVENUE   AVE
  AVN      AVE
  PARKWAY  PKWY
  PARKWY   PKWY
  PKWAY    PKWY
  PKY      PKWY
  SUITE
  UNIT

Utilize Context: Application
- ACL FUZZYDUP: supports only one key field
  - Concatenate fields into a single expression/computed field: state+city+address
  - Other data types require conversion: vendor+date(dt)+str(amount, 16)+invno
- Arbutus DUPLICATES: supports multiple key fields
  - Specify each key separately
  - The last key can be fuzzy

Execution: ACL
- Separate menu item: Analyze > Fuzzy Duplicates
- Choose your (concatenated) key
- Choose the difference threshold (1 or 2)
- Select other fields to use in the investigation
- Select the output table name
- Be patient

Execution: Arbutus
- Included with duplicates testing: Analyze > Duplicates
- Choose your key fields (any type)
- Choose either "near" or "similar" processing
- Choose the maximum difference (0, 1, or 2)
- Select other fields to use in the investigation
- Select the output location and name
Similar Processing: Arbutus
- Specifically designed to work with document IDs
- Uses Damerau-Levenshtein, but automatically pre-processes:
  - Removes all blanks and punctuation; upper-cases
  - Matches similar characters: O=0, I=1, 5=S, etc.
- Works on all data types:
  - 127,894.63 vs. 12,894.63 (difference 1)
  - I-12345 vs. 112345 (difference 0)
- Particularly useful with OCR
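The pre-processing step can be approximated as follows. The look-alike table here is a hypothetical illustration of the O=0, I=1, 5=S idea, not the product's actual table:

```python
import re

# Hypothetical map of visually confusable characters.
SIMILAR = str.maketrans({"O": "0", "I": "1", "L": "1", "S": "5", "B": "8"})

def canonical(doc_id: str) -> str:
    """Upper-case, drop blanks and punctuation, fold look-alikes."""
    stripped = re.sub(r"[^0-9A-Z]", "", doc_id.upper())
    return stripped.translate(SIMILAR)

print(canonical("I-12345"))     # 112345 (same as canonical("112345"))
print(canonical("127,894.63"))  # 12789463
```

Comparing the canonical forms with a distance-1 test then reproduces the slide's results: I-12345 vs. 112345 becomes an exact match, and 127,894.63 vs. 12,894.63 differs by a single deleted digit.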
Similar Processing: ACL
- Not explicitly supported
- Pre-process the data to create a computed field:
  - Upper case
  - Include only numbers and letters (no blanks or punctuation)
  - Convert numbers and dates to strings (Date or String)
- Then use the FUZZYDUP command as before

Manual Duplicates Testing: ACL
- Data prep is still important
- LevDist(string1, string2 <, case_sensitive>)
  - Case sensitive by default
  - Filter: LevDist(name1, name2, F) < 3
- IsFuzzyDup(string1, string2, distance <, diff%>)
  - Automatically case insensitive
  - Filter: IsFuzzyDup(name1, name2, 2)
- Either can also be used as a join test
Manual Duplicates Testing: Arbutus
- All case sensitive by default (assumes normalized inputs)
- Difference(string1, string2 <, case_sensitive>)
  - Filter: Difference(name1, name2, F) < 3
- Near(field1, field2, difference)
  - Filter: Near(name1, name2, 2)
  - Applies to all data types
  - Characters: Damerau-Levenshtein; numbers and dates: proximity (4799 vs. 4803)
- Similar(field1, field2, difference)
  - Applies to all data types; always uses Damerau-Levenshtein
  - Characters: prepared data; numbers and dates compared as strings (123,456 vs. 12,456)
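For numeric and date keys, "near" reduces to a simple proximity check. A hedged Python analogue of the idea (not the Arbutus function itself), treating dates as day counts:

```python
from datetime import date

def near(a, b, difference) -> bool:
    """Proximity test: values match when at most `difference` apart.
    Numbers compare directly; dates compare by days elapsed."""
    if isinstance(a, date) and isinstance(b, date):
        return abs((a - b).days) <= difference
    return abs(a - b) <= difference

print(near(4799, 4803, 5))                            # True: gap of 4
print(near(date(2014, 1, 19), date(2014, 1, 20), 1))  # True: 1 day apart
```

This is the proximity definition from the opening slides: values that sort close together match, regardless of how different their digits look.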
Find Specific Keywords in Text: ACL
- Very common for purchase card reviews, FCPA
- Use the Find function:
  - Filter: IF Find("Exotic", desc)
  - Multiple words: IF Find("Exotic", desc) OR Find("IPad", desc)
  - Not case sensitive, not whole-word
- Or create a logical computed field (say, Exception):
  - T IF Find("Exotic", desc)
  - T IF Find("IPad", desc)
  - F otherwise
  - Filter: IF Exception
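The same filter is a one-liner in most languages. In this sketch the keyword list is a hypothetical stand-in for a real exception list:

```python
KEYWORDS = ["EXOTIC", "IPAD"]  # hypothetical exception list

def has_keyword(desc: str) -> bool:
    """Case-insensitive substring search, like Find (not whole-word)."""
    text = desc.upper()
    return any(word in text for word in KEYWORDS)

print(has_keyword("Dinner, Exotic Lounge"))  # True
print(has_keyword("Office supplies"))        # False
```

Because the search is substring-based rather than whole-word, a keyword like IPAD will also flag descriptions containing it inside longer words, which matches the Find behavior described above and is usually acceptable for a review filter.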
Find Specific Keywords in Text: Arbutus
- The Find function works the same as in ACL
- Use the ListFind function instead:
  - Filter: IF ListFind("exceptions.txt", desc)
  - Simple text file, easily maintained in Notepad; unlimited entries
  - Supports an external reference file or an internal array
- Like Find: not case sensitive, not whole-word

Q & A