Fuzzy Matching in Audit Analytics. Grant Brodie, President, Arbutus Software

Outline
- What Is Fuzzy?
- Causes
- Effective Implementation
- Demonstration
- Application to Specific Products
- Q&A

Why Is Fuzzy Important?
- Big data
- Too many transactions
- User-entered data (web sites)
- E-Commerce
- Less manual oversight

What Is Fuzzy?
- Subset of duplicates testing
- Find specific keywords in text (FCPA, PCard)
- Close, but not the same
- Two reasonable definitions
  - Proximity
  - Looks similar

Proximity
- Sorts close together
- Characters: Albert vs. Albertson
- Numbers: 123,456.78 vs. 123,792.16
- Dates: Jan 19, 2014 vs. Jan 20, 2014

Looks Similar
- Characters: Microsoft vs. Wicrosoft
- Numbers: 127,894.63 vs. 12,894.63
- Dates: Jan 13, 2014 vs. Jan 31, 2014

Traditional Approach to Close
- Pronunciation based
  - Soundex
  - NYSIIS
- Designed for names
- Many false positives
- Not useful for numbers or dates

Fuzzy Today
- Based on physical string matching
  - Levenshtein (ACL)
  - Damerau-Levenshtein (Arbutus)
  - N-Gram
  - Jaro-Winkler
  - And many more
- Differences expressed as a distance or percentage

Quick Lesson: Damerau-Levenshtein
- Minimum number of changes to make one string into another
- Insert, delete, replace, transpose
- 123 Main Street vs. 123 Main St = 4
- 34567 vs. 34576 = 1 (Levenshtein: 2)
- Rob vs. Robert = 3
- Gary vs. Mary = 1
- Gary vs. gary = 1
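The distances on this slide can be reproduced with a short Python sketch. It implements the "optimal string alignment" form of Damerau-Levenshtein, which is an assumption here: the exact variant used by any particular audit product is not stated in the slides. The function name `osa_distance` and the `transpose` flag are illustrative only.

```python
def osa_distance(a: str, b: str, transpose: bool = True) -> int:
    """Count the inserts, deletes, replaces (and, optionally, swaps of two
    adjacent characters) needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every character of a
    for j in range(cols):
        d[0][j] = j                      # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            if (transpose and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(osa_distance("123 Main Street", "123 Main St"))    # 4
print(osa_distance("34567", "34576"))                    # 1
print(osa_distance("34567", "34576", transpose=False))   # 2 (plain Levenshtein)
```

With `transpose=False` the same routine behaves as plain Levenshtein, which is why the transposed pair scores 2 instead of 1.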

Problems with String Matching
- Very literal
- Doesn't apply any context
- John Smith vs. John  Smith, with an extra blank (1)
- Smith John vs. Smith, John (1)
- John Smith vs. john smith (2)
- México vs. Mexico (1)
- John Smith vs. john smith scores the same as John Smith vs. John Hmitz (2)

What Do You Use?
- Whatever your tool offers
- Almost impossible to implement manually
- VERY compute intensive

Causes
- Accidental errors
  - Carelessness/mistyping
  - Transpositions
  - Blurry source
  - Punctuation
  - Extra blanks
  - 1 vs. I, 0 vs. O (particularly with OCR)

Errors vs. Fraud
- All of the causes were likely errors
- Fraud uses intentional errors to mask activity
  - Obscure duplicates
  - Obscure relationships
  - Trick through similarity
- Disparate systems make comparison even harder

Practical Issues
- Generally hard to target fuzzy tests
- Forced to use broad tests
- Most findings will be errors
- Even so, the finding is still valuable
- Need a process to address errors found

Our System Catches Duplicates
- Exact matches only
- Strict application (i.e. company, vendor, invoice)
- May only warn
- Not all duplicates are payments
- Most only test document numbers

Types of Duplicates
- Names
  - Personal
  - Corporate
- Addresses
- Document numbers (e.g., invoice)
- Contact information
  - Phone numbers
  - Emails

Issues
- Very compute intensive (wait times)
  - Quadratic relationship: work grows with the square of the data volume
  - 1000x data = 1,000,000x more work
- False positives
- Ease of use
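The "1000x data = 1,000,000x more work" figure follows from comparing every record against every other record. A small Python check, assuming a naive all-pairs comparison with no grouping or indexing (real tools may well do better):

```python
def pair_comparisons(n: int) -> int:
    """Number of distinct record pairs in a table of n records."""
    return n * (n - 1) // 2


small, large = 1_000, 1_000_000          # 1000x more data
ratio = pair_comparisons(large) / pair_comparisons(small)
print(pair_comparisons(small))           # 499500
print(round(ratio))                      # 1001000, roughly a million times more work
```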

False Positives
- Easily the most challenging aspect
- Any time spent on a false positive is wasted
- Can easily outnumber the true positives by 10, 100, 1000 to 1
- If too many, can remove any cost effectiveness
- How does this happen?
  - Only one way to get an exact match
  - Virtually unlimited ways to get close

False Positive Examples
- Matching to 12345 with a single difference:
  - Missing (1245): 5
  - Transposition (12435): 4
  - Incorrect (12745): min 45 (175 if alpha, 1,000+ if any char)
  - Extra (123345): min 60 (200+ if alpha, 1,000+ if any char)
- Hundreds/thousands of ways that differ by just 1
- Not just errors, all close values
- Exponentially more with a distance of 2
- Bad actor tries to rely on his needle in a haystack
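The digit-only counts above can be checked by enumerating every single-edit variant of 12345. This is an illustrative Python sketch; `single_edit_variants` is a hypothetical helper, and it counts distinct resulting strings. The slide's "min 60" for extra characters matches the number of insert operations (6 positions x 10 digits); a few of those operations produce the same string, so the distinct count is slightly lower.

```python
DIGITS = "0123456789"


def single_edit_variants(s: str, alphabet: str = DIGITS) -> dict:
    """Distinct strings exactly one edit away from s, grouped by edit type."""
    n = len(s)
    return {
        "missing": {s[:i] + s[i + 1:] for i in range(n)},
        "transposition": {
            s[:i] + s[i + 1] + s[i] + s[i + 2:]
            for i in range(n - 1) if s[i] != s[i + 1]
        },
        "incorrect": {
            s[:i] + c + s[i + 1:]
            for i in range(n) for c in alphabet if c != s[i]
        },
        "extra": {
            s[:i] + c + s[i:]
            for i in range(n + 1) for c in alphabet
        },
    }


variants = single_edit_variants("12345")
counts = {kind: len(values) for kind, values in variants.items()}
insert_operations = (len("12345") + 1) * len(DIGITS)

print(counts)             # missing 5, transposition 4, incorrect 45, extra 55
print(insert_operations)  # 60 insert operations, 55 distinct results
```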

How to Address the Issues
- Data preparation
- Utilize context
- Use tight specifications
- Choose software that meets needs
- Rank your results

Choose Your Software
- Has the capabilities you need
- Can process your data volumes
- Easy to implement
- Easy to automate
- ACL, Arbutus, IDEA, fraud-specific, non-audit tools

Data Preparation
- Remove immaterial differences first (i.e., normalization)
- Text manipulation
  - Upper case
  - Punctuation
  - Extra blanks
  - Foreign characters (México vs. Mexico, Québec vs. Quebec)

Data Preparation (Cont.)
(Remove immaterial differences first, normalization)
- Eliminate noise words
- Different by type of data
  - Address: Suite, Unit
  - Corporate name: Company, Co, Inc
  - Personal name: Mr, Ms, Dr, Prof

Data Preparation (Cont.)
(Remove immaterial differences first, normalization)
- Common misspellings/typos
- Common vocabulary (chair vs. silla)
- Different by data type
  - Avenue: Av, Ave, Aven, Avenu
  - First vs. 1st
  - West vs. W
  - Richard, Rick, Dick, Ricky, Rich

Data Preparation (Cont.)
(Remove immaterial differences first, normalization)
- Word order
  - 123 W Main St. vs. 123 Main St. W

Data Preparation: Result
- Well-implemented data prep minimizes the need for fuzzy
- Consider the two addresses:
  - #200-1234 Main Street West
  - 1234 W MAIN ST, Suite 200
- Levenshtein distance is 20
- Applying data prep can make both strings identical
  - W ST MAIN 200 1234
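The preparation steps from the preceding slides (upper case, foreign characters, punctuation, extra blanks, noise words, common vocabulary, word order) can be sketched in Python as below. The `NOISE` and `VOCAB` tables are tiny illustrative assumptions rather than a real vocabulary file, and sorting the words in descending order is simply one choice that reproduces the string shown on the slide.

```python
import re
import unicodedata

NOISE = {"SUITE", "UNIT"}                                # words to drop (assumed list)
VOCAB = {"STREET": "ST", "WEST": "W", "AVENUE": "AVE"}   # word -> standard form (assumed)


def normalize(text: str, sort_words: bool = False) -> str:
    """Remove immaterial differences so equivalent values compare equal."""
    # Upper case, then split accented letters and drop the accent marks.
    text = unicodedata.normalize("NFKD", text.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Punctuation becomes a blank; split() then collapses extra blanks.
    words = re.sub(r"[^0-9A-Z]+", " ", text).split()
    # Whole-word noise removal and vocabulary substitution.
    words = [VOCAB.get(w, w) for w in words if w not in NOISE]
    if sort_words:
        words.sort(reverse=True)                         # neutralize word order
    return " ".join(words)


print(normalize("#200-1234 Main Street West", sort_words=True))  # W ST MAIN 200 1234
print(normalize("1234 W MAIN ST, Suite 200", sort_words=True))   # W ST MAIN 200 1234
print(normalize("México"))                                       # MEXICO
```

Once both addresses normalize to the same string, an exact duplicates test finds the pair and no fuzzy comparison is needed.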

False Positive Reduction: Utilize Context
- Data elements always have a context
  - Names or address: location (e.g., city, state, ZIP, country, etc.)
  - Documents: vendor, employee, etc.
- Reference the similarities to minimize the ambiguity
  - Same state, city, similar address: 123 Main St., Springfield, IL/MA
  - Same vendor, date, amount, similar invoice number
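One way to apply context is to require an exact match on the context fields and run the fuzzy comparison only inside each matching group. The Python sketch below assumes records are dictionaries; the field names (`vendor`, `date`, `amount`, `invno`), the sample rows, and the `fuzzy_duplicates` helper are all hypothetical, and plain Levenshtein is used for brevity.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_duplicates(rows, context_keys, fuzzy_key, max_distance=1):
    """Pairs that agree exactly on the context keys and nearly on the fuzzy key."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in context_keys)].append(row)
    hits = []
    for group in groups.values():
        for left, right in combinations(group, 2):
            dist = edit_distance(left[fuzzy_key], right[fuzzy_key])
            if dist <= max_distance:
                hits.append((left, right, dist))
    return hits


rows = [
    {"vendor": "V100", "date": "2014-01-19", "amount": 1250.00, "invno": "INV1001"},
    {"vendor": "V100", "date": "2014-01-19", "amount": 1250.00, "invno": "INV1O01"},
    {"vendor": "V200", "date": "2014-01-19", "amount": 1250.00, "invno": "INV1002"},
]
hits = fuzzy_duplicates(rows, ("vendor", "date", "amount"), "invno")
for left, right, dist in hits:
    print(left["invno"], right["invno"], dist)   # INV1001 INV1O01 1
```

The third row is within one edit of INV1001 but belongs to a different vendor, so it is never compared. Grouping also cuts the number of comparisons, since pairs are only formed inside each group.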

False Positive Reduction: Use Tight Specs
- Levenshtein distance 1, or 2 max
- Looser specifications = more false positives
- Avoid Soundex and similar approaches
- There is no substitute for good data prep

False Positives: Rank Your Results
- Order based on exposure
  - Size of item
  - Degree of inherent risk (cash)
- Order based on degree of similarity
  - Distance (1 vs. 2)
  - Number of matching same elements
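A ranking along these lines can be as simple as a compound sort. In this Python sketch the match records and their field names are made up for illustration, and putting similarity ahead of exposure is an assumption; the slide does not say which ordering should take priority.

```python
# Hypothetical fuzzy-match results: closer pairs and larger amounts matter more.
matches = [
    {"pair": ("INV2207", "INV2270"), "distance": 1, "amount": 310.00},
    {"pair": ("INV1001", "INV1O01"), "distance": 1, "amount": 1250.00},
    {"pair": ("INV5150", "INV5510"), "distance": 2, "amount": 98000.00},
]

# Closest matches first; within the same distance, largest exposure first.
ranked = sorted(matches, key=lambda m: (m["distance"], -m["amount"]))

for m in ranked:
    print(m["distance"], m["amount"], m["pair"])
```

Swapping the two elements of the sort key would put exposure first instead.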

Continuous Monitoring
- Mostly errors
- Test vs. control
  - Ownership of the process
  - May relate to frequency
- Detective vs. Preventative
  - Entire presentation detective
  - Opportunity to run against documents before committing
  - Preventative almost certainly a control

Fuzzy Testing in Action
- Demonstration

Text Manipulation: ACL
- Create a computed field
- Upper case: Upper(field) (FUZZYDUP ignores case, but data prep is simpler)
- Punctuation: Include(field, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ "), but
- Extra blanks (replace 2 with 1): Replace(Replace(field, "  ", " "), "  ", " ")
- Foreign characters: Replace(Replace(field, "É", "E"), "Á", "A")
- Combined: Replace(Replace(Replace(Replace(Include(Upper(field), "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ "), "  ", " "), "  ", " "), "  ", " "), "É", "E")
- In practice, many more replace calls
- May break up into multiple fields for clarity

Text Manipulation: Arbutus
- Create a computed field
- Upper case: Upper(field)
- Punctuation: Include(field, "0~9A~Z "), but
- Extra blanks: Compact(field)
- Foreign characters: Replace(field, "É", "E", "Á", "A", …)
- Combined: Replace(Compact(Include(Upper(field), "0~9A~Z ")), "É", "E")
- May break up into multiple fields for clarity
- Only for unusual situations (use Normalize function)

Eliminate Noise Words: ACL
- Use whole words: Omit(field+, INCORPORATED,INC,LIMITED,LTD, F), but
- Don't: Omit(field, INC ): CINCH INDUSTRIES becomes CH INDUSTRIES
- Problem is, many noise words to eliminate; two solutions:
  - Long list: Omit(field+, INCORPORATED,INC,LIMITED,LTD,CORPORATION, CORP, )
  - Sequential omits of a variable in a group:
    v_field=omit(field+, INCORPORATED,INC )
    v_field=omit(v_field +, LIMITED,LTD )

Common Vocabulary: ACL
- Similar to noise words, only Replace instead of Omit
- Use whole words: Replace(field+, ROAD, RD )
  - Otherwise, BROADWAY becomes BRDWAY
- Don't omit, as Peachtree Lane is not the same as Peachtree Court
- Problem is, MANY vocabulary words to potentially normalize
  - USPS: 400 street terms, 500+ male names, 700+ female names
- Nested functions (with Replace instead of Omit)
- Sequential replaces of a variable in a group
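The whole-word pitfall (CINCH becoming CH, BROADWAY becoming BRDWAY) is easy to reproduce outside any audit tool. The Python sketch below contrasts a plain substring replace with a word-boundary replace; `replace_whole_words` is a hypothetical helper, and mapping a word to an empty string plays the role of omitting it.

```python
import re


def replace_whole_words(text: str, mapping: dict) -> str:
    """Replace (or, with an empty replacement, omit) whole words only."""
    # Longest words first so INCORPORATED is tried before INC.
    words = sorted(mapping, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    replaced = re.sub(pattern, lambda m: mapping[m.group(0)], text)
    return " ".join(replaced.split())    # tidy blanks left by omitted words


# Substring replacement damages unrelated words:
print("CINCH INDUSTRIES".replace("INC", ""))     # CH INDUSTRIES
print("BROADWAY".replace("ROAD", "RD"))          # BRDWAY

# Whole-word replacement leaves them intact:
print(replace_whole_words("CINCH INDUSTRIES INC", {"INC": "", "INCORPORATED": ""}))
# CINCH INDUSTRIES
print(replace_whole_words("12 BROADWAY ROAD", {"ROAD": "RD"}))
# 12 BROADWAY RD
```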

Word Order: ACL
- No practical way to address this

Noise Words and Common Vocabulary: Arbutus
- If you choose, ACL syntax all works
- Instead: use Normalize() or SortNormalize()
  - Automatically implements ALL of the data prep described (upper case, punctuation, blanks, foreign, noise, vocabulary)
  - Normalize(address, "addr.txt")
  - Norm("Suite 200-1234 Main Street West", "addr.txt") = 200 1234 MAIN ST W
  - SortNormalize has the same syntax, but = W ST MAIN 200 1234
- Normalize can use a separate vocabulary file (addr.txt)
  - Replaces or omits any word, on a whole word basis
  - User configurable and selectable, by data type

Noise Words and Common Vocabulary: Arbutus
Substitution file (addr.txt, for example); a word with no replacement is omitted:

  Word      Replacement
  FIRST     1ST
  SEVENTH   7TH
  AV        AVE
  AVENU     AVE
  AVENUE    AVE
  AVN       AVE
  PARKWAY   PKWY
  PARKWY    PKWY
  PKWAY     PKWY
  PKY       PKWY
  SUITE
  UNIT

Utilize Context: Application
- ACL FUZZYDUP: only supports one key field
  - Concatenate fields into a single expression/computed field
  - State+City+Address
  - Other data types require conversion: vendor+date(dt)+str(amount, 16)+invno
- Arbutus DUPLICATES: supports multiple key fields
  - Specify each key separately
  - Last key can be fuzzy

Execution: ACL
- Separate menu item: Analyze/fuzzy duplicates
- Choose your (concatenated) key
- Choose difference threshold (1 or 2)
- Select other fields to use in investigation
- Select the output table name
- Be patient

Execution: Arbutus
- Included with duplicates testing: Analyze/duplicates
- Choose your key fields (any type)
- Choose either near or similar processing
- Choose max. difference (0, 1, or 2)
- Select other fields to use in investigation
- Select output location and name

Similar Processing: Arbutus
- Specifically designed to work with document IDs
- Uses Damerau-Levenshtein, but automatically pre-processes
  - Removes all blanks and punctuation, upper cases
  - Matches similar characters: O=0, I=1, 5=S, etc.
- Works on all data types
  - 127,894.63 vs. 12,894.63 (diff. 1)
  - I-12345 vs. 112345 (diff. 0)
- Particularly useful with OCR
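The pre-processing described here can be approximated in a few lines of Python. This is only a sketch of the idea, not the product's implementation: the look-alike table is limited to the three pairs named on the slide, and folding letters into digits (rather than the reverse) is an assumption.

```python
import re

# Look-alike characters folded to a single representative (assumed direction).
LOOKALIKES = str.maketrans({"O": "0", "I": "1", "S": "5"})


def canonical_id(value) -> str:
    """Upper-case, drop blanks and punctuation, then fold look-alike characters."""
    cleaned = re.sub(r"[^0-9A-Z]", "", str(value).upper())
    return cleaned.translate(LOOKALIKES)


print(canonical_id("I-12345"))       # 112345
print(canonical_id("112345"))        # 112345  (difference 0)
print(canonical_id("127,894.63"))    # 12789463
print(canonical_id("12,894.63"))     # 1289463 (one character missing, difference 1)
```

After this step, any edit-distance function can be applied to the canonical strings.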

Similar Processing: ACL
- Not explicitly supported
- Pre-process the data to create a computed field
  - Upper case
  - Include only numbers and letters (no blanks, punctuation)
  - Convert numbers and dates to strings (date or string)
- Use the FUZZYDUP command as in the past

Manual Duplicates Testing: ACL
- Data prep is still important
- LevDist(string1, string2 <, case sensitive>)
  - Case sensitive by default
  - Filter: LevDist(name1, name2, F) < 3
- IsFuzzyDup(string1, string2, distance <, diff%>)
  - Automatically case insensitive
  - Filter: IsFuzzyDup(name1, name2, 2)
- Can also be used as a join test

Manual Duplicates Testing: Arbutus
- All case sensitive, by default (assumes normalized inputs)
- Difference(string1, string2 <, case sensitive>)
  - Filter: difference(name1, name2, F) < 3
- Near(field1, field2, difference)
  - Filter: near(name1, name2, 2)
  - Applies to all data types
  - Char: Damerau-Levenshtein; numbers and dates: proximity (4799 vs 4803)
- Similar(field, field2, difference)
  - Applies to all data types, always uses Damerau-Levenshtein
  - Char: prepared data; numbers and dates: 123,456 vs. 12,456

Find Specific Keywords in Text: ACL
- Very common for purchase card reviews, FCPA
- Use the Find function:
  - Filter: IF Find("Exotic", desc)
  - Multiple words: IF Find("Exotic", desc) OR Find("IPad", desc)
  - Not case sensitive, not whole word
- Create a Logical computed field (say "Exception"):
  - T IF Find("Exotic", desc)
  - T IF Find("IPad", desc)
  - F
  - Filter: IF Exception

Find Specific Keywords in Text: Arbutus
- Find function works the same as ACL
- Use the ListFind function instead:
  - Filter: IF ListFind("exceptions.txt", desc)
  - Simple text file
  - Easily maintained in Notepad
  - Unlimited entries
- Supports an external reference file or an internal array
- Like Find function, not case sensitive, not whole word
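The same kind of keyword screen can be sketched in Python for readers working outside these tools. The `list_find` helper and the sample descriptions are hypothetical; the keyword list is held in memory here, but it could equally be read from a simple text file with one entry per line. As on the slide, matching is not case sensitive and not whole word.

```python
def list_find(keywords, text: str) -> bool:
    """True if any keyword occurs anywhere in text, ignoring case."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords if k)


keywords = ["Exotic", "IPad"]            # e.g. the lines of an exceptions file
descriptions = [
    "2 ipads for regional office",
    "Office chairs (qty 12)",
    "Exotically priced catering",
]
exceptions = [d for d in descriptions if list_find(keywords, d)]
print(exceptions)   # the first and third descriptions are flagged
```

Because the match is not whole word, "Exotically" is flagged by the keyword "Exotic"; that keeps the screen broad at the cost of extra false positives.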

Q & A
- Questions