Why is Internal Audit so Hard? 2 2014
Why is Internal Audit so Hard?
  o Waste
  o Abuse
  o Fraud
Waves of Change
  1st Wave: Personal Computers
  o Electronic spreadsheets
  o The end of hand calculation
2nd Wave: ERPs
  o ERPs: all our data in one place
  o Database analysis
  o Opens the Age of Rules
  [Timeline graphic: 1st Wave: personal computers, electronic spreadsheets, the end of hand calculation.]
2nd Wave Also Opens the Age of CAATs
  Beginner's CAATs:
  o Basic database manipulation: join, summarize, append, stratify, sample, extract
  o Basic testing: duplicates, gaps
  Intermediate CAATs: automate our rules and (limited) automated testing, for example in purchase-to-pay:
  o P.O. with blank / zero amount
  o Split P.O.s
  o Duplicate invoices
  o Invoice amount paid > goods received
  o Invoices with no matching receiving report
  o Multiple invoices for same P.O. and date
  o Pattern of sequential invoices from a vendor
  o Non-approved vendors
  o Employee and vendor with same name, address, bank, etc.
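Two of the purchase-to-pay rules above (duplicate invoices, and invoices with no matching receiving report) can be sketched as simple deterministic tests. This is only an illustration: the field names and sample records are hypothetical, not taken from any particular ERP or CAAT tool.

```python
from collections import defaultdict

# Hypothetical sample data; field names are assumptions for illustration.
invoices = [
    {"id": 1, "vendor": "ACME", "invoice_no": "1001", "amount": 500.00, "po": "PO-1"},
    {"id": 2, "vendor": "ACME", "invoice_no": "1001", "amount": 500.00, "po": "PO-1"},
    {"id": 3, "vendor": "BOLT", "invoice_no": "77", "amount": 120.00, "po": "PO-2"},
    {"id": 4, "vendor": "CRUX", "invoice_no": "9", "amount": 80.00, "po": "PO-3"},
]
receiving_reports = {"PO-1", "PO-2"}  # P.O.s that have a receiving report on file


def duplicate_invoices(rows):
    """Group invoice ids that share vendor, invoice number, and amount exactly."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["vendor"], row["invoice_no"], row["amount"])
        groups[key].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def invoices_without_receipt(rows, received_pos):
    """Return ids of invoices whose P.O. has no receiving report."""
    return [row["id"] for row in rows if row["po"] not in received_pos]


print(duplicate_invoices(invoices))
print(invoices_without_receipt(invoices, receiving_reports))
```

Because both tests require exact matches, a vendor keyed as "ACME Inc." on one invoice and "ACME" on the other would slip through.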
3rd Wave: Predictive Analytics
  Predictive Analytics focuses our attention on important / suspect transactions.
  Comes in many different flavors:
  o Each somewhat more sophisticated
  o Each making audit work more accurate and our lives easier
  (GTAG 16, 2011: "The use of data analysis can significantly reduce audit risk by honing the risk assessment and stratifying the population.")
  [Timeline graphic: 1st Wave: personal computers, electronic spreadsheets, the end of hand calculation. 2nd Wave: ERPs, data in one place, database analysis, Age of Rules. 3rd Wave: predictive analytics, sophisticated statistical insights, true predictive and continuous audit.]
5 Levels of Predictive Analytics
  1. Statistical Insights
  2. Fuzzy Logic
  3. Clustering
  4. Predictive Modeling
  5. Big Data Analytics
Statistical Insights: Benford's Law
  o The most famous name in forensic accounting does not belong to an accountant: Frank Benford (1883-1948) was an American physicist.
  o In 1938, at the age of 55, he published a paper titled "The Law of Anomalous Numbers."
  o Benford's Law is a statement about the occurrence of digits in lists of data.
  o Useful in detecting fraudulent invoices or other numbered documents.
Benford's Law: Distribution of 1st Digits
  [Chart: Benford's distribution vs. observed distribution of first digits]
Which to Investigate?
  For distributions that appear to be anomalous:
  1. Calculate the Kolmogorov-Smirnov distance between the vendor's first-digit distribution and the ideal Benford distribution.
  2. Investigate those with the largest numerical scores.
  Under Benford's Law, the distribution of first digits follows a logarithmic pattern. It applies to a surprising range of datasets, including country populations, Twitter users by follower count, and many more. See testingbenefordslaw.com for more examples.
  The Kolmogorov-Smirnov distance is the largest absolute difference between the two cumulative distribution functions (CDFs).
  Source: Graph: Pivotal, Inc., Machine Learning for Forensic Accounting, 2013
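The two steps above can be sketched in a few lines. Benford's expected share for first digit d is log10(1 + 1/d); the score is the largest gap between the observed and expected cumulative shares. The vendor names and amounts below are made up for illustration, and a real test would need far more than nine invoices per vendor.

```python
import math
from collections import Counter


def benford_expected():
    """Benford probability for each first digit d = 1..9: log10(1 + 1/d)."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(amount):
    """First significant digit of a non-zero amount."""
    return int(f"{abs(amount):.15e}"[0])  # scientific notation starts with that digit


def ks_distance(amounts):
    """Largest absolute gap between observed and Benford first-digit CDFs."""
    counts = Counter(first_digit(a) for a in amounts if a != 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one non-zero amount")
    observed_cdf = expected_cdf = worst = 0.0
    for digit, p in zip(range(1, 10), benford_expected()):
        observed_cdf += counts.get(digit, 0) / total
        expected_cdf += p
        worst = max(worst, abs(observed_cdf - expected_cdf))
    return worst


# Hypothetical invoice amounts per vendor.
vendors = {
    "Vendor A": [120.50, 1875.00, 230.10, 310.00, 1420.75, 96.40, 178.00, 2650.00, 540.00],
    "Vendor B": [910.00, 925.50, 940.00, 955.25, 970.00, 985.00, 990.00, 995.00, 999.00],
}
scores = {name: ks_distance(amounts) for name, amounts in vendors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # largest score first

for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

Vendor B, whose amounts all start with 9, ranks first and would be the one to investigate.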
Fuzzy Logic: Duplicate Invoice Detection
  Problem: Deterministic rules expect key information to be exactly the same:
  o Vendor name
  o Address
  o Phone
  o Invoice amount
  o Date
  o Bank account
  o TIN
  If the criteria are kept tight: too many false negatives (missed duplicates).
  If the criteria are made loose: too many false positives, resulting in too many items to investigate.
Fuzzy Matching Using Natural Language Processing
  Vendors are considered close matches when the following are identical or sufficiently similar:
  o Vendor names
  o Remit vendor
  o Address & phone
  o Other text-based fields of your choosing
  Steps in Natural Language Processing (NLP):
  1. Tokenize the vendor names.
  2. Remove stop words and special characters (of, and, the, ...).
  3. Process synonyms and abbreviations.
  4. Calculate the tf-idf (term frequency-inverse document frequency) weight of each word.
  5. Calculate the cosine similarity between documents to identify close matches.
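The five NLP steps can be sketched with the standard library alone. Several choices here are assumptions for illustration: the abbreviation map is hypothetical, the stop-word list is only the three words named on the slide, and the idf uses a smoothed form, log((1 + N) / (1 + df)) + 1, which is one common variant among several.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"of", "and", "the"}
# Hypothetical abbreviation map; a real one would be built from your vendor file.
SYNONYMS = {"inc": "incorporated", "corp": "corporation", "co": "company"}


def tokenize(name):
    """Steps 1-3: split into words, drop stop words and punctuation, expand abbreviations."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return [SYNONYMS.get(w, w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(names):
    """Step 4: one {term: tf-idf weight} dictionary per vendor name."""
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)  # documents containing term
    vectors = []
    for doc in docs:
        length = sum(doc.values())
        vectors.append({
            term: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(u, v):
    """Step 5: cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


names = ["Acme Supply Co.", "ACME Supply Company", "The Acme Supply Corp", "Zenith Tools Inc"]
vectors = tfidf_vectors(names)
for i in range(1, len(names)):
    print(names[0], "vs", names[i], round(cosine(vectors[0], vectors[i]), 3))
```

After normalization the first two names are identical, the third is close, and the fourth shares no words with the first.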
Fuzzy Matching in Numerical Strings
  Numerical values (strings) are considered close when:
  o Invoice IDs: edit distance is small
  o Dates: are the same, are within 7 days of each other, or have day and month swapped (3/11/14 vs 11/3/14)
  o Payments: amounts are identical, or edit distances are small
  o TINs, bank accounts, other numerics: edit distances are small
  Edit distance counts substitutions, additions, deletions, and transpositions, and is calculated as the Damerau-Levenshtein distance.
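A minimal sketch of the two checks above. The edit distance shown is the restricted form of Damerau-Levenshtein (often called optimal string alignment), which counts substitutions, additions, deletions, and swaps of adjacent characters; the unrestricted form can give a smaller value on some inputs. The 7-day window comes from the slide; the function names are illustrative.

```python
from datetime import date


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[rows - 1][cols - 1]


def dates_close(d1, d2, window_days=7):
    """True if the dates are within the window or have day and month swapped."""
    if abs((d1 - d2).days) <= window_days:
        return True
    return d1.year == d2.year and d1.month == d2.day and d1.day == d2.month


print(osa_distance("INV-10234", "INV-10243"))              # one swapped pair of digits
print(dates_close(date(2014, 3, 11), date(2014, 11, 3)))   # 3/11/14 vs 11/3/14
```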
Fuzzy Matching
  Use as many features of the invoice as desired
  o Not limited to 3 dimensions
  1. Determine the best distance metric for each dimension
     o Some are text-based
     o Others are numerical strings
  2. Calculate the distance between invoices
  3. Adjust the measurement values to yield the best true positive result
  4. Investigate any pair of invoices where the distance is within your threshold
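One way to combine per-dimension distances into a single invoice-to-invoice distance is a weighted sum, sketched below. Everything tunable here is an assumption: the weights, the threshold, the 30-day scaling of dates, and the use of Python's difflib.SequenceMatcher as a simple stand-in for the text and edit-distance metrics described earlier.

```python
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights and threshold; tune them against known duplicates (step 3).
WEIGHTS = {"vendor": 0.4, "invoice_no": 0.2, "amount": 0.3, "date": 0.1}
THRESHOLD = 0.15


def text_distance(a, b):
    """0.0 for identical strings, approaching 1.0 as they share less."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def amount_distance(a, b):
    """Relative difference between two amounts, from 0.0 to 1.0 for positive amounts."""
    biggest = max(abs(a), abs(b))
    return abs(a - b) / biggest if biggest else 0.0


def date_distance(a, b, window_days=30):
    """Days apart, scaled to 0.0-1.0 and capped at the window."""
    return min(abs((a - b).days) / window_days, 1.0)


def invoice_distance(x, y):
    """Weighted sum of the per-dimension distances (step 2)."""
    parts = {
        "vendor": text_distance(x["vendor"], y["vendor"]),
        "invoice_no": text_distance(x["invoice_no"], y["invoice_no"]),
        "amount": amount_distance(x["amount"], y["amount"]),
        "date": date_distance(x["date"], y["date"]),
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Made-up invoices for illustration.
invoices = [
    {"id": "A", "vendor": "Acme Supply Co", "invoice_no": "10234", "amount": 1500.00, "date": date(2014, 3, 11)},
    {"id": "B", "vendor": "ACME Supply Co.", "invoice_no": "10243", "amount": 1500.00, "date": date(2014, 3, 13)},
    {"id": "C", "vendor": "Zenith Tools", "invoice_no": "88", "amount": 42.10, "date": date(2014, 7, 1)},
]

# Step 4: every pair whose distance falls within the threshold.
suspects = [
    (x["id"], y["id"])
    for x, y in combinations(invoices, 2)
    if invoice_distance(x, y) <= THRESHOLD
]
print(suspects)
```

Comparing every pair grows quadratically with the number of invoices, so in practice the comparison would likely be limited to candidate groups, for example invoices with the same amount or from similar vendor names.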
Clustering: Identify Invoice Anomalies with Vendor Baselining
  Vendors will tend to have patterns in their billing but may have more than one pattern based on service, ordering business unit, specific users, delivery address, etc. There may be multiple normal behaviors.
  Example (Vendor A):
  o Payments ~$1,000 to $5,000; Bus Unit: Bldg Maintenance; Users: Loc 1, Loc 2, Loc 3; paid by ACH; to address ABC
  o Payments <$700; Bus Unit: Security; Users: Loc Z; paid by check; to address GHI
  o Payments >$100,000; Bus Unit: Construction; Users: Loc 4; paid by ACH; to address DEF
  Identify the true outliers for investigation by:
  1. Featurizing the invoices (see fuzzy logic)
  2. Running a clustering algorithm such as K-Means
  3. Identifying clusters with low populations and low density as potential anomalies
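The baselining steps can be sketched with a small K-Means written from scratch. This is a simplified illustration: the invoices are made up, each is reduced to two features (log of the amount and a paid-by-ACH flag), the starting centroids are picked by a deterministic farthest-point rule rather than at random, and only cluster population is checked, not density.

```python
import math
from collections import Counter

MIN_CLUSTER_SIZE = 2  # hypothetical cutoff for a "low population" cluster


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def k_means(points, k, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids, repeat until stable."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Made-up invoices for one vendor: (amount, paid_by_ach).
invoices = [
    (1200.00, True), (2500.00, True), (4800.00, True), (3100.00, True), (1500.00, True),
    (450.00, False), (600.00, False), (300.00, False), (520.00, False),
    (125000.00, True),
]
features = [(math.log10(amount), float(ach)) for amount, ach in invoices]

labels = k_means(features, k=3)
sizes = Counter(labels)
rare = {label for label, n in sizes.items() if n < MIN_CLUSTER_SIZE}
flagged = [inv for inv, label in zip(invoices, labels) if label in rare]
print(flagged)
```

Here the single large ACH payment ends up alone in its own cluster and is flagged. Whether it is a true outlier or a legitimate third billing pattern is exactly what the investigation has to determine.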
Predictive Modeling: Time Travel in the 21st Century
Type 1: Prediction by Scoring
  [Diagram: Your Financial System feeds a Machine Learning System]
  o Current You: examine lots of possible FWA (fraud, waste, and abuse) invoices every month.
  o Do this once: ML learns what is FWA.
  o Future You: ML continuously monitors and scores from 1 to 100; examine only the high-scoring items.
Type 2: Prediction by Actual Value
  Example from Insurance
  Inputs to the Machine Learning System:
  o $ premium, SIC code, # employees, address, $ sales (N1 to N100)
  o Claim file (N1 to N100)
  Historical data from many sources is combined to train the ML system to predict the correct $ premium.
  Predicted Premium   Actual Premium Paid   Variance
  $ 10,254            $ 9,946                 -3%
  $ 25,687            $ 26,971                 5%
  $ 5,621             $ 5,452                 -3%
  $ 96,321            $ 98,247                 2%
  $ 85,741            $ 72,880               -18%
  Investigate the outliers.
  Accuracy can be very high, in the range of 90% to 98%, depending on the historical data used.
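The variance column can be reproduced with a short calculation. The slide does not state the formula; (actual - predicted) / actual is assumed here because it matches all five rows after rounding. The 10% tolerance for calling a row an outlier is a hypothetical choice.

```python
# (predicted premium, actual premium paid), taken from the table above.
rows = [
    (10254, 9946),
    (25687, 26971),
    (5621, 5452),
    (96321, 98247),
    (85741, 72880),
]
TOLERANCE = 0.10  # hypothetical cutoff: flag variances beyond +/-10%


def variance(predicted, actual):
    """Assumed formula: difference between actual and predicted, relative to actual."""
    return (actual - predicted) / actual


percentages = [round(variance(p, a) * 100) for p, a in rows]
outliers = [
    (p, a, round(variance(p, a) * 100))
    for p, a in rows
    if abs(variance(p, a)) > TOLERANCE
]

print(percentages)
print(outliers)
```

Only the last row falls outside the tolerance, so it is the one to investigate.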
So What is a Machine Learning System?
  ML Mathematical Cores:
  o Regression
  o K-Means
  o Bayesian Classifiers
  o Decision Trees: CART / CHAID
  o Support Vector Machines
  o Artificial Neural Nets (ANN)
  o Genetic Programs
  Systems (very partial list):
  o Advanced CAATs: Pivotal, Oversight (as a service), EMC
  o Proprietary General Purpose: SAS, IBM SPSS, RapidMiner
  o Open Source / Do It Yourself: PSPP, Weka, R, Python
4th Wave: Big Data Analytics
  Big Data Analytics:
  o Addresses new concerns regarding social media and other risks from text- and image-based sources.
  o Continues to improve the accuracy of predictive analytics, further reducing false positives and false negatives.
  o Allows true continuous audit of even the largest enterprises as computation costs drop to fractions of previous investments.
  [Timeline graphic: 1st Wave: personal computers, electronic spreadsheets, the end of hand calculation. 2nd Wave: ERPs, data in one place, database analysis, Age of Rules. 3rd Wave: predictive analytics, statistical insights, true predictive and continuous audit.]
Got Big Data?
  Volume
  o High: terabytes or petabytes
  o Very long retrieval and processing times
  Variety
  o Structured
  o Unstructured
  o Semistructured
  o All at once
  Velocity
  o Batch
  o Near time
  o Real time
  o Streams
It's Really About Big Data Technology
  o Search & retrieve
  o The database
  Source: EMC
What are Big Data Analytics?
  1st: The haystack gets a lot bigger
  o Traditional structured data
  o Unstructured data: documents, email, web content, social media
  2nd: Thanks to Hadoop and massively parallel processing
  o Query and retrieval times are short
  o Cost of even massive storage is very low
  3rd: Many predictive modeling techniques can also be applied to structured and unstructured data
  o Models become more accurate
  4th: New techniques for unstructured data based on NLP
  o Sentiment analysis
Focus on Social Media Risks*
  *Risk also arises from other types of unstructured and semi-structured data:
  o Email
  o Internal documents
  o Images stored centrally or on users' machines
Social Media Risks
  [Bar chart: ten values (7.3, 6.9, 6.6, 6.1, 5.6, 5.5, 4.9, 4.9, 4.0, 2.9) on a 0 to 7 scale; category labels not recoverable.]
  Example posts:
  o "They gave me financial aid then I cancelled all my classes and kept the money"
  o "Sit in at the Chancellor's Office at 3:00"
  o "Joe sold me the answers to tomorrow's test"
  o "Can't believe how much I made on ebay today"
  o "I'll fix them. I put a virus on the lab computer."
  o "Professor X is such a perv"
  o "The instructor said I could make money after school fixing cars in the auto shop"
  o "I just downloaded a bunch of student financial data from the finance system"
  o "I found out they're cutting my budget. I'm going to the union before this gets out"
  o "Did you hear we're losing accreditation. Don't sign up next term."
  Source: 2014 Internal Audit Capabilities and Needs Survey Report, Protiviti
You Don't Need to be a Data Scientist, Just a Smart Tool User
  The Age of Smart CAATs
  [Timeline graphic: 1st Wave: personal computers, electronic spreadsheets, the end of hand calculation. 2nd Wave: ERPs, data in one place, database analysis, Age of Rules. 3rd Wave: predictive analytics, statistical insights, true predictive and continuous audit. 4th Wave: social media, text, image; improved accuracy; cost-effective continuous audit.]
Questions
  Contact Information
  Bill Vorhies
  President & Chief Data Scientist
  Data-Magnum
  Bill@Data-Magnum.com
  www.data-magnum.com
  818.257.2035
  "I shall find a way or make one." Admiral Robert Peary
  Big Data & Predictive Analytics