Fraud Detection with MATLAB
Ian McKenna, Ph.D.
© 2015 The MathWorks, Inc.
Agenda
- Introduction: Background on Fraud Detection
- Challenges: Knowing your Risk
- Overview of the MATLAB Solution
- Connect to financial data sources
- Calculate fraud indicators
- Classify funds with machine learning
- Generate reports & deploy applications
- Questions & Answers
Fraud Detection
- Detecting when people intentionally act secretly to deprive another of something of value
- Types: Returns Forensics Linguistic Based Cues
- http://nakedshorts.typepad.com/files/madoff_fairfieldsentry3x.pdf
Types of Fraud
- Corporate
  - Financial statement falsification
- Securities and commodities
  - Hedge fund returns manipulation
  - Stock markets manipulation, regulation compliance
- Healthcare
- Mortgage
- Identity theft (credit card)
- Insurance
- Mass marketing
- Asset forfeiture/money laundering
Hedge Fund Returns Manipulation
- More prone to fraud due to decreased regulation
- SEC stats indicate 1% misbehave
- Scenarios
  - Misbehavior: hedge fund managers have some discretion in valuing illiquid investments. Academics have devised methods to analyze and flag potentially manipulated fund returns.
  - Outright fraud: quantitative screening and the use of dedicated algorithms can save a lot of time.
Return-Based Analysis
- The number of negative monthly returns is used to judge a manager's performance
- Managers can attract investors by misreporting returns
- Distortion is possible for returns valued at the manager's discretion
  - Illiquid assets, complex assets
- E.g. a discontinuity exists at zero but disappears if returns are computed bimonthly
Source: Suspicious Patterns in Hedge Fund Returns and the Risk of Fraud. Bollen, Nicolas P.B. and Veronika K. Pool (2012). Review of Financial Studies 25, 2673-2702.
Returns Distribution Discontinuity
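The discontinuity check can be illustrated with a short Python sketch. It is a simplified stand-in for the Bollen and Pool test, not the code used in the talk: it assumes monthly returns expressed as decimals and a hypothetical bin width of 0.5%, and it only compares the count of small losses with the average of the two neighbouring histogram bins. The function name `zero_discontinuity` is illustrative.

```python
def zero_discontinuity(returns, bin_width=0.005):
    """Compare the bin just below zero with the mean of its two neighbours.

    returns   : monthly returns as decimals (assumption)
    bin_width : histogram bin width (hypothetical default of 0.5%)
    """
    def count(lo, hi):
        # Number of observations in the half-open interval [lo, hi)
        return sum(1 for r in returns if lo <= r < hi)

    below = count(-bin_width, 0.0)             # small losses
    left = count(-2 * bin_width, -bin_width)   # next bin of larger losses
    right = count(0.0, bin_width)              # small gains
    expected = (left + right) / 2.0            # smooth-histogram benchmark
    return below, expected, below - expected


if __name__ == "__main__":
    sample = [-0.008, -0.007, -0.002, 0.001, 0.002, 0.003, 0.004]
    observed, expected, gap = zero_discontinuity(sample)
    print(observed, expected, gap)
```

A strongly negative gap means fewer small losses than a smooth distribution would suggest, which is the pattern the slide describes. A real test would also need a standard error for the gap.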
Benford's Law
Frequency distribution of digits in many real-life sources of data:
- Electricity bills
- Street addresses
- Stock prices
- Population numbers
- Death rates
- Physical and mathematical constants
- Processes described by power laws
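For reference, Benford's Law gives the probability of leading digit d as log10(1 + 1/d). The Python sketch below simply tabulates those expected frequencies; the function name `benford_probabilities` is illustrative and not part of any library.

```python
import math


def benford_probabilities():
    """Expected first-digit frequencies under Benford's Law, for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, prob in benford_probabilities().items():
        print(f"{digit}: {prob:.3f}")
```

The digit 1 is expected to lead about 30% of the time and the probabilities decline monotonically towards 9, so first digits that look evenly spread are themselves a warning sign.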
Stock Market Returns First Digit Frequency
Source: Checking Financial markets via Benford's law, Marco Corazza, Andrea Ellero, and Alberto Zorzi
Challenges in Fraud Detection
- Cost/Economics
  - Most cases not fraud
  - Manual analysis
- Data
  - Huge data sets
  - Complex data types
  - Data integration
- Change
  - Evolutionary
  - Secrecy in detection methods
Challenges Faced During Model Development
Traditional Approach
- Off-the-shelf software
- In-house development with traditional languages
- Spreadsheets, Excel
- Combination of the above
Challenge
- Inability to work with custom and complex data
- Adapting requires long development times
- Limited data size
- Inefficiencies in Integration & Automation
Computational Finance Workflow
- Access: Files, Databases, Datafeeds
- Research and Quantify: Data Analysis & Visualization, Financial Modeling, Application Development
- Share: Reporting, Applications, Production
- Automate
The Desired Report
Three funds to analyze and report:
- Gateway Fund
- American Funds Growth Fund
- Fairfield Sentry (known fraudulent Madoff fund)
Implemented Methods: Returns Based
- Returns distribution and discontinuity at 0: Check for a discontinuity at 0 in the distribution of monthly returns.
- Low correlation with other assets: Regress fund returns on the combination of style factors that maximizes the explanatory power of the analysis.
- Unconditional serial correlation: Check whether monthly returns are serially correlated, i.e. correlated with the previous month's value. Managers investing in illiquid securities with no end-of-month quoted price may smooth their returns relative to all available market information.
- Conditional serial correlation: Using the optimal factor model constructed for the low-correlation check, test for serial correlation that occurs especially after a down month (i.e. when a suspicious manager has the highest incentive to "catch up").
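The two serial-correlation checks can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the talk's implementation: the conditional version works on raw returns rather than on the residuals of a fitted style-factor model, and all function names are hypothetical.

```python
def lag1_autocorrelation(returns):
    """Unconditional check: sample correlation of each return with the previous one."""
    n = len(returns)
    mean = sum(returns) / n
    num = sum((returns[t] - mean) * (returns[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in returns)
    return num / den


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit, or None if it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        return None
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx


def conditional_slopes(returns):
    """Conditional check (simplified): slope of r[t] on r[t-1], split by the sign of r[t-1]."""
    pairs = list(zip(returns[:-1], returns[1:]))
    down = [(prev, cur) for prev, cur in pairs if prev < 0]
    up = [(prev, cur) for prev, cur in pairs if prev >= 0]
    slope_down = ols_slope([p for p, _ in down], [c for _, c in down])
    slope_up = ols_slope([p for p, _ in up], [c for _, c in up])
    return slope_down, slope_up
```

A clearly positive lag-1 autocorrelation is consistent with return smoothing. Comparing the slope after down months with the slope after up months gives a rough view of whether the dependence differs following losses.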
Implemented Methods: Returns Based
- Number of returns equal to 0: Calculate the theoretical number of returns equal to 0, using the cumulative distribution function and binomial coefficients, for a time series with the same characteristics (average return and variance) as the fund. Then compare that number with the actual count.
- Number of negative returns: Calculate the theoretical number of negative returns as above. Then compare that number with the actual count.
- Number of unique returns/length of identical recurring series: Calculate the theoretical number for each pattern. Unique returns is the number of distinct values in the time series, and length of identical series is the number of consecutive observations that are identical. Then compare these theoretical numbers with the actual counts.
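A minimal Python sketch of the negative-returns comparison is shown below. It assumes returns are normally distributed with the fund's own sample mean and standard deviation, which is one way to read "the same characteristics" and may differ from the talk's implementation. The function names are illustrative.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def binomial_cdf(k, n, p):
    """Probability of at most k successes in n independent trials."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def negative_return_check(returns):
    """Compare the observed number of negative months with the theoretical one."""
    n = len(returns)
    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / (n - 1))
    p_negative = normal_cdf(0.0, mean, std)   # chance that one month is negative
    observed = sum(1 for r in returns if r < 0)
    expected = n * p_negative
    # Probability of seeing this few negative months (or fewer) by chance
    p_value = binomial_cdf(observed, n, p_negative)
    return observed, expected, p_value
```

A very small `p_value` indicates far fewer losing months than the fund's own mean and volatility would imply. The same pattern applies to returns equal to 0 once a probability for a reported zero has been chosen, for example from the rounding precision of the reported figures.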
Implemented Methods: Returns Based
- Sample distribution of the last digit: Check with a goodness-of-fit test whether the last digit of the returns is uniformly distributed.
- Sample distribution of the first digit: Check with a goodness-of-fit test whether the first digit of the returns follows Benford's Law.
- Supervised classification methods: Using machine learning tools (such as neural networks and other classification methods), train a model to identify potential fraudsters. The input variables consist of all the indicators described so far, computed for funds previously identified as fraudulent or non-fraudulent. Apply the fitted model to a new fund to obtain its classification.
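The two digit checks above can be sketched in Python with a Pearson chi-square statistic. The slides do not say which goodness-of-fit test was used, so chi-square is an assumption here, as is the convention that returns are reported in percent with two decimals. The function names are illustrative. The constants are the standard 5% chi-square critical values for 9 and 8 degrees of freedom.

```python
import math

CHI2_CRIT_5PCT_DF9 = 16.919   # last digit: 10 categories
CHI2_CRIT_5PCT_DF8 = 15.507   # first digit: 9 categories


def chi_square_stat(observed_counts, expected_probs):
    """Pearson chi-square statistic for observed counts against expected probabilities."""
    n = sum(observed_counts)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed_counts, expected_probs))


def last_digit_counts(returns_pct, decimals=2):
    """Count the last reported digit of each return (reporting precision is assumed)."""
    counts = [0] * 10
    for r in returns_pct:
        counts[int(f"{abs(r):.{decimals}f}"[-1])] += 1
    return counts


def first_digit_counts(returns_pct):
    """Count the leading non-zero digit of each non-zero return."""
    counts = [0] * 9
    for r in returns_pct:
        if r != 0:
            counts[int(f"{abs(r):e}"[0]) - 1] += 1
    return counts


def digit_tests(returns_pct):
    """Return (last-digit statistic, first-digit statistic)."""
    uniform = [0.1] * 10
    benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
    last_stat = chi_square_stat(last_digit_counts(returns_pct), uniform)
    first_stat = chi_square_stat(first_digit_counts(returns_pct), benford)
    return last_stat, first_stat
```

A statistic above the matching critical value rejects the uniform (last digit) or Benford (first digit) hypothesis at the 5% level. With short return histories the expected counts per digit are small, so the result should be read as a flag rather than proof.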
Text Based Indicators
- Idea from published research in criminal investigation
- Hypothesis: deceptive senders display
  - Higher quantity
  - Higher expressivity
  - Higher informality
  - Higher uncertainty
  - Higher nonimmediacy
  - Lower complexity
  - Lower diversity
  - Lower specificity
Source: Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication. Lina Zhou, Department of Information Systems, University of Maryland, Baltimore County, MD, USA. Judee K. Burgoon, Jay F. Nunamaker, Jr. and Doug Twitchell, Center for the Management of Information, University of Arizona, Tucson, AZ, USA. Group Decision and Negotiation 13: 81-106, 2004.
Implemented Methods: Text Based
- Measure of Complexity
  - Average number of statements (average concepts per sentence)
  - Average sentence length (average complexity of structures)
  - Vocabulary complexity (average word length)
- Measure of Uncertainty
  - Average use of modifiers (number of adjectives/adverbs per sentence)
  - Average reference to others (number of "he", "they", ...)
- Measure of Expressivity
  - Emotiveness (number of adjectives compared to nouns)
- Measure of Diversity
  - Lexical diversity (number of unique words)
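Several of these measures need no part-of-speech tagging and can be sketched in a few lines of Python. The tokenization rules and the `OTHER_REFERENCES` pronoun set below are assumptions made for illustration; the slide names only "he" and "they". The modifier and emotiveness measures are left out because they require a tagger.

```python
import re

# Hypothetical pronoun set; the slide lists only "he" and "they" as examples
OTHER_REFERENCES = {"he", "she", "they", "him", "her", "them"}


def text_indicators(text):
    """Compute simple complexity, uncertainty and diversity cues for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return {
        # Complexity: average sentence length, in words
        "avg_sentence_length": len(words) / len(sentences),
        # Complexity: vocabulary complexity as average word length
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Uncertainty: references to others, per sentence
        "other_refs_per_sentence":
            sum(1 for w in words if w in OTHER_REFERENCES) / len(sentences),
        # Diversity: number of unique words, as defined on the slide
        "unique_words": len(set(words)),
    }
```

For example, `text_indicators("They sold the fund. He sold the fund quickly!")` reports 4.5 words per sentence, one reference to others per sentence and 6 unique words.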
Classifying Words
- Java POS Tagger
- Reference online dictionary
- Only a few lines of code
Comparison: American Growth Fund
Comparison: Madoff
Next Steps: Machine Learning with MATLAB
To learn more, visit: www.mathworks.com/machine-learning
- Basket Selection using Stepwise Regression
- Classification in the presence of missing data
- Regression with Boosted Decision Trees
- Hierarchical Clustering
MATLAB Solutions
Traditional Approach
- Off-the-shelf software
- In-house development with traditional languages
- Spreadsheets, Excel
- Combination of the above
Challenge
- Inability to work with custom and complex data
- Adapting requires long development times
- Limited data size
- Inefficiencies in Integration & Automation
Solution
Flexible Work Rapid P Advan Work w Datab Easy to Autom
Financial Modeling Workflow
- Access: Files, Databases, Datafeeds
- Research and Quantify: Data Analysis and Visualization, Financial Modeling, Application Development
- Share: Reporting, Applications, Production
- Products: Spreadsheet Link EX, Database, Datafeed, Trading, Financial Instruments, Statistics & Machine Learning, Financial, Econometrics, Optimization, Report Generator, Production Server, MATLAB Compiler SDK, MATLAB Compiler, MATLAB, Parallel Computing, MATLAB Distributed Computing Server
Q&A