Fraud Detection with MATLAB
Ian McKenna, Ph.D.




© 2015 The MathWorks, Inc.

Agenda
- Introduction: Background on Fraud Detection
- Challenges: Knowing your Risk
- Overview of the MATLAB Solution
  - Connect to financial data sources
  - Calculate fraud indicators
  - Classify funds with machine learning
  - Generate reports & deploy applications
- Questions & Answers

Fraud Detection
Detecting when people intentionally act secretly to deprive another of something of value.
Types:
- Returns Forensics
- Linguistic Based Cues
http://nakedshorts.typepad.com/files/madoff_fairfieldsentry3x.pdf

Types of Fraud
- Corporate
  - Financial statement falsification
- Securities and commodities
  - Hedge fund returns manipulation
  - Stock markets manipulation, regulation compliance
- Healthcare
- Mortgage
- Identity theft (credit card)
- Insurance
- Mass marketing
- Asset forfeiture/money laundering

Hedge Fund Returns Manipulation
- More prone to fraud due to decreased regulation
- SEC stats indicate 1% misbehave
- Scenarios
  - Misbehavior: hedge fund managers have some discretion in valuing illiquid investments. Academics have devised methods to analyze and flag potentially manipulated fund returns.
  - Outright fraud: quantitative screening and dedicated algorithms can save a lot of time.

Return-Based Analysis
- The number of negative monthly returns is used to judge a manager's performance
- Managers can attract investors by misreporting returns
- Distortion is possible for returns at the manager's discretion
  - Illiquid assets, complex assets
- E.g. a discontinuity exists at zero but disappears if returns are computed bimonthly
Reference: Suspicious Patterns in Hedge Fund Returns and the Risk of Fraud. Bollen, Nicolas P.B. and Veronika K. Pool (2012). Review of Financial Studies 25, 2673-2702.

Returns Distribution Discontinuity
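The discontinuity idea can be illustrated with a small Python sketch (Python is used here only for illustration; it is not the MATLAB code from the presentation). It compares the count of returns in the bin just below zero with the average of its two neighbouring bins. The function name `zero_discontinuity`, the default bin width, and the Poisson-style standardisation are assumptions for this sketch, and it is a simplified stand-in rather than the exact test statistic of Bollen and Pool.

```python
import math


def zero_discontinuity(returns, bin_width=0.005):
    """Compare the histogram bin just below zero with its two neighbours.

    Returns (observed, expected, z), where `observed` is the count in
    [-bin_width, 0), `expected` is the mean count of the bins on either
    side, and `z` is a rough standardised difference.
    """
    def count(lo, hi):
        return sum(1 for r in returns if lo <= r < hi)

    left = count(-2 * bin_width, -bin_width)   # bin further below zero
    below = count(-bin_width, 0.0)             # bin just below zero
    above = count(0.0, bin_width)              # bin just above zero

    expected = (left + above) / 2.0
    # Assumption: treat the three counts as independent Poisson counts,
    # so Var(below - (left + above) / 2) is estimated by below + (left + above) / 4.
    variance = below + (left + above) / 4.0
    z = (below - expected) / math.sqrt(variance) if variance > 0 else 0.0
    return below, expected, z
```

A strongly negative `z` suggests that small losses are under-represented relative to neighbouring bins, which is the pattern the slide calls a discontinuity at zero.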

Benford's Law
Frequency distribution of leading digits in many real-life sources of data:
- Electricity bills
- Street addresses
- Stock prices
- Population numbers
- Death rates
- Physical and mathematical constants
- Processes described by power laws
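For reference, Benford's Law predicts that the leading digit d (1 to 9) occurs with probability log10(1 + 1/d). A minimal Python sketch of those expected frequencies follows; the function name `benford_probabilities` is chosen for illustration.

```python
import math


def benford_probabilities():
    """Expected probability of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, prob in benford_probabilities().items():
        print(f"{digit}: {prob:.3f}")
```

The digit 1 is expected roughly 30% of the time and the digit 9 less than 5% of the time, so a data set with evenly spread leading digits departs visibly from the law.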

Stock Market Returns First Digit Frequency
Source: Checking Financial Markets via Benford's Law, Marco Corazza, Andrea Ellero, and Alberto Zorzi


Challenges in Fraud Detection
- Cost/Economics
  - Most cases not fraud
  - Manual analysis
- Data
  - Huge data sets
  - Complex data types
  - Data integration
- Change
  - Evolutionary
  - Secrecy in detection methods

Challenges Faced During Model Development
Traditional Approach
- Off-the-shelf software
- In-house development with traditional languages
- Spreadsheets, Excel
- Combination of the above
Challenge
- Inability to work with custom and complex data
- Adapting requires long development times
- Limited data size
- Inefficiencies in Integration & Automation

Computational Finance Workflow
- Access: Files, Databases, Datafeeds
- Research and Quantify: Data Analysis & Visualization, Financial Modeling, Application Development
- Share: Reporting, Applications, Production
- Automate

The Desired Report
Three funds to analyze and report:
- Gateway Fund
- American Funds Growth Fund
- Fairfield Sentry (known fraudulent Madoff fund)


Implemented Methods: Returns Based
- Returns distribution and discontinuity at 0
  Check for a discontinuity at 0 in the distribution of monthly returns.
- Low correlation with other assets
  Regress fund returns on the combination of style factors that maximizes the explanatory power of the analysis.
- Unconditional serial correlation
  Check whether monthly returns are serially correlated, i.e. correlated with their previous-month value. Managers investing in illiquid securities with no end-of-month quoted price may smooth their returns relative to all available market information.
- Conditional serial correlation
  Using the optimal factor model constructed for the low-correlation check, test whether serial correlation occurs especially after a down month (i.e. when a suspicious manager has the highest incentive to "catch up").
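The two serial-correlation checks can be sketched in Python as follows (illustrative only, not the presentation's MATLAB code). The function names are hypothetical. As a simplifying assumption, the conditional version splits months by the sign of the fund's own previous return instead of using the factor model described on the slide.

```python
import math


def lag1_autocorrelation(returns):
    """Unconditional check: lag-1 sample autocorrelation of the return series."""
    n = len(returns)
    mean = sum(returns) / n
    num = sum((returns[t] - mean) * (returns[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in returns)
    return num / den if den > 0 else 0.0


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit of y on x (NaN if undefined)."""
    n = len(x)
    if n < 2:
        return math.nan
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return math.nan
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def conditional_serial_correlation(returns):
    """Slope of r_t on r_{t-1}, separately after down months and after up months.

    Assumption: "down" means the previous raw return was negative; the slide
    conditions on a factor model instead.
    """
    pairs = list(zip(returns[:-1], returns[1:]))
    down = [(prev, cur) for prev, cur in pairs if prev < 0]
    up = [(prev, cur) for prev, cur in pairs if prev >= 0]
    slope_down = ols_slope([p for p, _ in down], [c for _, c in down])
    slope_up = ols_slope([p for p, _ in up], [c for _, c in up])
    return slope_down, slope_up
```

Comparing the two slopes shows whether persistence differs after losing months; a real analysis would also need standard errors before drawing any conclusion.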

Implemented Methods: Returns Based
- Number of returns equal to 0
  Calculate the theoretical number of zero returns, using the cumulative distribution function and binomial coefficients, for a time series with the same characteristics (average return and variance) as the fund. Then compare that number with the actual count.
- Number of negative returns
  Calculate the theoretical number of negative returns as above. Then compare that number with the actual count.
- Number of unique returns / length of identical recurring series
  Calculate the theoretical value of each pattern. The number of unique returns is the count of distinct values in the time series, and the length of an identical series is the number of consecutive observations that are identical. Then compare these theoretical values with the actual counts.
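A minimal Python sketch of the negative-returns check follows (illustrative only). It assumes returns are normally distributed with the fund's sample mean and standard deviation, which is one reading of "the same characteristics" on the slide; the function names are hypothetical. The same pattern applies to zero returns if the probability of a negative return is replaced by the probability of a return rounding to zero.

```python
import math
from statistics import NormalDist, mean, stdev


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed from binomial coefficients."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def negative_return_check(returns):
    """Compare the observed number of negative returns with its expectation.

    Returns (observed, expected, tail_prob), where tail_prob is the binomial
    probability of seeing `observed` or fewer negative months.
    Requires at least two returns with non-zero sample variance.
    """
    n = len(returns)
    # Assumption: a normal model with the fund's own mean and variance.
    model = NormalDist(mean(returns), stdev(returns))
    p_negative = model.cdf(0.0)
    observed = sum(1 for r in returns if r < 0)
    expected = n * p_negative
    return observed, expected, binomial_cdf(observed, n, p_negative)
```

A very small `tail_prob` means the fund reports far fewer losing months than its own mean and volatility would imply.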

Implemented Methods: Returns Based
- Sample distribution of the last digit
  Check with a goodness-of-fit test whether the last digits of the returns are uniformly distributed.
- Sample distribution of the first digit
  Check with a goodness-of-fit test whether the first digits of the returns follow Benford's Law.
- Supervised classification methods
  Using machine learning tools (such as neural networks and other classification methods), train a model to identify potential fraudsters. The input variables consist of all the indicators described so far, computed for previously identified fraudulent and non-fraudulent funds. Apply the fitted model to a new fund to obtain its classification.
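The two digit tests can be sketched with a Pearson chi-square goodness-of-fit statistic, shown here in Python for illustration. The slide does not say which goodness-of-fit test was used, so chi-square is an assumption, as are the function names, the 5% significance level, and the choice to read the last digit after rounding to four decimal places.

```python
import math

# Chi-square critical values at the 5% level for 8 and 9 degrees of freedom.
CHI2_CRITICAL_5PCT = {8: 15.507, 9: 16.919}


def first_digit(x):
    """First significant digit of a non-zero number."""
    return int(f"{abs(x):.10e}"[0])  # e.g. 0.0314 -> '3.1400000000e-02' -> 3


def last_digit(x, decimals=4):
    """Last digit of x after rounding to `decimals` decimal places."""
    return int(round(abs(x) * 10**decimals)) % 10


def chi_square(observed, expected):
    """Pearson chi-square statistic for matching lists of counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def benford_first_digit_test(values):
    """Return (statistic, rejected) for first digits against Benford's Law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    observed = [digits.count(d) for d in range(1, 10)]
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    stat = chi_square(observed, expected)
    return stat, stat > CHI2_CRITICAL_5PCT[8]   # 9 categories -> 8 d.o.f.


def uniform_last_digit_test(values, decimals=4):
    """Return (statistic, rejected) for last digits against a uniform law."""
    digits = [last_digit(v, decimals) for v in values]
    n = len(digits)
    observed = [digits.count(d) for d in range(10)]
    expected = [n / 10] * 10
    stat = chi_square(observed, expected)
    return stat, stat > CHI2_CRITICAL_5PCT[9]   # 10 categories -> 9 d.o.f.
```

With short return histories the expected counts per digit are small, so the chi-square approximation is rough and a rejection should be treated as a flag for review, not as proof.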

Text Based Indicators
- Idea from published research in criminal investigation
- Hypothesis: deceptive senders display
  - Higher quantity
  - Higher expressivity
  - Higher informality
  - Higher uncertainty
  - Higher nonimmediacy
  - Lower complexity
  - Lower diversity
  - Lower specificity
Reference: Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication. Lina Zhou (Department of Information Systems, University of Maryland, Baltimore County, MD, USA); Judee K. Burgoon, Jay F. Nunamaker, Jr. and Doug Twitchell (Center for the Management of Information, University of Arizona, Tucson, AZ, USA). Group Decision and Negotiation 13: 81-106, 2004.

Implemented Methods: Text Based
- Measure of Complexity
  - Average number of statements (average concepts per sentence)
  - Average sentence length (average complexity of structures)
  - Vocabulary complexity (average word length)
- Measure of Uncertainty
  - Average use of modifiers (number of adjectives/adverbs per sentence)
  - Average reference to others (number of "he", "they", ...)
- Measure of Expressivity
  - Emotiveness (number of adjectives compared to nouns)
- Measure of Diversity
  - Lexical diversity (number of unique words)
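The measures that do not need part-of-speech tags can be sketched in a few lines of Python (illustrative only). The function name `text_indicators`, the regular-expression tokenisation, and reporting lexical diversity as unique words divided by total words are assumptions for this sketch. The modifier and emotiveness measures require a part-of-speech tagger and are not covered.

```python
import re


def text_indicators(text):
    """Compute simple complexity and diversity cues for a block of text."""
    # Assumption: sentences end with '.', '!' or '?'; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),       # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unique_words": len(set(words)),
        "lexical_diversity": len(set(words)) / len(words),        # unique / total
    }
```

For example, `text_indicators("The fund is safe. The fund is very safe!")` reports 4.5 words per sentence and 5 unique words out of 9.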

Classifying Words
- Java POS Tagger
- Reference online dictionary
- Only a few lines of code

Comparison: American Growth Fund

Comparison: Madoff

Next Steps: Machine Learning with MATLAB
To learn more, visit: www.mathworks.com/machine-learning
- Basket Selection using Stepwise Regression
- Classification in the presence of missing data
- Regression with Boosted Decision Trees
- Hierarchical Clustering

MATLAB Solutions
Traditional Approach
- Off-the-shelf software
- In-house development with traditional languages
- Spreadsheets, Excel
- Combination of the above
Challenge
- Inability to work with custom and complex data
- Adapting requires long development times
- Limited data size
- Inefficiencies in Integration & Automation
Solution
- Flexible Work Rapid P Advan Work w Datab Easy to Autom

Financial Modeling Workflow
- Access: Files, Databases, Datafeeds
- Research and Quantify: Data Analysis and Visualization, Financial Modeling, Application Development
- Share: Reporting, Applications, Production
- Spreadsheet Link EX, Database, Datafeed, Trading
- Financial Instruments, Statistics & Machine Learning, Financial, Econometrics, Optimization
- Report Generator, Production Server, MATLAB Compiler SDK, MATLAB Compiler
- MATLAB, Parallel Computing, MATLAB Distributed Computing Server

Q&A