Case Study: Detecting Credit Card Fraud
Analysis of Behaviometrics in an Online Payment Environment
Introduction

BehavioSec has been conducting tests on Behaviometrics collected from card payments at a Payment Service Provider (PSP). The live data was recorded when payers entered the standard credit card number, name and additional information. The request came from a customer based in the Nordics looking for additional technologies to enhance its Web Fraud Detection (WFD) offering. In a card-not-present situation, Behaviometrics addresses what existing fraud checks have failed to identify: the human behind the payment, and whether or not it is the right cardholder conducting the transaction. BehavioSec supplied the customer with BehavioWeb to integrate into an existing customer's payment page.

The merchant collected Behaviometrics from 2371 individuals, who generated four transactions each on average, with total records reaching 9500. Nearly all users had two or more data records, which is the minimum number of records required to build a behavioral profile and perform a test. One fifth of the users had five or more transaction records. Five transactions is a good trade-off between learning time and the accuracy of the investigation, and is also the average number of transactions conducted in one month by Internet bankers. The system accurately detects the payer 4.5 out of 5 times just by the way the person types their card information. The system becomes better over time: after around 20 transactions, equal to about 4 months of usage, it reaches 97% accuracy.

Card payers have flexibility in where to spend their money and complete transactions at other services before returning to the merchant. Therefore not only identification of the correct user is of interest, but also detection of suspicious usage, to spot fraudsters moving between different accounts. In detection mode the system was able to reach 87% accuracy in recognizing a person attempting to use another person's card.
The results clearly show that users differ in how they interact with the merchant's check-out page and enter payment information, and that it is possible to combat fraud by analyzing user behavior.

[Figure: chart with a percentage axis (0% to 100%) and an axis running from 0.00 to 0.90]
Contents

1 Background
1.1 Intended audience
2 Definitions
3 How the results are calculated
3.1 Overlapping behavioral patterns
3.2 Measuring the accuracy
3.3 Cross examination to simulate impostors
4 Analysis
4.1 Dataset statistics
4.2 Observations and delimitations
5 Results and conclusion
5.1 Authentication
5.1.1 Accuracy based on empiric data (for authentication purposes)
5.1.2 Usability analysis
5.1.3 List of suspicious transactions
5.2 Investigation / Forensics
5.2.1 Usability analysis
5.3 Conclusions
6 Summary
7 Further reading
1 Background

BehavioSec is the innovator in Continuous Verification of end users through Behaviometrics (behavioral biometrics). Our online offering, BehavioWeb, is a solution that monitors and analyzes behavior based on interactions with a web page to enhance trustworthy communications. By timing each key press and analyzing the timing deltas to the subsequent key action (up and down) for each key pair, the software builds up a profile of the user that is used to detect consistency. Through this analysis the software collects Behaviometrics of the user's normal usage patterns from a small amount of statistical data on any transaction. The server-side software performs a risk analysis on the data and returns a score that expresses the similarity to the correct user. By looking at the user's various Behaviometrics the software can determine the transactional risk level, send alarms to alert investigators and, if existing infrastructure is in place, take steps to prevent fraudulent usage by requesting additional authentication. A detailed forensic trail of the events and a comparison against specific identified fraud profiles are presented in the management dashboard to allow thorough investigations and speed up fraud case management.

BehavioWeb evaluates an individual's typing behavior against their own history and that of all other individuals. The software constantly adapts to changes in the end user's behavior and updates its risk evaluations without manual configuration.

The purpose of this document is to illustrate how BehavioWeb would perform in a live payment environment.

1.1 Intended audience

This report is designed for people responsible for e-commerce Payment, Risk Assessment & Management, System Design, Fraud Management and Transaction Monitoring, as well as IT and/or Security personnel. This document does not require specific technology knowledge, but it refers to many concepts without explaining the terminology.
These terms are used in their industry-standard meaning, and their definitions can be found in various sources, including the definition list in this document.
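The key-pair timing described in Section 1 can be illustrated with a minimal sketch. It assumes a hypothetical event format of (key, action, time in milliseconds) and computes two common keystroke features: how long each key is held, and the delay between consecutive key presses for each key pair. The function name and event format are illustrative assumptions, not BehavioWeb's actual data model.

```python
from collections import defaultdict


def keypair_timings(events):
    """Derive simple timing features from a list of key events.

    events: iterable of (key, action, t_ms) tuples in time order, where
    action is "down" or "up". This format is a hypothetical example.
    Returns (dwell, latency):
      dwell[key]         -> list of hold times (key-down to key-up)
      latency[(k1, k2)]  -> list of delays from k1 key-down to k2 key-down
    """
    pressed_at = {}
    dwell = defaultdict(list)
    latency = defaultdict(list)
    previous = None  # (key, press time) of the previous key-down event

    for key, action, t in events:
        if action == "down":
            if previous is not None:
                prev_key, prev_t = previous
                latency[(prev_key, key)].append(t - prev_t)
            previous = (key, t)
            pressed_at[key] = t
        elif action == "up" and key in pressed_at:
            dwell[key].append(t - pressed_at.pop(key))
    return dwell, latency


# Illustrative input: the user types "and".
events = [
    ("a", "down", 0), ("a", "up", 80),
    ("n", "down", 150), ("n", "up", 230),
    ("d", "down", 300), ("d", "up", 370),
]
dwell, latency = keypair_timings(events)
```

A profile could then be built by aggregating these lists (for example, mean and spread per key pair) over several samples; how BehavioWeb aggregates them is not stated in this document.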
2 Definitions

Record/Sample: A record/sample is the blob of behavioral data that is collected when typing in a text field.

Profile: The profile is much like a fingerprint of the behavior, which is unique for each individual user. The fingerprint is built by collecting and analyzing samples.

Insertion/Update: Insertions/updates measure how many times a profile has been updated with data from new samples.

Score: When comparing a collected sample against a profile, a score between 0.0 and 1.0 is calculated. The higher the score, the more probable it is that the sample comes from the correct person.

Threshold: A threshold can be used to separate the impostor from the correct user and has a direct link to the False Accept Ratio (FAR) and False Reject Ratio (FRR). If the score is above the threshold, the sample is considered to come from the correct user; if the score is below the threshold, it is considered to come from an impostor. The threshold can be set anywhere between 0.0 and 1.0.

False Accept Ratio (FAR): The statistical ratio (%) of samples that incorrectly score above the threshold, i.e. the percentage of patterns that we know belong to an incorrect user and that are falsely accepted as the correct user. A high threshold makes it less likely for incorrect samples to be accepted.

False Reject Ratio (FRR): The statistical ratio (%) of samples that incorrectly score below the threshold, i.e. the percentage of patterns that we know belong to the correct user and that are falsely rejected. A low threshold makes it less likely for correct samples to be rejected.

Equal Error Rate (EER): The point (threshold) at which the curves for FAR and FRR intersect, i.e. the point at which FAR and FRR are equal. It is used to determine the accuracy of a system.
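The FAR and FRR definitions above translate directly into a few lines of code. The following is a minimal sketch with made-up scores; it assumes that a score exactly equal to the threshold counts as accepted, which the definitions leave open.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) in percent for one threshold.

    genuine_scores:  scores of samples known to come from the correct user
    impostor_scores: scores of samples known to come from someone else
    Assumption: score >= threshold is accepted, score < threshold is rejected.
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    frr = 100.0 * false_rejects / len(genuine_scores)
    far = 100.0 * false_accepts / len(impostor_scores)
    return far, frr


# Illustrative scores only, not taken from the dataset.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.3, 0.6, 0.2]

far_mid, frr_mid = far_frr(genuine, impostor, 0.5)     # one error of each kind
far_high, frr_high = far_frr(genuine, impostor, 0.75)  # stricter threshold
```

Raising the threshold from 0.5 to 0.75 removes the false accept in this toy data but rejects a second genuine sample, which is the trade-off the definitions describe.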
3 How the results are calculated

Biometric systems generally separate impostors from the correct user by matching a score against a threshold. The score expresses how similar a sample and a template are; the higher the score, the more similar they are. The threshold is a line: all scores above it are considered to come from the correct user, while all scores below it are considered to come from an impostor. Looking at the figure below, samples 1, 2 and 3 would be considered to be from the correct user while samples 4 and 5 would be considered impostors.

[Figure: scores (0 to 100) for Sample 1 to Sample 5]

The false accept rate (FAR) is the percentage of samples that are incorrectly accepted (a match between the input and a non-matching template). The false reject rate (FRR) is the percentage of samples that are incorrectly rejected (a failure to detect a match between the input and the matching template).

3.1 Overlapping behavioral patterns

In general, the matching algorithm makes a decision based on a threshold that determines how close to a template the input needs to be for it to be considered a match. If the threshold is reduced, there will be fewer false rejects but more false accepts. Correspondingly, a higher threshold will reduce the false accept rate but increase the false reject rate.

In some cases impostor patterns may generate scores that are higher than those of the user's own patterns, which leads to classification errors. Depending on the threshold, anywhere between none and all of the impostor patterns are falsely accepted by the system. The choice of threshold value is a problem if the score distributions of the correct user and the impostor overlap.
[Figure: frequency distributions of user scores and impostor scores; axes: Score, Frequency]

In theory, the correct users should always score higher than the impostors. A single threshold could then be used to separate the correct user from the impostors.

3.2 Measuring the accuracy

The Equal Error Rate (EER) indicates the accuracy of the system. The EER is found where the FAR and FRR curves intersect (the threshold level at which the FAR and FRR have the same value). The lower the EER, the more accurate the system is considered to be.

The relationship between false accepts and false rejects as a function of the threshold level is best described with a Receiver Operating Characteristic (ROC) curve. An ROC curve is a graphical representation of the trade-off between the false negative and false positive rates for every possible threshold level. If the threshold is reduced there will be fewer false rejects but more false accepts. A higher threshold will reduce the FAR but increase the FRR.

[Figure: Example ROC Curve showing FAR, FRR and EER; x-axis: Threshold level (0 to 100), y-axis: Accept / Reject Ratio (%)]

3.3 Cross examination to simulate impostors

To calculate the FRR we can simply compare samples from a user with that user's own profile and count all the false rejects. To calculate the FAR we need to simulate intrusion attempts; this is done by comparing a user's profile against records that we know belong to another user.
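The cross examination and EER calculation from Sections 3.2 and 3.3 can be sketched as follows. The scoring function here is a deliberately crude stand-in (a profile is a single mean latency in milliseconds), and all names and values are hypothetical; the real BehavioWeb scoring is not described in this document. The EER is approximated by sweeping thresholds and taking the point where FAR and FRR are closest.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR in percent; scores >= threshold count as accepted."""
    frr = 100.0 * sum(s < threshold for s in genuine) / len(genuine)
    far = 100.0 * sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr


def equal_error_rate(genuine, impostor, steps=100):
    """Sweep thresholds in [0, 1]; return (threshold, approximate EER in %)."""
    best = None
    for i in range(steps + 1):
        t = i / steps
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2.0)
    return best[1], best[2]


def cross_examine(profiles, samples, score):
    """Score every sample against every profile.

    A user's own samples give genuine scores (used for the FRR); samples
    known to belong to other users simulate intrusions (used for the FAR).
    """
    genuine, impostor = [], []
    for owner, profile in profiles.items():
        for user, user_samples in samples.items():
            target = genuine if user == owner else impostor
            target.extend(score(s, profile) for s in user_samples)
    return genuine, impostor


def toy_score(sample, profile):
    """Hypothetical similarity in [0, 1]; NOT the BehavioWeb algorithm."""
    return max(0.0, 1.0 - abs(sample - profile) / 100.0)


# Made-up users: each profile and sample is a mean key latency in ms.
profiles = {"anna": 120.0, "bo": 180.0, "cy": 150.0}
samples = {"anna": [118.0, 125.0], "bo": [176.0, 185.0], "cy": [149.0, 160.0]}

genuine, impostor = cross_examine(profiles, samples, toy_score)
threshold, eer = equal_error_rate(genuine, impostor)
```

In this toy data the genuine and impostor scores do not overlap, so a threshold exists with no errors of either kind. With overlapping distributions the sweep returns the threshold where the two error rates are closest.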
4 Analysis

Below is a summary of the dataset that has been analyzed. The distribution of records indicates the number of users that have exactly (==) the specified number of records, as well as the share of users that have at least (>=) that number of records.

4.1 Dataset statistics

Number of users:             2371
Number of records:           9736
Average number of samples:   4.10
Input fields:                CreditCardHolder, CreditCardNumber (anonymous), CreditCardCCV (anonymous)

Distribution of records

[Figure: share of users with at least the given number of records (1 to 24)]

# Records   # Users (%)      More or equal
1           15 (0.63%)       100.00%
2           1147 (48.38%)    99.37%
3           476 (20.08%)     50.99%
4           240 (10.12%)     30.92%
5           123 (5.19%)      20.79%
6           88 (3.71%)       15.61%
7           59 (2.49%)       11.89%
8           37 (1.56%)       9.41%
9           186 (7.85%)      7.84%
10          21 (0.81%)       5.82%
15          8 (0.34%)        2.83%
20          2 (0.08%)        1.69%
24+         26 (1.10%)       1.10%
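The "more or equal" column can be recomputed from the per-count rows. The sketch below uses only the rows for 1 to 8 records as printed in the table and the total of 2371 users; the function name is an illustrative assumption.

```python
TOTAL_USERS = 2371

# Users with exactly n records, as printed in the table for n = 1..8.
exact_counts = {1: 15, 2: 1147, 3: 476, 4: 240, 5: 123, 6: 88, 7: 59, 8: 37}


def at_least_share(exact_counts, total):
    """Return ({n: % of users with >= n records}, users left above max n)."""
    shares = {}
    remaining = total
    for n in sorted(exact_counts):
        shares[n] = round(100.0 * remaining / total, 2)
        remaining -= exact_counts[n]
    return shares, remaining


shares, users_with_9_or_more = at_least_share(exact_counts, TOTAL_USERS)
share_9_or_more = round(100.0 * users_with_9_or_more / TOTAL_USERS, 2)
```

The recomputed shares agree with the printed column for 1 to 8 records. The users left over after row 8 number 186, which coincides with the figure printed in the row for 9 records and gives the 7.84% listed there as the cumulative share.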
4.2 Observations and delimitations

99.37% of the users had two or more data records, which is the minimum number of records required to build a behavioral profile and perform a test.

20.79% of the users had 5 or more transaction records. 5 transactions is a good trade-off between learning time and the accuracy of the investigation.

To calculate the False Reject Rate (FRR) we assume that it is the correct person that has accessed the account.

The False Accept Rates (FAR) are for forensic/investigation mode (the ability to pick out the correct user from the entire user base based on the transaction record).

Profiles built over a longer period of time and over different input fields will be more complete (statistics on more key combinations), making investigation mode more accurate.
  o The accuracy of the investigation mode would greatly benefit from collecting keystroke records from more fields and forms.

It is not possible to calculate the False Accept Rate (FAR) for authentication purposes using this dataset because:
  o For example, the names Anders and Felix only have one common letter (e). Depending on the type of field and environment that Behavio is deployed in, this can negatively impact the results in investigation mode. For authentication it would be different: if Felix were to impersonate Anders and try to make a transaction as Anders, then Felix would enter Anders as his name, which would enable Behavio to compare the entire key sequence.
  o To achieve higher accuracy on anonymous fields the user has to type the same thing every time. If the user changes, for example, a password, then the profile should be cleared. This is linked to the situation above.
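The Anders and Felix observation can be checked with a short sketch that counts the letters and adjacent key pairs two typed strings have in common. It assumes, purely for illustration, that comparable timing data only exists for keys or key pairs that occur in both strings.

```python
def key_pairs(text):
    """Set of adjacent key pairs (digraphs) in the typed text, lowercased."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def overlap(a, b):
    """Return (shared letters, shared key pairs) between two typed strings."""
    letters = set(a.lower()) & set(b.lower())
    pairs = key_pairs(a) & key_pairs(b)
    return letters, pairs


shared_letters, shared_pairs = overlap("Anders", "Felix")
same_letters, same_pairs = overlap("Anders", "Anders")
```

The two names share the single letter e and no key pair at all, whereas an impostor typing the victim's name produces every key pair of the original, which is why the authentication case allows the entire key sequence to be compared.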
5 Results and conclusion

Below are the results from the dataset. The results are split into two different running modes to illustrate the different use scenarios and what can be expected from them.

5.1 Authentication

When running BehavioWeb in authentication mode the system compares the keystroke record collected during the transaction with the behavioral profile that is associated with the user (1:1 match).

5.1.1 Accuracy based on empiric data (for authentication purposes)

The following accuracy calculations are based on data where the users have been participating in a controlled test environment. The Updates column shows the training level of the behavioral profile and the second column shows the accuracy of BehavioWeb at that training level.

Updates   Accuracy (1-EER)
0         Not possible
1         ~70%
2         ~75%
3         ~80%
4         ~91%
5         ~92%
10        ~95%
20        ~97%

The table above shows that from the first profile update the accuracy of the system is 70%. After 5 updates the accuracy starts to level off, and it is fairly consistent at around 97% after 20 updates.

5.1.2 Usability analysis

To achieve accuracy over 90%, a training history of 5 transaction records is desirable, but already after 3 transactions we see a significant difference between users (with ~80% accuracy). Only 20.79% of the user base in the retrieved dataset fulfills the desirable amount, but over half of the users fulfill the 3-transaction threshold.
Training level   Accuracy   % of user base
1                ~70%       100%
2                ~75%       99.37%
3                ~80%       50.99%
4                ~91%       30.92%
5                ~92%       20.79%
10               ~95%       5.82%
20               ~97%       4.93%

5.1.3 List of suspicious transactions

Out of all transactions, a shortlist of suspicious transactions was presented, in which roughly 6% were marked for further investigation. The criterion for being listed is that the user has made at least 5 transactions and the transaction received a score below 10%.

5.2 Investigation / Forensics

When running BehavioWeb in investigation mode the system compares and ranks the results against a selected range of behavioral profiles (1:n match).

5.2.1 Usability analysis

Below is the RoC curve for investigation mode, illustrating the FAR and FRR over different threshold and training levels. It clearly shows that there is a significant difference between the users, and that it should be possible to single out the correct user from a bigger set by comparing a single keystroke record.

[Figure: RoC; x-axis: Threshold (0.00 to 0.90), y-axis: Ratio (0% to 100%)]

The accuracy of the system is the likelihood that the correct user comes out on top in an investigation. To analyze the accuracy, the Equal Error Rates (where the FAR and FRR intersect) are calculated for different training levels; the results are shown in the graphs below.
[Figure: Equal Error Rate (0% to 25%) by number of samples (2 to 25)]

With the current setup of one regular form field and two anonymized fields, the accuracy peaks just above 87% for investigation/forensic mode. This is achievable if the user has around 10 previous keystroke records from which BehavioWeb has learnt the behavior. This applies to 5.82% of the user base in the dataset.

[Figure: Accuracy (75% to 90%) by number of samples (2 to 25)]

99.37% of the user base has two or more records, which would guarantee that the minimum achievable accuracy is 80% across all users. Approximately 20% of the users would be able to achieve ~85% accuracy (based on 5 keystroke records).

5.3 Conclusions

For authentication/verification purposes, over 50% of the users would have ~80% accuracy with BehavioWeb, meaning that the system would classify the user correctly 80% of the time. Accuracy at 90% is desirable, and that would address ~31% of the dataset's user base. The optimal amount of training is 10 keystroke records, which results in over 95% accuracy (levelling off around 97%).

Using an approach that allows the user to try again after a failed verification, before the transaction is flagged as fraudulent, would increase the overall accuracy of the system (the false reject rate decreases exponentially with the number of attempts).

For investigation/forensics purposes it is possible, with the current setup, to reach 87% accuracy. Investigation mode would greatly benefit from not using anonymous fields and/or from collecting keystroke records from other forms/fields.

Since a lot of users only had one or two keystroke records in the dataset, collecting more data over a longer period would enable the higher accuracy levels for more users.
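The effect of allowing a retry can be made concrete with a small worked example. It rests on a strong simplifying assumption that this document does not verify: that successive attempts are statistically independent. Under that assumption a genuine user is rejected only if every attempt is rejected, while an impostor is accepted if any attempt is accepted. The 8% error rate used below is derived from the ~92% accuracy (1-EER) row at 5 updates.

```python
def with_retries(frr, far, attempts):
    """Error rates as fractions after allowing `attempts` tries.

    Assumes independent attempts (an idealization; a user's consecutive
    typing samples are likely correlated in practice).
    """
    frr_n = frr ** attempts                 # rejected on every attempt
    far_n = 1.0 - (1.0 - far) ** attempts   # accepted on at least one attempt
    return frr_n, far_n


# At the equal error rate, FAR == FRR == 1 - accuracy.
single_error = 1.0 - 0.92
frr_2, far_2 = with_retries(single_error, single_error, attempts=2)
```

With two attempts the false reject rate falls from 8% to about 0.64%, while the chance that an impostor is accepted at least once rises from 8% to about 15.4%, so the number of allowed retries is itself a trade-off.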
6 Summary

There is no silver bullet to solve the identity problem on the Internet. Concerned parties need to enlist every tool in their arsenal to stay ahead of fraud and identity attacks. To secure transactions one must implement the security pillars of something you have, something you know and something you are, to create a non-repudiable session. Our technology helps in such a multi-layered authentication approach. With Behaviometrics you can reach the trustworthiness of knowing that it is the correct user without having to sacrifice the comfort of using knowledge-based and strong authentication, i.e. a password and a hardware/software token.

Looking at behavior is not new. Card issuers commonly look for strange usage to determine risk. This manifests itself in cards being blocked when used in strange locations for odd purchases. Using our technology to determine risk is this approach applied to the Internet.

The technology also has applications in detecting human versus automated (bot) access, detecting multiple account registrations, and in forensics, where transactions determined to be fraudulent can be examined not only to establish that the user was not the correct one but also to determine who that user is likely to be. We can match transaction profiles against known fraudster profiles in a central database to help fraud case management and speed up investigations.

In comparison to traditional authentication and biometrics, which offer a one-off yes-or-no answer, a Behaviometric solution gives a similarity to the known behavior. Coupled with an existing risk engine's prediction of how accurate the scoring is based on multiple variables, it gives confidence in the identity of a user without impacting the end-user experience. Compare it to swiping a fingerprint whenever a transaction occurs, but without the hassle of additional hardware or requiring intrusive information.
7 Further reading

BehavioWeb (product sheet): http://www.behaviosec.com/images/behavioweb_product_sheet.pdf
BehavioWeb - A paradigm shift in internet security (whitepaper): http://www.behaviosec.com/images/behavioweb-whitepaper.pdf
Mouse Dynamics (whitepaper): http://www.behaviosec.com/images/mouse-dynamics_whitepaper.pdf
Behaviometrics - A paradigm shift in computer security (whitepaper): http://www.behaviosec.com/images/behaviometrics_whitepaper.pdf
Behavio Enterprise (product sheet): http://www.behaviosec.com/images/behavio_product_sheet.pdf