Recognizing The Theft of Identity Using Data Mining




Aniruddha Kshirsagar 1, Lalit Dole 2
1,2 CSE Department, GHRCE, Nagpur, Maharashtra, India

Abstract
Identity fraud is a major concern in e-commerce. It is more than a security issue; it is also a financial burden, because in both the transaction and application domains the culprit can cause serious harm to victims by misusing their private information for economic gain. Application fraud is a prominent example of identity fraud, in which the thief uses a victim's personal information to open a credit card account or obtain a loan. To counter this problem, a data-mining-based two-step recognition system is proposed. The system combines two algorithms: Communal Detection (CD), which checks for multi-attribute links, and Spike Detection (SD), which checks for single-attribute links. The CD algorithm targets communal relationships in the dataset, while the SD algorithm finds spikes among near-duplicate records. Together, the two algorithms detect identity theft in the credit application domain and can also expose several related attacks.

Keywords: Anomaly Detection, Application Domain, Data Stream, Identity Theft.

I. INTRODUCTION
Identity crime occurs when someone steals a victim's personal information to open credit card accounts or take out loans in the victim's name without authorization, and has products issued to those accounts. Identity crime is a form of unlawful identity change: it refers to unauthorized activities that use the identity of another person, or of a non-existent person, as the primary tool for procuring products. Identity crime can be committed with forged documents in two ways. The first is the issuance of a genuine identity document under a synthetic identity. A synthetic identity is assembled from several altered or forged identity documents, which are used to prove one's identity at the enrolment step; synthetic identity fraud is the act of creating such a fictitious identity to perpetrate criminal activities. The second is the illegitimate use of a genuine identity document belonging to a real person. Real identities are harder to obtain but easier to apply with successfully. In practice, identity crime is often committed with a mix of synthetic and real identity details. The manufacture and use of forged identity documents is a financial burden for governments, social welfare institutions, and financial institutions.

There are two financial domains in which identity crime is a major concern: Transaction and Credit Application. The transaction domain covers identity crime in online financial transactions made through credit cards or online banking; here the fraudster transacts online with the victim's credit card or bank details while posing as the victim. In the credit application domain, identity fraud occurs when someone applies for a credit card, mortgage, or home loan with false information. Fraud involving e-commerce activities such as credit card transactions poses significant problems for governments and businesses, yet detecting and preventing it is not easy. Fraud is an adaptive crime, so it requires specialised methods of intelligent data analysis for detection and prevention. Such methods exist in the fields of Knowledge Discovery in Databases (KDD), data mining, machine learning, and statistics, and they offer applicable and successful solutions to different kinds of fraud crime.
II. DATA MINING: OVERVIEW
Data mining is the search for insights in data that are statistically reliable, previously unknown, and actionable. The data must be available, relevant, adequate, and clean. The data mining problem itself must be precisely defined, not solvable by query and reporting tools alone, and guided by a data mining process model. Data mining is used to classify, cluster, and segment data and to automatically find associations and rules that may signify interesting patterns, including those related to fraud. When data mining discovers meaningful patterns, data turns into information, and that information can be used to detect the anomalies that point to fraud.

The purpose of data mining tools here is to build a knowledgeable understanding of people's private data and of the activity logs of the document issuance system. Data mining enables an all-inclusive view of the data related to one citizen, from the enrolment step to each transaction made with the identity documents. Data mining tools take data and construct a depiction of reality in the form of a model; the resulting model describes the patterns and relationships present in the data. From a process orientation, data mining activities fall into three general categories:

Discovery - the process of finding hidden patterns in a database without a predetermined hypothesis about what the patterns may be.

Analytical Modelling - the process of taking patterns found in the database and using them to predict the future.

Criminal Analysis - the process of applying the mined patterns to find anomalous or unusual data elements.

Data mining techniques can help companies in many fields by mining their growing databases for useful, detailed transaction information. The use of data mining technology reduces the workload of analysts and lets them focus on investigating the activities or individuals that have been flagged as suspicious.

III. LITERATURE SURVEY
Logistic regression, neural networks, and Support Vector Machines (SVMs) cannot achieve scalability or handle the extreme class imbalance [1] found in credit application data streams. Because fraudulent and lawful behaviour changes frequently, such classifiers quickly fall out of date and the supervised classification algorithms must be retrained on new data. The training time for real-time credit application fraud detection is also very high, since the new training data contain many derived numerical attributes and very few known frauds.

Many other data mining algorithms have been applied to fraud detection. Case-based reasoning (CBR) [6] is an earlier published approach to the screening of credit applications. CBR analyses the hardest cases, those misclassified by existing methods, using thresholded nearest-neighbour matching for retrieval and multiple selection criteria and resolution strategies to analyse the retrieved cases. Peer group analysis [5] examines inter-account behaviour over time: it compares the cumulative mean weekly amount of a target account with that of similar accounts at subsequent time points, and the suspicion score is the standardised distance from the centre of the peer group, compared against a threshold. Break point analysis [5] examines intra-account behaviour over time, spotting sudden increases in weekly spending within a single account; accounts are ranked using the t-test. Bayesian networks [3] detect simulated anthrax attacks from real emergency department data. Wong [2] surveys algorithms for finding suspicious activity early enough to catch disease outbreaks. Goldenberg et al. [4] use time series analysis to track early symptoms of synthetic anthrax outbreaks from daily sales of retail medication and some grocery items. In summary, existing methods rely on supervised learning algorithms such as neural networks and SVMs, which cannot achieve scalability and need retraining on new data.

IV. PROPOSED METHODOLOGY
The previous methods use supervised learning algorithms and therefore suffer from poor scalability and the need for fresh training data. For identity theft detection in the application fraud domain, where new data streams arrive continuously, unsupervised algorithms are more useful because no retraining is required. Two unsupervised algorithms are therefore used, namely Communal Detection and Spike Detection; their operation is explained with the help of the system architecture figure. As shown in the figure, a dataset of credit application records is taken as input, records are selected by timestamp, and the Communal Detection (CD) and Spike Detection (SD) algorithms then produce a combined suspicion score, as sketched below. The CD and SD algorithms are explained as follows.

Fig. System Architecture
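The architecture figure is not reproduced in this transcription. As a rough illustration of the flow it depicts, the following Python sketch processes applications in timestamp order over a moving window and combines the two scores. The cd_score and sd_score callables, the additive combination, and the window size are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the scoring pipeline described above (not the authors' code).
# `cd_score` and `sd_score` are hypothetical stand-ins for the CD and SD
# algorithms detailed in the sections that follow. Each record is assumed to
# carry a "timestamp" attribute.

from typing import Callable, Dict, List

Application = Dict[str, str]  # one credit application record, keyed by attribute name


def score_stream(applications: List[Application],
                 cd_score: Callable[[Application, List[Application]], float],
                 sd_score: Callable[[Application, List[Application]], float],
                 window_size: int = 100) -> List[float]:
    """Process applications in timestamp order and emit a combined suspicion score."""
    window: List[Application] = []          # moving window of previously scored applications
    combined_scores: List[float] = []
    for app in sorted(applications, key=lambda a: a["timestamp"]):
        cd = cd_score(app, window)          # multi-attribute (communal) evidence
        sd = sd_score(app, window)          # single-attribute (spike) evidence
        combined_scores.append(cd + sd)     # simple additive combination, assumed here
        window.append(app)
        if len(window) > window_size:
            window.pop(0)                   # keep only the most recent applications
    return combined_scores
```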
CD Algorithm: This algorithm helps a bank when applications come in from users; specifically, it is used to detect duplicate applicants who reapply after changing, for example, their name or mobile number.

The problem arises when a name sounds nearly the same as a previously submitted name and comes with the same home contact number, an identical address, and the same area: the applicant may be attempting card fraud. The mechanism maintains a white-list of users whose data is not similar to that of other users; when near-identical data is found, the applicant is flagged for manual verification, a step beyond plain communal detection. The need for communal detection is as follows: when two applications contain very similar records with only minute changes, they may be related, or the same person may be applying twice. Communal Detection is the method that handles such cases. It works on a fixed set of attributes and uses a white-list oriented approach; communal relationships are records with near-identical values on the chosen attributes. The white-list is built from entities that show a higher probability of communal relationships.

The algorithm takes as input the exponential smoothing factor, the input size threshold, the state of alert, the string similarity threshold, the attribute threshold, the exact duplicate filter, the link-types in the existing white-list, the moving window, and the current application, and returns as output the suspicion score, any parameter change, and the new white-list. The steps of the CD algorithm are as follows. Let Vi be the current unscored application, labelled with the set of attributes (a1, a2, ..., an), which is compared with a previously scored application Vj (a1, a2, ..., an).

1. Attribute Vector: Find attributes that exceed the string similarity threshold and generate multi-attribute links against the link types in the current white-list when their duplicate similarity exceeds the attribute threshold. This first step of the CD algorithm matches every current application's values against a moving window of previous applications' values to find links:

S(ei,j) = 1 if sim(ai, aj) > tsim, and 0 otherwise
S(E) = Σ S(ei,j)

where S(E) is the attribute weight score and ei,j is the single-attribute match between the current value and a previous value. The first case uses the Jaro-Winkler similarity, a case-sensitive string-matching method that also cross-references the current value against previous values of an additional similar attribute; the second case covers a non-match, where the values are not alike.

2. Current Weight: Using the multi-attribute links from Step 1, compute the single link score. This step of the CD algorithm accounts for the attribute weights and matches every current application link against the white-list to discover communal relationships, reducing their link score accordingly. Here Wk is the overall current weight between Vi and Vj, and Rx;link-type is the link type in the current white-list. The weight formula contains three cases: the first uses the attribute weights, the second gives the link score from the grey-list, and the third checks whether a multi-attribute link exists at all.

3. Average Previous Score: Using the applications linked in Step 1, compute the average of their previous scores. In this step the CD algorithm calculates the scores of all linked previous applications for inclusion in the current application's score; the scores of previous steps act as the established baseline. S(Vj) is the average previous score of the previous application and EO(Vj) is its number of out-links. In this equation, the first case computes each earlier application's average score, while the second case applies when there is no multi-attribute link.

4. Suspicion Score: Combine the results of the previous steps. The fourth step of the CD algorithm calculates each current application's score from every link together with the previous application's score:

S(Vi) = S(E) + wi,j + S(Vj)

so the score of every current application is computed from its links and the prior applications' scores.

5. Data Quality Improvement: The adaptive CD algorithm changes the effectiveness of one random parameter, based on the suspicion score, to improve efficiency.
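As a concrete illustration of CD steps 1 and 4, the Python sketch below counts single-attribute matches above a similarity threshold and folds them into a per-link suspicion score. It uses difflib's SequenceMatcher as a stand-in for the Jaro-Winkler measure named in the paper; the attribute names, threshold value, link weight, and sample records are illustrative assumptions rather than the authors' data.

```python
# Sketch of CD steps 1 and 4. SequenceMatcher.ratio() stands in for the
# Jaro-Winkler similarity used by the paper; thresholds are illustrative.

from difflib import SequenceMatcher
from typing import Dict

Application = Dict[str, str]

T_SIM = 0.9  # string-similarity threshold t_sim (assumed value)


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity (the paper uses Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def attribute_matches(current: Application, previous: Application) -> int:
    """Step 1: S(E) = number of single-attribute matches exceeding t_sim."""
    return sum(
        1
        for attr in current.keys() & previous.keys()
        if attr != "timestamp" and similarity(current[attr], previous[attr]) > T_SIM
    )


def cd_suspicion(current: Application, previous: Application,
                 link_weight: float, previous_score: float) -> float:
    """Step 4: S(Vi) = S(E) + w(i,j) + S(Vj) for one link to a previous application."""
    return attribute_matches(current, previous) + link_weight + previous_score


# Hypothetical example: two near-duplicate applications forming a multi-attribute link.
a1 = {"name": "A. Kumar", "phone": "9876543210", "address": "12 MG Road, Nagpur"}
a2 = {"name": "A Kumar",  "phone": "9876543210", "address": "12 M G Road, Nagpur"}
print(cd_suspicion(a2, a1, link_weight=0.5, previous_score=0.0))
```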

SD Algorithm: The spike detection process is essential for making the proposed credit crime detection solution adaptive and resilient. Spike detection complements communal detection by providing the attribute weights. The algorithm takes as input the current application, the current step, the time difference filter, the similarity threshold, and the exponential smoothing factor, and returns as output the suspicion score along with the attribute weights. The steps of the SD algorithm are as follows, where Vi is the current unscored application.

1. Attribute Vector: The current application's values are checked against prior applications to discover links, using

S(ei,j) = 1 if sim(ai, aj) > tsim, and 0 otherwise
S(E) = Σ S(ei,j)

where ei,j is the single-attribute match between the current value and a previous value. The first case uses the Jaro-Winkler similarity, a case-sensitive string-matching method that also cross-references the current value against previous values of an additional similar attribute, and time(.) gives the time difference measured in minutes. The second case applies to dissimilar values, or to values that are not constant or that recur too quickly.

2. Single Value Spike Detection: Based on the matches from Step 1, the current value's score is calculated. Every single current value's score is computed by aggregating over the steps to find spikes, with the earlier steps acting as the established baseline level:

S(ai) = (1 - α) * S(ai,k) + α * ( Σt St(ai,k) / (t - 1) )

where S(ai,k) is the current value score and α is the exponential smoothing factor.

3. Multiple Value Score: In this step each current application's score is calculated from all of its value scores together with the attribute weights:

S(Vi) = Σ S(ai) * Wk

where S(Vi) is the current application's SD suspicion score.

4. CD Attribute Weights Update: At the end of each mini discrete data stream, this step of the SD algorithm updates the attribute weights used by CD; at the end of the current application, the CD weight is updated, where Wk is the SD attribute weight applied to the CD attributes.
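The smoothing and weighting in SD steps 2 and 3 can be illustrated with the short Python sketch below; the smoothing factor, attribute weights, and sample scores are assumed values for illustration rather than parameters taken from the paper.

```python
# Sketch of SD steps 2 and 3: an exponentially smoothed per-value spike score,
# followed by a weighted sum across attributes. Alpha and weights are illustrative.

from typing import Dict, List

ALPHA = 0.3  # exponential smoothing factor (assumed value)


def smoothed_value_score(current_score: float, previous_scores: List[float],
                         alpha: float = ALPHA) -> float:
    """Step 2: smooth the current value's score against the baseline of earlier steps."""
    if not previous_scores:
        return current_score
    baseline = sum(previous_scores) / len(previous_scores)
    return (1 - alpha) * current_score + alpha * baseline


def sd_suspicion(value_scores: Dict[str, float],
                 attribute_weights: Dict[str, float]) -> float:
    """Step 3: S(Vi) = sum over attributes of value score times attribute weight."""
    return sum(score * attribute_weights.get(attr, 0.0)
               for attr, score in value_scores.items())


# Hypothetical example: a sudden run of near-identical phone numbers spikes that attribute.
scores = {"name": smoothed_value_score(0.1, [0.1, 0.2]),
          "phone": smoothed_value_score(0.9, [0.1, 0.1])}
weights = {"name": 0.4, "phone": 0.6}
print(sd_suspicion(scores, weights))
```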

V. RESULT
The experiment uses the communal fraud scoring dataset, a file containing 21 attributes and 16,205 records. The records were filtered for redundant or duplicate entries; since the dataset contains no such false records, all 16,205 records were stored in the database.

From the available timestamps, a date is selected and the applications for that date are retrieved; here the date 1/1/2004 was selected, containing 35 records. For this date, link types were generated based on the similarity between the records. If two records have very similar attributes, the pair is treated as fraud; specifically, if three attributes match between two records, it is flagged as fraud. Records 0 and 2 share more than three similar attributes, so the pair is flagged as a crime. Weights were then calculated for the link types, forming the white-list. From this white-list the attribute weights were calculated, and based on these weights a single link was produced and the suspicion score computed. The parameters were then changed based on the suspicion score.

After the parameter change, Spike Detection (SD) was also applied to the same date. For the 35 records of that date, link types and weights were calculated again; this time a link type was created only if a record contained four exactly matching attributes. After the link types were calculated, the weights were calculated, the value scores were combined into the multiple score, and, based on the SD weights, the Communal Detection (CD) attribute weights were updated. At the end of every processed data batch, the SD algorithm recalculates and updates the attribute weights used by CD.

VI. CONCLUSION
The system detects fraud in online credit card and loan applications. It is used to screen out duplicates and fraudsters when someone applies for a credit card or a loan. Data mining algorithms, namely Communal Detection and Spike Detection, are used to detect multiple applicants. Combining the Spike Detection and Communal Detection algorithms makes the system more efficient and secure. The identity thief is left with limited time, because innocent people can detect the fraud early and the victim can quickly take the necessary action.

REFERENCES
[1] D. Hand, "Classifier Technology and the Illusion of Progress," Statistical Science, vol. 21, no. 1, pp. 1-15, 2006, doi: 10.1214/088342306000000060.
[2] W. Wong, "Data Mining for Early Disease Outbreak Detection," PhD thesis, Carnegie Mellon Univ., 2004.
[3] W. Wong, A. Moore, G. Cooper, and M. Wagner, "Bayesian Network Anomaly Pattern Detection for Detecting Disease Outbreaks," Proc. 20th Int'l Conf. Machine Learning (ICML '03), pp. 808-815, 2003.
[4] A. Goldenberg, G. Shmueli, R. Caruana, and S. Fienberg, "Early Statistical Detection of Anthrax Outbreaks by Tracking Over-the-Counter Medication Sales," Proc. Nat'l Academy of Sciences USA (PNAS '02), vol. 99, no. 8, pp. 5237-5240, 2002.
[5] R. Bolton and D. Hand, "Unsupervised Profiling Methods for Fraud Detection," Statistical Science, vol. 17, no. 3, pp. 235-255, 2001.
[6] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java. Morgan Kaufmann, 2000.
[7] A. Bifet and R. Kirkby, Massive Online Analysis, Technical Manual, Univ. of Waikato, 2009.
[8] Experian, Experian Detect: Application Fraud Prevention System, Whitepaper, http://www.experian.com/products/pdf/experian_detect.pdf, 2008.
[9] T. Fawcett, "An Introduction to ROC Analysis," Pattern Recognition Letters, vol. 27, pp. 861-874, 2006, doi: 10.1016/j.patrec.2005.10.010.