Recognizing The Theft of Identity Using Data Mining

Aniruddha Kshirsagar 1, Lalit Dole 2
1,2 CSE Department, GHRCE, Nagpur, Maharashtra, India

Abstract

Identity fraud is a major concern in e-commerce. It is more than a security issue: it is a financial burden, because in both the transaction and application domains a culprit can seriously harm victims by using their private information for economic gain. Application fraud is a prominent example of identity fraud, in which the thief uses a victim's personal information to open a credit card account or obtain a loan. To counter this problem, a two-step recognition system based on data mining is proposed. The system contains two algorithms: Communal Detection (CD), which checks for multi-attribute links, and Spike Detection (SD), which checks for single-attribute links. The CD algorithm targets communal relationships in the dataset, while the SD algorithm finds spikes among duplicates in the dataset. Together, these two algorithms can be used to detect identity theft in application fraud and to detect several kinds of attack.

Keywords: Anomaly Detection, Application Domain, Data Stream, Identity Theft.

I. INTRODUCTION

Identity crime may occur when someone steals a victim's personal information to open credit card accounts or take out loans in the victim's name without authorization, and obtains products through those accounts. Identity crime is a form of unlawful identity change. It refers to unauthorized activities that use the identity of another person, or of a non-existent person, as the primary tool for procuring products. Identity crime can be committed by forging the related documents in two ways. The first is the issuance of a genuine identity document under a synthetic identity. A synthetic identity is built from several altered or forged identity documents, which are used to prove one's identity at the enrollment step.
Synthetic identity fraud is the act of creating a virtual identity in order to perpetrate criminal activities. The second way is the illegitimate use of a genuine identity document. Such documents can be harder to obtain but are easier to use successfully. In practice, identity crime can be committed with a mix of synthetic and real identity details. The manufacture and use of forged identity documents is a financial burden for governments, social welfare institutions, and financial institutions.

In the financial field, identity crime is a major concern in two domains: Transaction and Credit Application. The Transaction domain covers identity crime in online financial transactions made by credit card or online banking. Here the fraudster, disguised as the victim, carries out the transaction online using the victim's credit card or bank details. In the Credit Application domain, identity fraud occurs when someone applies for a credit card, mortgage loan, or home loan with false information.

Fraud involving e-commerce activities such as credit card transactions poses significant problems for governments and businesses, yet detecting and preventing it is not easy. Fraud is an adaptive crime, so detecting and preventing it requires special methods of intelligent data analysis. These methods exist in the fields of Knowledge Discovery in Databases (KDD), Data Mining, Machine Learning, and Statistics. They offer applicable and successful solutions in different areas of fraud.

II. DATA MINING: OVERVIEW

Data mining is the search for insights in data that are statistically reliable, previously unknown, and actionable. The data must be accessible, relevant, sufficient, and clean. In addition, the data mining problem must be precisely defined, must not be solvable by query and reporting tools, and must be guided by a data mining process model.
Data mining is used to classify, cluster, and segment data and to automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud. When data mining discovers meaningful patterns, data turns into information, and this information is used to detect the anomalies that indicate fraud. The purpose of data mining tools here is to gain an informed understanding of people's private data and of the activity logs of the document issuance system. Data mining enables an all-inclusive view of the data related to one citizen, from the enrollment step to each transaction made with the identity documents. Data mining tools take data and construct a representation of reality in the form of a model. The resulting model describes the patterns and relationships that exist in the data. From a process perspective, data mining activities fall into three general categories:

Discovery: the process of finding hidden patterns in a database without any predetermined information or hypothesis about what the patterns may be.

Analytical Modelling: the process of taking patterns found in the database and using them to predict the future.

Criminal Analysis: the process of applying the mined patterns to find anomalous or unusual data elements.

Data mining techniques can help companies in various fields by mining their expanding databases for useful, detailed transaction information. The use of data mining technology helps reduce the workload of analysts and lets them focus on investigating activities or individuals that have been tagged as suspicious.

III. LITERATURE SURVEY

Logistic regression, neural networks, and Support Vector Machines (SVM) cannot achieve scalability or handle the extremely imbalanced classes [1] in credit application data streams. Because fraudulent and legitimate behavior change frequently, classifiers degrade quickly, and supervised classification algorithms must be retrained on the new data. The training time for real-time credit application fraud detection is very high, because the new training data have too many derived numerical attributes and too few known frauds.

Separately, many data mining algorithms have been used in fraud detection. Case-based reasoning (CBR) [6] is a known prior approach to the screening of credit applications. CBR analyzes the hardest cases, those that have been misclassified by existing methods and techniques. For retrieval, it uses threshold nearest-neighbor matching. For analysis, it applies multiple selection criteria and resolution strategies to the retrieved cases. Peer group analysis [5] monitors inter-account behavior over time. It compares the cumulative mean weekly amount between a target account and other similar accounts at subsequent time points. The suspicion score reflects the standardized distance from the center of the peer group. Break point analysis [5] monitors intra-account behavior over time.
It spots sudden increases in weekly spending within a single account. Accounts are ranked using the t-test. Bayesian networks [3] detect simulated anthrax attacks in real emergency department data. Wong [2] surveys algorithms for the timely detection of suspicious activity in disease outbreaks. Goldenberg et al. [4] use time series analysis to track early symptoms of synthetic anthrax outbreaks from daily sales of retail medication and some grocery items. Existing methods use supervised learning algorithms such as neural networks and SVM, but these cannot achieve scalability, and a supervised learning algorithm needs to be retrained on new data.

IV. PROPOSED METHODOLOGY

Because previous methods use supervised learning algorithms, they suffer from limited scalability and need new training data. In the application fraud domain, where new data streams in continuously, unsupervised algorithms are considered more useful for identity theft detection because they do not require retraining on new data. Two unsupervised algorithms are used for this purpose: communal detection and spike detection. Their operation is explained with the help of the system architecture figure. As the figure shows, the input is a dataset of credit application records. The records are taken in timestamp order, and the Communal Detection (CD) and Spike Detection (SD) algorithms then produce a combined suspicion score. The CD and SD algorithms are explained as follows.

Fig. System Architecture

CD Algorithm: This algorithm helps the bank when applications arrive from users. Specifically, it is used to check for duplicate users, that is, users who reapply after changing their name or mobile number.

The problem is as follows: when a name sounds nearly the same as a previously submitted name, and the application comes from the same home contact number, the same address, and the same area, the user may be attempting a scam with the card. This mechanism produces a white list of users whose data is not at all similar to the data of other users. If any identical data is found, the user is blacklisted and sent for manual verification, which goes a step beyond normal communal detection.

The need for communal detection arises as follows. When two applications contain similar records with only minute changes, they may be related, or the same person may be applying twice. Communal detection is a method that checks for such cases. It works on a fixed set of attributes and uses a white-list-oriented approach. Communal relationships are records that have near-identical values on the chosen attributes. A white list is constructed from entities that show a higher probability of communal relationships.

The algorithm takes as input the exponential smoothing factor, the input size threshold, the state of alert, the string similarity threshold, the attribute threshold, the exact duplicate filter, the link types in the existing white list, the moving window, and the current application. It returns the suspicion score, the parameter change, and the new white list. The steps of the CD algorithm are given below.

Let $V_i$ be the current unscored application, described by the attributes $(a_1, a_2, \ldots, a_n)$. It is compared with each previously scored application $V_j$ with attributes $(a_1, a_2, \ldots, a_n)$.

1. Attribute Vector: Find the attributes that exceed the string similarity threshold, and generate multi-attribute links against the link types in the current white list when the similarity of their duplicates exceeds the attribute threshold.
The first step of the CD algorithm matches every value of the current application against a moving window of previous applications' values to find links.

$$S(e_{i,j}) = \begin{cases} 1 & \text{if } \mathrm{sim}(a_i, a_j) > t_{sim} \\ 0 & \text{otherwise} \end{cases} \qquad S(E) = \sum S(e_{i,j})$$

Here $S(E)$ is the attribute weight score and $e_{i,j}$ is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.), a case-sensitive similarity measure that can also be cross-matched between the current value and previous values from another similar attribute. The second case is a non-match, because the values are not alike.

2. Current Weight: Using the multi-attribute links from the first step, compute the single link score. This step of the CD algorithm accounts for the attribute weights. It also matches every link of the current application against the white list to discover communal relationships, and it reduces their link score accordingly. Here $W_k$ is the overall current weight between $V_i$ and $V_j$, and $R_{x,\text{link-type}}$ is the link type in the current white list. The formula for this step has three cases. The first uses the attribute weights. The second gives the link score of the grey list. The third checks whether there are any multi-attribute links.

3. Average Previous Score: Using the applications linked in Step 1, compute the average of their previous scores. This step of the CD algorithm calculates the score of every linked previous application for inclusion in the current application's score. The scores of previous steps act as the established baseline. $S(V_j)$ is the average previous score of the previous application, and $E_O(V_j)$ is the number of outlinks from the previous application. In the equation for this step, the first case computes each previous application's average score, while the second case applies if there is no multi-attribute link.

4. Suspicion Score: The suspicion score is computed from the results of the preceding steps.
The fourth step of the CD algorithm calculates every current application's score using every link and every previous application's score.

$$S(V_i) = S(E) + w_{i,j} + S(V_j)$$

Here the score of each current application is calculated from the previous application's score and every link that is present.
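The attribute matching of step 1 and the score of step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` is used as a stand-in for Jaro-Winkler (which is not in the Python standard library), $S(E)$ is taken to be the sum of the single-attribute matches, and the threshold, field names, link weight, and previous score are hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher

T_SIM = 0.8  # hypothetical string-similarity threshold (t_sim in the text)


def sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper uses Jaro-Winkler instead."""
    return SequenceMatcher(None, a, b).ratio()


def attribute_matches(current: dict, previous: dict, t_sim: float = T_SIM) -> dict:
    """Step 1: S(e_ij) per shared attribute, 1 if sim > t_sim, else 0."""
    return {
        key: int(sim(current[key], previous[key]) > t_sim)
        for key in current
        if key in previous
    }


def cd_suspicion_score(matches: dict, link_weight: float, prev_score: float) -> float:
    """Step 4: S(V_i) = S(E) + w_ij + S(V_j).

    S(E) is assumed here to be the sum of the single-attribute matches.
    """
    s_e = sum(matches.values())
    return s_e + link_weight + prev_score


# Hypothetical applications: same phone, near-identical name, different address.
current = {"name": "John Smith", "phone": "5551234", "address": "12 Oak Street"}
previous = {"name": "Jon Smith", "phone": "5551234", "address": "98 Elm Road"}

matches = attribute_matches(current, previous)
# link_weight and prev_score stand in for the outputs of steps 2 and 3.
score = cd_suspicion_score(matches, link_weight=0.5, prev_score=0.25)
print(matches, score)
```

In this example the name and phone attributes exceed the threshold while the address does not, so two single-attribute matches contribute to the score. A real deployment would replace `sim` with a Jaro-Winkler implementation and derive the weight and previous score from the white list and the moving window.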

5. Data Quality Improvement: The adaptive CD algorithm changes the value of one random parameter based on the suspicion score, trading off effectiveness against efficiency.

SD Algorithm: The spike detection process is essential to improving the adaptivity and resilience of the proposed solution for credit crime detection. Spike detection complements communal detection by providing attribute weights. The algorithm takes as input the current application, the current step, the time difference filter, the similarity threshold, and the exponential smoothing factor. It returns the suspicion score and the attribute weights. The steps of the SD algorithm are given below, where $V_i$ is the current unscored application.

1. Attribute Vector: The current application's value is checked against previous applications' values to discover links, using the following equation.

$$S(e_{i,j}) = \begin{cases} 1 & \text{if } \mathrm{sim}(a_i, a_j) > t_{sim} \\ 0 & \text{otherwise} \end{cases} \qquad S(E) = \sum S(e_{i,j})$$

Here $e_{i,j}$ is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.) (Gordon et al, 2007), a case-sensitive similarity measure that can also be cross-matched between the current value and previous values from another similar attribute, and Time(.), which is the time difference measured in minutes. The second case is a non-match, because the values are dissimilar or recur too quickly.

2. Single Value Spike Detection: Based on the matches from the first step, the current value's score is calculated. This step calculates the score of every single current value by combining all steps to find spikes. The earlier steps act as the established baseline level.

$$S(a_{i,k}) = (1-\alpha)\, S_t(a_{i,k}) + \alpha \cdot \frac{\sum_{\tau=1}^{t-1} S_\tau(a_{i,k})}{t-1}$$

Here $S(a_{i,k})$ is the current value score and $\alpha$ is the exponential smoothing factor.

3. Multiple Score Value: In this step, every current application's score is calculated from all of its value scores together with the attribute weights.
$$S(V_i) = \sum_k S(a_{i,k})\, w_k$$

Here $S(V_i)$ is the SD suspicion score of the current application.

4. CD Attribute Weights Change: At the end of every current mini discrete data stream, this step of the SD algorithm updates the attribute weights used for CD. Here $w_k$ is the SD attribute weight applied to the CD attributes.

V. RESULT

The communal fraud scoring data set is used; the file contains records with 21 attributes. The records are filtered for redundant or duplicate records. This data set does not contain any such false records, so all of the records are stored in the database.
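To make the SD scoring applied to these records more concrete, steps 2 and 3 of the SD algorithm in Section IV can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes that the smoothing formula blends the current step's score with the mean of the earlier steps, and that the application score is a weighted sum over attributes. The function names, the per-step score histories, and the weights are hypothetical.

```python
def smoothed_value_score(step_scores: list, alpha: float) -> float:
    """Step 2: blend the current step's score with the mean of earlier steps.

    step_scores holds one score per step, oldest first, and must be non-empty.
    With no earlier steps there is no baseline, so the current score is
    returned unchanged (an assumption of this sketch).
    """
    *earlier, current = step_scores
    if not earlier:
        return current
    baseline = sum(earlier) / len(earlier)
    return (1 - alpha) * current + alpha * baseline


def sd_suspicion_score(value_scores: dict, weights: dict) -> float:
    """Step 3: weighted sum of the per-attribute value scores."""
    return sum(value_scores[key] * weights[key] for key in value_scores)


# Hypothetical per-step match counts: the phone value spikes in the last step.
history = {"phone": [1.0, 1.0, 5.0], "name": [2.0, 2.0, 2.0]}
weights = {"phone": 0.75, "name": 0.25}  # hypothetical attribute weights

value_scores = {
    key: smoothed_value_score(scores, alpha=0.5)
    for key, scores in history.items()
}
sd_score = sd_suspicion_score(value_scores, weights)
print(value_scores, sd_score)
```

The phone value, whose count jumps well above its baseline in the last step, ends up with a higher value score than the steady name value. A larger smoothing factor gives more weight to the baseline and so damps the effect of a single spike.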

From the given timestamps, a date is selected in order to obtain the applications for that date. Here the date 1/1/2004 is selected, which contains 35 records. For this date, link types are generated based on the similarity between the records. If records have very similar attributes, they are taken as fraud: if three attributes match between two records, the pair is treated as fraud. Records 0 and 2 have more than three similar attributes, so they are flagged as crime. The weight is also calculated for these links. This is the weight of the link types, and the result is called the white list. For this white list the attribute weight is calculated, and based on this weight a single link is produced and the suspicion score is calculated. Based on the suspicion score, the parameters are changed.

After the parameters are changed, spike detection (SD) is also applied for this date. The link type and weight are calculated again for the 35 records of this date. This time, a link type is created if the records contain four exactly matching attributes. After the link type is calculated, the weight is calculated. Based on the output of the Communal Detection (CD) algorithm, Spike Detection (SD) is updated and its weight is updated. The weight is calculated first, and then the multiple score. Based on the SD weight, the CD weight is updated. At the end of every current data process, the SD algorithm calculates and updates the attribute weights for CD.

VI. CONCLUSION

The system detects fraud in online credit card and loan applications. It is used to guard against duplicates and fraudsters when a credit card or loan is applied for. Data mining algorithms are used in this system. These algorithms, namely communal detection and spike detection, are used to detect multiple applications from the same applicant. Combining the spike detection and communal detection algorithms makes the system more efficient and secure.
The identity thief has limited time, because innocent people can detect the fraud early and the victim can quickly take the necessary action.

REFERENCES

[1] D. Hand, Classifier Technology and the Illusion of Progress, Statistical Science, vol. 21, no. 1, pp. 1-15.
[2] W. Wong, Data Mining for Early Disease Outbreak Detection, PhD thesis, Carnegie Mellon Univ.
[3] W. Wong, A. Moore, G. Cooper, and M. Wagner, Bayesian Network Anomaly Pattern Detection for Detecting Disease Outbreaks, Proc. 20th Int'l Conf. Machine Learning (ICML '03).
[4] A. Goldenberg, G. Shmueli, R. Caruana, and S. Fienberg, Early Statistical Detection of Anthrax Outbreaks by Tracking Over-the-Counter Medication Sales, Proc. Nat'l Academy of Sciences USA (PNAS '02), vol. 99, no. 8.
[5] R. Bolton and D. Hand, Unsupervised Profiling Methods for Fraud Detection, Statistical Science, vol. 17, no. 3.
[6] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java. Morgan Kaufmann.
[7] A. Bifet and R. Kirkby, Massive Online Analysis, Technical Manual, Univ. of Waikato.
[8] Experian, Experian Detect: Application Fraud Prevention System, Whitepaper, f/ experian_detect.pdf.
[9] T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition Letters, vol. 27, 2006, doi: /j.patrec