Online Fraud Detection Model Based on Social Network Analysis


Journal of Information & Computational Science :7 (2015) May, 2015

Peng Wang a,b, Ji Li a,b, Bigui Ji a,b
a College of Computer Science, Chongqing University, Chongqing, China
b Key Laboratory for Dependable Service Computing in Cyber Physics Society, Ministry of Education, Chongqing, China

Abstract

With the rapid development of the Internet, the way we live and think has changed. Owing to anonymity and the low cost of legal sanctions, e-commerce has been booming on the Internet. Unfortunately, rapid commercial success has made e-commerce sites a lucrative medium for committing fraud. We therefore propose a method for detecting and preventing fraud carried out through fraudulent platforms. First, we implemented a parallel web crawling agent to collect real user and transaction data. Second, we proposed the Reverse Graph and Common Trade Cumulative Graph (CTCG) theories to extract features of common transactions. Third, we extracted graph-level features based on the PageRank and K-core clustering algorithms, replaced the PageRank values with more reasonable TrustRank values, and added BadRank values to identify potentially fraudulent users. Finally, we conducted a series of experiments using Random Forest and verified the performance of our method on real transaction cases. In summary, the proposed model is effective in identifying potentially fraudulent users on fraudulent platforms.

Keywords: Fraudulent Platform; Social Network Analysis; Reverse Graph; CTCG; Random Forest

1 Introduction

With the rapid development of the Internet, the way we live and think has changed. Owing to anonymity and the low cost of legal sanctions, e-commerce has been booming on the Internet, and people around the world trade commodities worth millions of dollars every day.
eBay, the world's largest online e-commerce trading platform, announced its third-quarter 2013 earnings report [1], which shows that revenue for the quarter was $3.9 billion. Unfortunately, according to an Internet Fraud Report issued by the Internet Crime Complaint Center (IC3) [2], a joint operation between the FBI and the National White Collar Crime Center (NW3C), the number of complaints about Internet fraud increased from 4,4 per month in 0 to 4,5 per month in 0.

(Corresponding author. Email address: [email protected] (Peng Wang). Copyright 2015 Binary Information Press. DOI: 0.733/jics00590)

From January, 0 to December 3, 0, IC3 received a total

of 9,74 complaints, representing a 39.4% (4,90) increase over the previous year. The total amount lost increased from $7. million in 00 to $55 million in 0. Among the assorted fraud types, non-delivery of merchandise ranked first (.%). These figures indicate that online fraud causes significant losses.

Despite the prevalence of online transaction fraud, trading platforms offer no systematic solution for identifying fraudsters; they rely only on a user's transactions and personal information to judge the user's credibility. The most popular e-commerce trading platforms, such as eBay and, in China, Taobao, use a reputation mechanism based on accumulated feedback. Users can take advantage of anonymity and low online auction fees to create multiple accounts and raise their rating scores through sham transactions, and then deceive buyers with the resulting high scores. Beyond such fraudulent groups, a now more popular approach is to boost credibility through a fraudulent platform. In this sense, the feedback mechanism itself promotes fraud.

Several approaches address related problems. Rubin et al. [3] proposed a new reputation system for online trading sites in 2005. Wang & Chiu [5] studied Internet auction fraud, focusing on detecting the abnormal rating behavior of known auction fraud groups. J. S. Chang et al. [4] proposed a segment model based on the timeline. S. J. Lin et al. [6] combined a ranking concept with social network analysis to detect collusive groups in online auctions. However, none of these provides a way to detect fraud carried out through fraudulent platforms.

We therefore propose a method that can detect such platform-based fraud. First, we implemented a parallel web crawling agent to collect real user and transaction data.
Second, we proposed the Reverse Graph and Common Trade Cumulative Graph theories to extract features of common transactions. Third, we extracted graph-level features based on the PageRank and K-core clustering algorithms, replaced the PageRank values with more reasonable TrustRank values, and added BadRank values to identify potentially fraudulent users. Finally, we conducted experiments using Random Forest and verified the performance of our method on real transaction cases. In summary, the proposed model is effective in identifying potentially fraudulent users on fraudulent platforms.

2 Related Work

2.1 Social Network Analysis

PageRank

The PageRank algorithm uses the link information between pages to assign each page a global importance score [7]. Its main idea is that the importance of a page is closely tied to the pages pointing to it, so page importance is interactive and mutually reinforcing. The PageRank score of a page p is defined as:

$$r(p) = \alpha \sum_{q:(q,p)\in E} \frac{r(q)}{o(q)} + (1-\alpha)\,\frac{1}{N} \qquad (1)$$

where $\alpha$ is the damping (attenuation) coefficient, $o(q)$ is the number of out-links of page q, and N is the number of pages. In matrix form, this can be written as:

$$r = \alpha\, M\, r + (1-\alpha)\,\frac{1}{N}\,\mathbf{1}_N \qquad (2)$$

where $\mathbf{1}_N$ is the vector of N ones. As Eq. (1) shows, the PageRank score of a page p consists of two parts: one comes from the pages pointing to p; the other is static and equal for all pages. The PageRank scores of all pages can be computed from Eq. (2) by iteration, which converges in the strict mathematical sense. In practice, however, a fixed number of iterations M is usually set. In ordinary PageRank the static score is the same for every page and does not change during iteration. The static scores may also be unequal, obtained from prior knowledge or statistical analysis, in which case the formula becomes:

$$r = \alpha\, T\, r + (1-\alpha)\, d \qquad (3)$$

where T denotes the same transition matrix written as M in Eq. (2), and d is a static vector satisfying $d(i) \ge 0$ and $\sum_i d(i) = 1$ $(i = 1, \ldots, N)$; $d(i)$ is the static score of the i-th page. This calculation method is called Biased PageRank.

TrustRank

TrustRank is a link-analysis technique for the semi-automated detection of spam pages, published jointly by researchers from Yahoo! and Stanford University in 2004 [8]; a patent application was filed in 00. The TrustRank values of pages are computed as:

$$t = \beta\, T\, t + (1-\beta)\, s \qquad (4)$$

where t is the vector of TrustRank values of all pages, $\beta$ is the attenuation coefficient, and s is the static TrustRank vector, that is, the initial values of all pages.

BadRank

The BadRank values are computed as:

$$b = \beta\, U\, b + (1-\beta)\, s \qquad (5)$$

where U is the transpose of the original web link graph, and $\beta$ and s have the same meaning as in Eq. (4). The BadRank algorithm follows essentially the same process as the TrustRank algorithm.

K-core

K-core is a hot topic in social network research.
K-core decomposition is used to extract highly cohesive sub-structures, such as communities, groups, and cores, from complex social networks, to find the relationships among these sub-structures, and to describe the topology of real-world complex networks. In these respects, K-core is a basic and important concept. It is widely used in the social and behavioral sciences, in social network clustering [10], and in describing the evolution of sparse graphs [11]; it is also used in bioinformatics and network visualization [12]. A K-core is a maximal sub-graph in which each node is adjacent to at least k other nodes. It is also regarded as an essential complement to density measures, which may miss many features of the global network. The mathematical definition of a k-core is:

Definition 1 Let $G = (V, L)$ be a graph, where V is the set of vertices and L is the set of lines (edges or arcs). We denote $n = |V|$ and $m = |L|$. A sub-graph $H = (W, L|W)$ induced by the set W is a k-core, or a core of order k, if and only if $\forall v \in W: \deg_H(v) \ge k$, and H is the maximum sub-graph with this property.

The K-core value describes the position of a vertex (core or periphery) in graph G: the greater the K-core value of a vertex, the closer it is to the center of G [13]. Fig. 1 shows the K-core decomposition of a simple graph. The red vertices in the drawing have a K-core value of 3 and form the core of the graph.
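The K-core values described above can be computed by repeatedly removing a vertex of minimum remaining degree. The following is a minimal sketch of that peeling procedure, not the authors' implementation; the function name `core_numbers` and the small example graph are assumptions made for illustration.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: K-core value} for an undirected simple graph.

    `edges` is an iterable of (u, v) pairs. Self-loops are ignored.
    Hypothetical helper: peel a minimum-degree vertex at each step.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # The vertex with the smallest remaining degree is peeled next.
        v = min(remaining, key=degree.__getitem__)
        # The core value never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Illustrative graph: a 4-clique {a, b, c, d}, a vertex e attached to a and b,
# and a pendant vertex f attached to e.
example_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("e", "a"), ("e", "b"),
    ("f", "e"),
]
example_cores = core_numbers(example_edges)
```

In this example the clique vertices receive the value 3, `e` receives 2, and `f` receives 1. The `min` scan makes the sketch quadratic in the number of vertices; a bucket-based ordering would be the usual choice for large transaction graphs.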

Fig. 1: Sketch of the K-core decomposition for a small graph

2.2 Development of E-commerce Fraud

E-commerce fraud can be divided into multiple-account fraud, groups of fraudulent accomplices, and fraudulent platforms. Fraudsters create many accounts to avoid fraud detection. Fraudsters also make use of accomplices, who behave like honest users except that they interact heavily with a small set of fraudsters in order to boost the fraudsters' reputation. A fraudulent platform provides a common place for such users.

Fig. 2: Fraudsters and accomplices form a near Bipartite Cores graph

Fig. 3: The relationship between the users with different platforms

3 Detection Model

3.1 Reverse Graph (RG) Theory

Our analysis shows that the most popular form of fraud is platform-based fraud. To reflect this fact clearly and quantify it reasonably, we propose the Reverse Graph (RG) theory, which captures the common transactions between users.

Definition 2 Let $G = (V, E, W)$ be a graph with $e_i, e_j \in E$, $e_i = (v_{i,1}, v_{i,2})$, $e_j = (v_{j,1}, v_{j,2})$, and $v_{i,1}, v_{i,2}, v_{j,1}, v_{j,2} \in V$. If and only if $v_{i,2} = v_{j,2}$, the Reverse Graph $G' = (V', E', W')$ generated from G contains the vertices $v_{i,1}, v_{j,1} \in V'$ and the edge $e'_{i,j} = (v_{i,1}, v_{j,1}) \in E'$, and $W'(e'_{i,j})$ accumulates $\min\{W(e_i), W(e_j)\}$, where $W'(e'_{i,j})$ denotes the weight of $e'_{i,j}$, and $W(e_i)$ and $W(e_j)$ denote the weights of $e_i$ and $e_j$ in graph G, respectively.

Fig. 4 shows the transformation of a simple undirected graph into its reverse graph. Because of the data volume and the time complexity, we generate the reverse graph in parallel using MapReduce. We first build the weighted adjacency matrix A from the weighted edges E of the original graph G, and then generate the Reverse Graph G' from it.

Fig. 4: The transformation to reverse graph on a simple undirected graph. (a) G; (b) Reverse graph G'

Fig. 5: The process of generating reverse graph from original graph (original graph, edge-weighted adjacency matrix, edge-weighted reverse graph)

The weight of a vertex in the reverse graph is the sum of the weights of the edges connected to it. An edge of weight 1 carries no information, so we subtract 1 from each edge weight when computing vertex weights in the reverse graph. The weight $W'(v')$ of a reverse graph vertex $v'$ is therefore:

$$W'(v') = \sum_{(v', v'_t) \in E'} \left( W'(v', v'_t) - 1 \right) \qquad (6)$$

where $W'(v', v'_t)$ is the weight of the edge $(v', v'_t)$ in the reverse graph.

3.2 Common Trade Cumulative Graph

As analyzed above, the reverse graph works well for detecting fraud accomplices, but it is not a good measure of the behavior of the fraudsters themselves. In fraud groups, a fraudster has a number of accomplices who raise the fraudster's credibility, while the accomplices trade with honest users to raise their own. To capture this, for each seller we accumulate the common transactions among the users who trade with that seller, which gives a measure of the seller. To express this clearly and quantify it reasonably, we propose the Common Trade Cumulative Graph (CTCG).

Definition 3 Let $G = (V, E, W)$ be a directed graph and $v_k \in V$. If and only if there exist edges $e_{i,k}, e_{j,k} \in E$ connected to vertex $v_k$, with $v_i, v_j \in V$, $e_{i,k} = (v_i, v_k)$, and $e_{j,k} = (v_j, v_k)$, the CTCG is expressed as $G'' = (V'', E'', W'')$, where $W''(v_k)$ accumulates $\max\{0, W'((v_i, v_j)) - C\}$. Here C is a constant, $W'(v_i, v_j)$ is the weight of the edge $(v_i, v_j) \in E'$ in the reverse graph G' of the directed graph G, and $W''(v_k)$ is the weight of $v_k \in V''$ in graph G''.

Fig. 6 shows the transformation of a simple directed graph into its CTCG.
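Definitions 2 and 3 and the vertex-weight rule can be illustrated on a toy transaction graph. The sketch below is a single-machine illustration, not the authors' MapReduce implementation. It assumes directed edges of the form (buyer, seller, weight) with at most one edge per buyer-seller pair, that edges sharing the same second endpoint are the ones paired, and that the constant subtracted in the vertex-weight rule is 1 (the extracted text does not show the value). All function names, the default C = 1, and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def reverse_graph(edges):
    """Build the reverse graph as {frozenset({u, v}): weight}.

    For every pair of edges (u, k, wu) and (v, k, wv) that share the
    endpoint k, the reverse edge {u, v} accumulates min(wu, wv).
    """
    by_target = defaultdict(list)
    for src, dst, w in edges:
        by_target[dst].append((src, w))

    rg = defaultdict(int)
    for partners in by_target.values():
        for (u, wu), (v, wv) in combinations(partners, 2):
            if u != v:
                rg[frozenset((u, v))] += min(wu, wv)
    return dict(rg)


def reverse_vertex_weights(rg, offset=1):
    """Vertex weight in the reverse graph: sum of (edge weight - offset).

    offset=1 is an assumption about the subtracted constant.
    """
    weights = defaultdict(int)
    for pair, w in rg.items():
        for v in pair:
            weights[v] += w - offset
    return dict(weights)


def ctcg_vertex_weights(edges, rg, c=1):
    """CTCG weight of each target k: sum of max(0, W'(u, v) - c) over
    all pairs of users u, v that both have an edge into k."""
    sources = defaultdict(set)
    for src, dst, _ in edges:
        sources[dst].add(src)

    result = {}
    for k, users in sources.items():
        total = 0
        for u, v in combinations(users, 2):
            total += max(0, rg.get(frozenset((u, v)), 0) - c)
        result[k] = total
    return result


# Illustrative data: three buyers and two sellers.
example_edges = [
    ("b1", "s1", 3), ("b2", "s1", 2),
    ("b1", "s2", 1), ("b2", "s2", 4), ("b3", "s2", 1),
]
example_rg = reverse_graph(example_edges)
example_rg_weights = reverse_vertex_weights(example_rg)
example_ctcg = ctcg_vertex_weights(example_edges, example_rg, c=1)
```

Here buyers `b1` and `b2` share both sellers, so their reverse edge accumulates min(3, 2) + min(1, 4) = 3, while each pair involving `b3` gets weight 1 and contributes nothing after the subtraction. Grouping edges by their shared endpoint is also the natural key for a map step if the computation is distributed.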

Fig. 6: The transformation to CTCG on a simple directed graph. (a) Original graph G; (b) Reverse graph G'; (c) CTCG G''

3.3 The Feature Set

Building on the Reverse Graph, the Common Trade Cumulative Graph, and the SNA and user-level features introduced above, the main attributes are designed for the currently popular fraudulent platforms but can also be used to detect fraud groups. The feature set is kept relatively small. It includes a number of attributes common to buyers and sellers as well as several differentiated features, so that it represents the transaction behavior of buyers and sellers on the platform well.

Table 1: Feature set
Feature attributes; Seller; Buyer; Tips
W'(u); Y; The weight of the vertex in the reverse graph
W''(u); Y; The weight of the vertex in the CTCG
TrustRank; Y
BadRank; Y
K-core; Y; Y
Mean; Y; Y
Variance; Y; Y
Frequency; Y
Ratio of trading; Y
Shop conversion rate; Y; Obtained from the trading platform

4 Experiments

4.1 Datasets

We used crawlers to obtain user information and transaction records, and then cleaned, converted, and aggregated the data. The blacklists and whitelists used in the experiments were labeled manually. Together these data form the complete experimental data set, with a corresponding sub-data set for each statistical period. Fig. 7 shows how the data set changes by month. The number of buyers/sellers and the number of transactions correspond to the

vertical axis on the left, while the blacklist and cumulative blacklist counts correspond to the vertical axis on the right.

Fig. 7: The size of dataset among months (series: buyers/sellers, number of transactions, blacklist, cumulative blacklists; horizontal axis: month)

4.2 Model Evaluation and Analysis

Accuracy is a commonly used evaluation criterion in classification. It reflects the overall performance of a classifier on a data set, but it does not reflect performance on unbalanced data sets. For example, suppose a data set contains 1000 samples, of which 10 are positive and the rest negative. A classifier that assigns every sample to the negative class obtains 99% accuracy, yet it is useless in practice. Classification on unbalanced data sets therefore needs a more reasonable evaluation criterion. The commonly used criteria are Precision, Recall, and F-measure, which are computed from the confusion matrix in Table 2.

In the experiments, we select the data sets for training in November 0 and use the trained model to predict the December data. We vary M (the number of months) and N (the number of trees in the random forest) and observe the predicted F-measure together with the model training and prediction time. Fig. 8(a) shows the relationship between classification performance and M when the number of trees in the random forest is N = 300, and Fig. 8(b) shows the relationship between classification performance and the number of trees when M = 3. Fig. 9 shows the results of applying the proposed model to detect user behavior on the eBay platform in 0. As the number of positive samples (fraudsters) increases, the accuracy of the model in detecting future user behavior increases markedly.
The detection model obtains Precision = 5.% and Recall = .%, so using the cumulative feature vector of fraudsters is reasonable and effective. In this paper we proposed the Reverse Graph and the Common Trade Cumulative Graph, which reflect the common trading behavior of buyers and sellers and the common transaction users of sellers. Fig. 10 shows the impact of these two important features: both improve the performance of the model.

Table 2: The confusion matrix of binary classification
                    Collusive account      Normal account
Collusive account   tp (true positive)     fp (false positive)
Normal account      fn (false negative)    tn (true negative)
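The evaluation criteria named above follow directly from the four cells of Table 2. The sketch below uses the standard definitions and assumes the balanced F-measure (F1), since the text does not state which variant is used. The function name and the sample counts are illustrative; the counts are chosen to match the 99% accuracy example in the text (1000 samples, 10 of them positive).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and balanced F-measure (F1, assumed)
    from the cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# A classifier that labels every sample as negative:
# 10 positives are all missed, 990 negatives are all correct.
all_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A hypothetical classifier that finds 8 of the 10 positives
# and raises 2 false alarms.
useful = classification_metrics(tp=8, fp=2, fn=2, tn=988)
```

The all-negative classifier reaches 99% accuracy while its precision, recall, and F-measure are all zero, which is why accuracy alone is misleading on unbalanced data.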

Fig. 8: The size of dataset among months. (a) F-measure and time (h) by month, N = 300; (b) F-measure and time (h) by the number of trees, M = 3

Fig. 9: The performance of user behavior detection on eBay in 0 (F-measure, Precision, Recall; Feb to Dec)

Fig. 10: Effect of the introduction of two attributes to the detection performance (F-measure with CTCG and without; Feb to Dec)

We chose random forest as the classification algorithm because it has inherent advantages on unbalanced classification data sets. In this part we compare it with other classification algorithms; Table 3 shows the performance comparison. The neural network is a BP network, and the SVM uses a polynomial kernel function. Compared with the other algorithms, random forest performs well. The C5.0 decision tree performed worst, mainly because of the number of samples and artificial quantization.

Table 3: The performance of comparing different algorithms
Method; Precision; Recall; F-measure
Random Forest
Neural Network
SVM
C5.0 Decision Tree

The detection model presented in this paper is mainly based on SNA and can detect platform-based fraud. To demonstrate the performance of the proposed fraud detection model, we compared it with other SNA detection algorithms, measuring the F-measure, precision, and recall of the methods of Wang & Chiu [5] and Lin et al. [6]. The method of Wang & Chiu [5] achieves high precision, mainly because it requires stringent sub-graphs within the larger transaction graph. However, fraudsters and their allies are distributed across the whole network, and such sub-graphs do not form easily on current fraudulent transaction platforms, so their method misses many fraudsters, which greatly reduces recall. For the model of

Lin et al. [6], class imbalance was not considered, which limits it severely on massive, imbalanced trading platforms. The proposed model detects not only traditional fraudulent groups but also the fraudulent groups on the currently popular fraudulent platforms. It also handles imbalanced classification, and on this basis we achieved good detection performance. We included the K-core attributes used by Wang & Chiu as well as a variant of the TrustRank value used by Lin et al.; in addition, we introduced BadRank [9] and proposed attributes of common transaction behavior for detecting platform-based fraud. We designed a parallel algorithm based on MapReduce to process the massive data. In summary, running time is the bottleneck of the model, but because feature extraction and model training can both be computed by a parallel algorithm, the running time can be reduced easily by adding machines.

Fig. 11: The performance compared to the related approaches (Recall, Precision, and F-measure of Wang, Lin, and our model; Sept to Dec)

5 Conclusions

This study proposed a series of cost-effective measures for improving the efficiency of fraud detection in e-commerce. By analyzing the behavior of fraudulent users on the trading platform, we proposed the Reverse Graph theory and the CTCG to extract features of common transactions. In the experiments, we designed MapReduce parallel algorithms to calculate the feature values based on these two theories. Building on the theory and practice of other scholars, we selected several important graph-level user features from SNA, including TrustRank, BadRank, and K-core values.
The TrustRank and BadRank values measure a user's credibility, while the K-core value is used to find closely connected sub-graphs in the transaction data. This paper also presents an efficient parallel crawler system for obtaining user and transaction data. The crawler is based on a distributed Java design with multi-threading. It uses the trade blacklist (fraudsters) published on the web site as its initial users and gives priority to level-by-level traversal when accessing other users' data, so that the system collects the most useful user data in the shortest time.

References

[1] eBay Inc. Reports Strong Third Quarter 2013 Results,
[2] Internet Crime Complaint Center Annual Report,

[3] S. Rubin, M. Christodorescu, An auctioning reputation system based on anomaly, Proceedings of the ACM Conference on Computer and Communications Security, Alexandria, 2005,
[4] J. S. Chang, W. H. Chang, An early fraud detection mechanism for online auctions based on phased modeling, Proceedings of the 2009 International Workshop on Mobile Systems E-Commerce and Agent Technology, Taipei, Taiwan, 2009,
[5] J. C. Wang, C. C. Chiu, Recommending trusted online auction sellers using social network analysis, Expert Systems with Applications, 34(3), 00, -79
[6] S. J. Lin, Y. Y. Jheng, C. H. Yu, Combining ranking concept and social network analysis to detect collusive groups in online auctions, Expert Systems with Applications, 39(0), 03,
[7] L. Page, S. Brin, The PageRank Citation Ranking: Bringing Order to the Web, Tech. Rep., Stanford University, 1998
[8] Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen, Combating web spam with TrustRank, Proceedings of the 30th International Conference on Very Large Databases (VLDB), 2004,
[9] J. Botelho, C. Antunes, Combining social network analysis with semi-supervised clustering: A case study on fraud detection, Proceedings of Mining Data Semantics (MDS 0) in Conjunction with SIGKDD, 0, -7
[10] Stephen B. Seidman, Network structure and minimum degree, Social Networks, 5(3), 1983, 9-7
[11] B. Bollobas, The Evolution of Sparse Graphs, Graph Theory and Combinatorics, Academic Press, London, 1984,
[12] Marco Gaertler, Maurizio Patrignani, Dynamic analysis of the autonomous system graph, International Workshop on Inter-Domain Performance and Simulation (IPS), 2004, 3-4
[13] Yanchao Zhang, Research on Information Dissemination and Opinion Evolution in the Social Networking Service, Beijing Jiaotong University, 0