The Use of Merging Algorithm to Real Ranking for Graph Search




A. Mohammad Reza Nami, B. Mehdi Ebadian
Faculty of Electrical, Computer, and IT Engineering, Islamic Azad University, Qazvin Branch, Qazvin, Iran

ABSTRACT
Ranking is becoming an important problem in many fields, especially in information retrieval. This paper presents an automatic technique for monitoring spam in the Web graph. The technique combines information from two different sources: truncated PageRank and semi-streaming graph algorithms. We further study the heuristic ranking framework and show how to measure the PageRank of a link farm. Twenty-six articles from 15 venues have been reviewed and classified within the taxonomy in order to organize and structure existing work in the field of information retrieval.

Keywords
Information retrieval (IR), PageRank (PR), streaming algorithms, Internet marketing, spam, search engine optimization (SEO).

Spam is any attempt to deceive a search engine's relevancy algorithm, or anything that "would not be done if search engines did not exist". What separates spam from SEO (search engine optimization) is therefore whether the attempt is ethical. The relationship between website administrators and search engine administrators is adversarial.

Stream graph algorithm: Suppose that we have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, about 10 edges per vertex) that is not distributed and is processed by a single thread only, and that we want to run breadth-first searches (BFS) on it. We expect the searches to be I/O-bound, so we need a disk page layout that suits BFS; disk space is not an issue. The searches can start at every vertex with equal probability. Intuitively, a good layout minimizes the number of edges between vertices on different disk pages, which is a graph partitioning problem. The graph itself looks like spaghetti: think of a random set of points randomly interconnected, with some bias towards shorter edges.
1 Introduction
Search engines have become one of the most lucrative businesses on the Internet. They mediate between the Web and the information seeker, ranking Web pages to produce a short list of high-quality results. A large share of site visits originates from search engines, and most users click only on the first few results. This creates an incentive to give pages a high score independently of their real merit.

SPAM: Each new communication medium creates an opportunity for sending unsolicited messages. Types of electronic spam include e-mail spam, instant-messaging spam (SPIM), Internet-telephony spam (SPIT), and spam by mobile phone, by fax, and so on. Because HTTP follows a request-response paradigm, spammers cannot push pages to users directly, so the goal of Web spam is to deceive search engines.

Figure 1. Link farm (link-based Web spam, also called topological spam).

Web spam techniques are classified into two groups: content (keyword) spam and link spam. Link spam changes the link structure of sites by creating link farms. A link farm is a set of densely connected pages built to deceive a ranking algorithm by boosting one member of the group. Our spam-detection algorithm targets pages that obtain most of their link-based ranking by participating in link farms but have little relationship with the rest of the graph. Some links are not necessarily spam, for example links obtained by buying advertising or by buying expired domains that were previously used for legitimate purposes. Topological spamming is spamming achieved through link farms.

Link-based and content-based analysis offer two orthogonal approaches. Weakness of link-based analysis: some spam pages are statistically close to non-spam pages. Threat to link-based analysis: hybrid spam structures. Opportunity for link-based analysis: link farms are expensive to build. Weakness of content-based analysis: it is less resilient to changes in spammer strategies. Threats to content-based analysis: hybrid spam structures, and the fact that copying an entire Web site and changing a few out-links is inexpensive. The two approaches should therefore be used together.

Figure 2. Web graph and supporter distribution. Distribution of the fraction of distinct supporters found at varying distances (normalized), obtained by backward breadth-first visits from a sample of nodes, in four large Web graphs. The number of new distinct supporters increases up to a certain distance and then decreases, because the graph is finite in size and we approach its effective diameter.

2 Algorithm Framework
Fetterly et al. [2004] hypothesized that studying the statistical distributions of page properties is a good way to detect spam pages: "in a number of these distribution, outlier values are web spam". Baeza-Yates et al. [2006] introduced damping functions for rank propagation. We want to explore the neighborhood of a page and determine whether the link structure around it is artificially generated or not. This poses two algorithmic challenges:
1. how to simultaneously compute statistics about the neighborhood of every page in a huge Web graph;
2. how to use these statistics to detect and demote Web spam.

2.1 Supporter
If there is a link from page x to page y, the author of page x is recommending page y. Page x is a supporter of page y at distance d if the shortest path from x to y formed by links in E has length d.

Figure 3. PageRank of different buckets. PageRank (PR) computed for pages in the eu.int subdomain, showing the different distributions for high-ranked and low-ranked sites.

A breadth-first search (BFS) from a sample of nodes can be used instead of computing the distribution for all nodes of a large Web graph. Advantage: it is inexpensive for a small sample. Disadvantage: it needs memory to mark the visited nodes, and repeating the BFS for every node takes on the order of $N^2$ time. Solution: compute supporters only for a subset of suspicious nodes. Constraint: we do not know a priori whether a node is suspicious of being spam or not.
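The supporter definition of Section 2.1 can be illustrated with a backward BFS from a single page. The following is a minimal sketch, not the authors' implementation; the function name `supporters_by_distance`, the edge-list input format, and the toy graph are assumptions made for illustration only.

```python
from collections import deque


def supporters_by_distance(edges, target, max_dist):
    """Count distinct supporters of `target` at each distance 1..max_dist.

    `edges` is an iterable of (x, y) pairs meaning "page x links to page y".
    A supporter at distance d is a page whose shortest path to `target` has
    length d, so we run a BFS that follows links backwards.
    """
    # Reverse adjacency list: for each page, the pages that link to it.
    predecessors = {}
    for x, y in edges:
        predecessors.setdefault(y, []).append(x)

    dist = {target: 0}               # also serves as the "visited" marks
    counts = [0] * (max_dist + 1)    # counts[d] = supporters at distance d
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue                 # do not expand beyond max_dist
        for pred in predecessors.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                counts[dist[pred]] += 1
                queue.append(pred)
    return counts[1:]


# Hypothetical toy graph: a and b link to t, c and d link to a, and so on.
toy_edges = [("a", "t"), ("b", "t"), ("c", "a"), ("d", "a"), ("d", "b"), ("e", "c")]
print(supporters_by_distance(toy_edges, "t", 3))
```

The `dist` dictionary is the per-node marking memory mentioned above, and calling the function once per page is what makes the naive approach quadratic.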

In the damping function used below, C is a normalization constant and $\alpha$ is the damping factor.

Algorithm 1: Link-analysis algorithm. A link-analysis algorithm in the semi-streaming model; the metric is a score vector that uses $O(N \log N)$ bits of memory.

For Web spam detection we use the PR algorithm instead of BFS: a BFS measures centrality through the tree rooted at one specific node and says nothing about the other nodes, whereas PR computes a score for all nodes in the graph at the same time.

2.2 TRUNCATED PAGERANK
Truncated PageRank is a link-based ranking function that reduces the importance of neighbors that are topologically close to the target node. Its damping function ignores the direct contribution of the first levels of links. Spam pages should be very sensitive to changes in the damping factor of the PR calculation.

Let $A_{N \times N}$ be the citation matrix of $G = (V, E)$:

$$a_{xy} = 1 \iff (x, y) \in E \qquad (1)$$

Let $P$ be the row-normalized citation matrix, in which all rows sum to one and rows of zeros are replaced by rows with every entry equal to $1/N$ to avoid rank sinks. Then

$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^{t}, \qquad \mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases} \qquad (2)$$
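As an illustration of the truncated damping function (zero for the first T levels, proportional to the damping factor raised to t afterwards), the following sketch accumulates the propagated rank over a finite number of steps. It is a rough approximation under stated assumptions, not the authors' code: the name `truncated_pagerank`, the default `alpha=0.85`, the finite `iterations` cut-off, and the choice of C so that the scores sum to one are all assumptions.

```python
def truncated_pagerank(adj, T, alpha=0.85, iterations=50):
    """Approximate truncated PageRank scores.

    `adj[x]` lists the nodes y that x links to; nodes are numbered 0..N-1.
    The infinite sum over t is cut off after `iterations` steps, and the
    normalization constant C is chosen implicitly so the scores sum to 1.
    """
    if T >= iterations:
        raise ValueError("T must be smaller than the number of iterations")
    n = len(adj)
    r = [1.0 / n] * n            # r = (1/N) * 1^T * P^t, starting at t = 0
    scores = [0.0] * n
    for t in range(iterations + 1):
        if t > T:                # damping(t) = 0 for t <= T
            weight = alpha ** t
            scores = [s + weight * v for s, v in zip(scores, r)]
        # One propagation step: r <- r P, with P row-normalized.
        nxt = [0.0] * n
        for x, outs in enumerate(adj):
            if outs:
                share = r[x] / len(outs)
                for y in outs:
                    nxt[y] += share
            else:                # row of zeros replaced by 1/N everywhere
                share = r[x] / n
                for y in range(n):
                    nxt[y] += share
        r = nxt
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical 4-node graph: pages 1, 2 and 3 all link to page 0.
toy_adj = [[1], [0], [0], [0, 1]]
print(truncated_pagerank(toy_adj, T=2))
```

Raising `T` removes the contribution of the closest link levels, which is the part of the score a link farm can inflate most cheaply.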

Algorithm 2. Bit-propagation algorithm for estimating the number of distinct supporters at distance d for all nodes.

Figure 4. Truncated PageRank for four truncation levels. Comparing PR and truncated PageRank (TPR) for T from 1 to 4, the two are closely correlated, and the correlation decreases as more levels are truncated.

2.3 ESTIMATION SUPPORTERS
We use probabilistic counting to estimate the number of supporters of all vertices in the graph at the same time.

Figure 6. Distances of supporters in 3 types. Comparison of the estimated average number of supporters against the observed value in a sample of nodes, assuming

$$\epsilon = 1/N \qquad (3)$$

Figure 5. Propagation of bits: 1 if a supporter is present, 0 if not.

Bit-propagation algorithm: if page y has a link to page x, then the bit vector of page x is updated as $x \leftarrow x \;\mathrm{OR}\; y$.
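The OR update can be sketched as follows. Each page starts with a random bit vector whose bits are set with probability eps, the vectors are OR-ed along links for d rounds, and the number of set bits is turned into a supporter estimate. This is a minimal sketch under assumptions: the names `bit_propagation` and `estimate_supporters`, the vector length `k`, and the simple estimator based on the fraction of set bits are illustrative choices, not taken from the paper.

```python
import math
import random


def random_bit_vector(rng, k, eps):
    """Return a k-bit integer in which each bit is 1 with probability eps."""
    bits = 0
    for i in range(k):
        if rng.random() < eps:
            bits |= 1 << i
    return bits


def bit_propagation(adj_in, d, k=64, eps=0.05, seed=0):
    """Propagate bit vectors for d rounds.

    `adj_in[x]` lists the pages y that link to x. After d rounds, the vector
    of x is the OR of the initial vectors of x and of every page that can
    reach x in at most d links.
    """
    rng = random.Random(seed)
    vec = [random_bit_vector(rng, k, eps) for _ in adj_in]
    for _ in range(d):
        new = list(vec)
        for x, preds in enumerate(adj_in):
            for y in preds:
                new[x] |= vec[y]      # x <- x OR y
        vec = new
    return vec


def estimate_supporters(bits, k, eps):
    """Estimate how many vectors were OR-ed together (assumed estimator).

    If n independent vectors are combined, a bit stays 0 with probability
    (1 - eps)^n, so n is roughly log(1 - ones/k) / log(1 - eps).
    """
    ones = bin(bits).count("1")
    if ones == k:
        return float("inf")           # saturated: eps is too large
    return math.log(1 - ones / k) / math.log(1 - eps)


# Hypothetical graph: pages 1 and 2 link to 0, page 3 links to 1 and 2.
toy_in = [[1, 2], [3], [3], []]
final = bit_propagation(toy_in, d=2)
print([round(estimate_supporters(v, 64, 0.05), 1) for v in final])
```

The estimate counts the page itself (distance 0) together with its supporters, and a saturated vector signals that eps should be reduced, which is the motivation for adapting eps between iterations.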

Table 1. Performance of the classifier in this article (TP = True Positive, FP = False Positive)

                              UK2012                    UK2013
Metrics                       TP      FP      F1        TP      FP      F1
Degree (D)                    0.733   0.016   0.807     0.324   0.023   0.431
D + Page Rank (P)             0.769   0.014   0.836     0.36    0.026   0.467
D + P + Trust Rank            0.785   0.013   0.847     0.54    0.038   0.596
D + P + Trunc. PR             0.782   0.016   0.844     0.356   0.021   0.474
D + P + Est. Supporters       0.801   0.008   0.868     0.467   0.038   0.549
All attributes                0.806   0.01    0.872     0.586   0.038   0.632

Estimation with adaptive bit propagation proceeds by dividing $\epsilon$ by two at each iteration b.

3 Classification
Precision: P = tp / (tp + fp) = #spam hosts classified as spam / #hosts classified as spam
Recall: R = tp / (tp + fn) = #spam hosts classified as spam / #spam hosts
False positive rate: Fp = #normal hosts classified as spam / #normal hosts
False negative rate: Fn = #spam hosts classified as normal / #spam hosts

Table 3. Performance using PageRank, supporters, and degree

             Experimental result                           Previous F-Measure
Dataset      True Positive   False Positive   F-Measure    from Table IV
UK           0.801           0.008            0.866        0.834
pages        0.795           0.014            0.853
hosts        0.778           0.011            0.849
UK           0.465           0.033            0.549        0.459
pages        0.402           0.03             0.497
hosts        0.468           0.03             0.555

Table 2. Criterion "F" (Web spam techniques classification)

                 Relevant (spam hosts)                    Nonrelevant (normal hosts)
Retrieved        tp: #spam hosts classified as spam       fp
Not retrieved    fn                                       tn: #normal hosts not classified as spam
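The measures of Section 3 can be computed directly from the four cells of the confusion matrix. The sketch below is illustrative only: the function name and the example counts are hypothetical, and the F-measure is taken to be the usual harmonic mean of precision and recall (F1), which the text does not spell out.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute spam-classification measures from confusion-matrix counts.

    tp: spam hosts classified as spam      fp: normal hosts classified as spam
    fn: spam hosts classified as normal    tn: normal hosts classified as normal
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall (assumed definition of "F").
    f_measure = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)
    false_negative_rate = fn / (fn + tp)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }


# Hypothetical counts, not taken from the tables above.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall and the false negative rate always sum to one, since both are fractions of the same set of spam hosts.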

Figure 7. Best iteration for finding a suitable distance.

4 Conclusions
The link-analysis technique used here, PageRank, assigns to every node in the Web graph a numerical score between 0 and 1, known as its PageRank. With the help of this paper, website owners and webmasters can decide which SEO practices are worthwhile and will give a good return on investment. Finally, the use of regularization methods that exploit the topology of the graph and the locality hypothesis [Davison 2000b] is promising, as it has been shown that those methods are useful for general Web classification tasks [Zhang et al. 2006; Angelova and Weikum 2006; Qi and Davison 2006] and that they can be used to improve the accuracy of Web spam detection systems [Castillo et al. 2007].
