The Use of Merging Algorithm to Real Ranking for Graph Search


A. Mohammad Reza Nami, B. Mehdi Ebadian
Faculty of Electrical, Computer, and IT Engineering, Islamic Azad University - Qazvin Branch, Qazvin, IRAN

ABSTRACT
Ranking is becoming an important problem in many fields, especially in information retrieval. This paper presents an automatic technique for monitoring spam in the Web graph. The technique combines information from two different sources: truncated PageRank and semi-streaming graph algorithms. We conduct a further study of the heuristic ranking framework and measure the PageRank of link farms. Twenty-six articles from 15 venues have been reviewed and classified within the taxonomy in order to organize and structure existing work in the field of information retrieval.

Keywords
Information retrieval (IR), PageRank (PR), streaming algorithms, Internet marketing, spam, search engine optimization.

Spam is any attempt to deceive a search engine's relevancy algorithm, or any action that "would not be done if search engines did not exist". What separates spam from SEO (search engine optimization) is therefore whether the attempt is ethical. The relationship between website administrators and search engine administrators is adversarial.

Stream graph algorithm: Suppose that we have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, with about 10 edges per vertex) that is non-distributed and processed by a single thread only, and that we want to run breadth-first searches on it. We expect the searches to be I/O-bound, so we need a disk page layout that is good for BFS; disk space is not an issue. A search can start at any vertex with equal probability. Intuitively, a good layout minimizes the number of edges between vertices on different disk pages, which is a graph partitioning problem. The graph itself looks like spaghetti: think of a random set of points, randomly interconnected, with some bias towards shorter edges.
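The layout objective described above can be made concrete with a small sketch. It assumes a hypothetical layout in which vertices are written to disk in a given order and packed into fixed-size pages; the function names `pages_by_order` and `count_cut_edges` are illustrative and do not come from the paper.

```python
def pages_by_order(order, page_size):
    """Assign each vertex to a disk page by packing `order` into pages.

    Hypothetical layout: the i-th vertex in `order` goes to page i // page_size.
    """
    return {vertex: index // page_size for index, vertex in enumerate(order)}


def count_cut_edges(edges, page_of):
    """Count edges whose endpoints are stored on different disk pages."""
    return sum(1 for u, v in edges if page_of[u] != page_of[v])


# A path graph 0 - 1 - 2 - 3 stored with two vertices per page.
edges = [(0, 1), (1, 2), (2, 3)]
good_layout = pages_by_order([0, 1, 2, 3], page_size=2)
bad_layout = pages_by_order([0, 2, 1, 3], page_size=2)
print(count_cut_edges(edges, good_layout))
print(count_cut_edges(edges, bad_layout))
```

A lower cut count means a breadth-first visit is less likely to leave the page it is currently reading, which is the quantity a partitioning heuristic would try to minimize.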
1 Introduction
Search engines have become among the most lucrative services on the Internet. They mediate between the Web and information seekers, ranking Web pages to create a short list of high-quality results. A large share of visits originates from search engines, and most users click only on the first few results. This creates an incentive to build pages that score highly independently of their real merit.

SPAM: Each new communication medium creates an opportunity for sending unsolicited messages. Types of electronic spam include e-mail spam, instant messaging spam (SPIM), Internet telephony spam (SPIT), spamming by mobile phone, by fax, and so on. Under the request-response paradigm of HTTP, pages cannot be pushed to users, so the goal of Web spam is to deceive search engines.

Figure 1. Link farm (link-based Web spam, or topological spam).

Web spam techniques are classified into two groups: content (keyword) spam and link spam. Link spam changes the structure of sites by creating link farms. A link farm is a set of densely connected pages created to deceive a ranking algorithm by boosting one member of the group. The targets of our spam-detection algorithm are pages that receive most of their link-based ranking by participating in link farms but have little relationship with the rest of the graph. Some links are not spam in this sense, such as those obtained by buying advertising or by buying expired domains that were used for legitimate purposes. Topological spamming is spamming achieved by using link farms.

Link-based and content-based analysis offer two orthogonal approaches:
Weakness of link-based analysis: some spam pages are statistically close to non-spam pages.
Threat to link-based analysis: hybrid spam structures.
Opportunity for link-based analysis: link farms are expensive.
Weakness of content-based analysis: it is less resilient to changes in spammer strategies.
Threat to content-based analysis: hybrid spam structures; copying an entire Web site (changing a few out-links) is inexpensive.
The two approaches should therefore be used together.

Figure 2. Web graph and supporter distribution. Distribution of the fraction of distinct supporters found at varying distances (normalized), obtained by backward breadth-first visits from a sample of nodes, in four large Web graphs. The number of new distinct supporters increases up to a certain distance and then decreases, because the graph is finite in size and we approach its effective diameter.

2 Algorithm Framework
Fetterly et al. [2004] hypothesized that studying statistical distributions of page properties is a good way to detect spam pages: "in a number of these distributions, outlier values are web spam". Baeza-Yates et al. [2006] introduced damping functions for rank propagation. We want to explore the neighborhood of a page and determine whether its link structure is artificially generated. This poses two algorithmic challenges:
1. how to simultaneously compute statistics about the neighborhood of every page in a huge Web graph;
2. how to use them to detect and demote Web spam.

2.1 Supporter
If there is a link from page x to page y, the author of page x is recommending page y. Page x is a supporter of page y at distance d if the shortest path from x to y formed by links in E has length d.

Figure 3. PageRank in different buckets. PageRank (PR) of pages in the eu.int subdomain, computed to show the different distributions in high-ranked and low-ranked sites.

Breadth-first search (BFS) can be run from a sample of nodes of a large Web graph instead of computing the distribution for all nodes.
Advantage: inexpensive.
Disadvantage: memory is needed to mark each visited node, and repeating the BFS from every node takes time on the order of N^2.
Solution: compute supporters only for a subset of suspicious nodes.
Constraint: we do not know a priori whether a node is suspicious of being spam.
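The definition of a supporter at distance d can be illustrated with a backward breadth-first visit from a single target page. This is a minimal sketch, assuming the graph is given as a list of directed `(x, y)` links; the function name `supporters_by_distance` is hypothetical.

```python
from collections import deque


def supporters_by_distance(edges, target, max_distance):
    """Count distinct supporters of `target` at distances 1..max_distance.

    `edges` holds directed links (x, y), meaning page x links to page y.
    The visit follows links backwards, so the first time a page is reached
    gives the length of its shortest path to `target`.
    """
    incoming = {}
    for x, y in edges:
        incoming.setdefault(y, []).append(x)

    distance = {target: 0}
    counts = [0] * (max_distance + 1)
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the requested radius
        for supporter in incoming.get(node, []):
            if supporter not in distance:
                distance[supporter] = distance[node] + 1
                counts[distance[supporter]] += 1
                queue.append(supporter)
    return counts[1:]


edges = [(1, 0), (2, 0), (3, 1), (3, 2), (4, 3), (0, 4)]
print(supporters_by_distance(edges, target=0, max_distance=3))
```

Each call marks the nodes it visits, so repeating it for every page is what makes the exact approach costly on a full Web graph.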

Algorithm 1: Link-analysis algorithm
A link-analysis algorithm in the semi-streaming model; the metric is a score vector that uses O(N log N) bits of memory.

PageRank (PR) is used instead of BFS for Web spam detection: when measuring the centrality of nodes, BFS yields the tree of one specific node rather than of all nodes, whereas PR computes a score for all nodes in the graph at the same time.

2.2 TRUNCATED PAGERANK
Truncated PageRank is a link-based ranking function that reduces the importance of neighbors that are topologically close to the target node. Its damping function ignores the direct contribution of the first levels of links. Spam pages should be very sensitive to changes in the damping factor of the PR calculation.

Let $A_{N \times N}$ be the citation matrix of $G = (V, E)$:

$a_{xy} = 1 \iff (x, y) \in E$ (1)

Let $P$ be the row-normalized citation matrix, in which all rows sum to one and rows of zeros are replaced by $1/N$ to avoid rank sinks. Then

$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^{t}, \qquad \mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$ (2)

where $C$ is a normalization constant and $\alpha$ is the damping factor.

Algorithm 2
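A damping function that is zero for the first T levels and decays geometrically afterwards can be evaluated by propagating a rank vector one level at a time and accumulating it only once t exceeds T. The sketch below is illustrative rather than the paper's algorithm: it assumes C = 1 - alpha (the constant used by ordinary PageRank), truncates the infinite sum after a fixed number of iterations, and uses the hypothetical name `truncated_pagerank`.

```python
def truncated_pagerank(edges, n, alpha=0.85, T=2, iterations=100):
    """Accumulate C * alpha^t / N * (1^T P^t) for T < t <= iterations.

    `edges` holds directed links (x, y) between nodes 0..n-1.
    Assumption: C = 1 - alpha, as in ordinary PageRank.
    """
    out_links = [[] for _ in range(n)]
    for x, y in edges:
        out_links[x].append(y)

    c = 1.0 - alpha
    rank = [c / n] * n      # term for t = 0
    score = [0.0] * n
    for t in range(iterations + 1):
        if t > T:           # the first T levels contribute nothing
            for i in range(n):
                score[i] += rank[i]
        # Propagate one more level: rank <- alpha * rank * P.
        nxt = [0.0] * n
        dangling = 0.0
        for x in range(n):
            if out_links[x]:
                share = alpha * rank[x] / len(out_links[x])
                for y in out_links[x]:
                    nxt[y] += share
            else:
                # A row of zeros is replaced by 1/N in every column.
                dangling += alpha * rank[x] / n
        rank = [value + dangling for value in nxt]
    return score


edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
print(truncated_pagerank(edges, n=4, T=2))
```

Setting `T=-1` keeps every term and gives ordinary PageRank under the same assumptions, which makes it easy to compare the two scores for a page suspected of relying on nearby supporters.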

Bit propagation algorithm for estimating the number of distinct supporters at distance d of all nodes.

Figure 4. PageRank truncated four times. Comparing PR and truncated PageRank (TPR) for truncation values from 1 to 4, the two are closely correlated, and the correlation decreases as more levels are truncated.

2.3 ESTIMATION OF SUPPORTERS
Probabilistic counting is used to estimate the number of supporters for all vertices in the graph at the same time.

Figure 6. Distances of supporters in 3 types. Comparison of the estimated average number of supporters against the observed value in a sample of nodes, assuming ε = 1/N (3)

Figure 5. Propagation of bits: 1 for having a supporter and 0 for not. Bit propagation algorithm: if page y has a link to page x, then the bit vector of page x is updated as x ← x OR y.
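The OR-based update can be sketched as follows. Each page starts with a k-bit vector whose bits are set independently with probability ε, and the vectors are propagated along links for d rounds. The estimator in `estimate_supporters` is an assumption not stated in the text: it inverts the probability 1 - (1 - ε)^s that a bit is set after OR-ing s independent vectors. All function names are hypothetical.

```python
import math
import random


def random_vectors(n, k, eps, seed=0):
    """Give each of n pages a k-bit vector; each bit is 1 with probability eps."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        vector = 0
        for bit in range(k):
            if rng.random() < eps:
                vector |= 1 << bit
        vectors.append(vector)
    return vectors


def propagate(vectors, edges, rounds):
    """Apply x <- x OR y for every link (y, x), synchronously, `rounds` times."""
    current = list(vectors)
    for _ in range(rounds):
        updated = list(current)
        for y, x in edges:  # page y links to page x, so y supports x
            updated[x] |= current[y]
        current = updated
    return current


def estimate_supporters(vector, k, eps):
    """Assumed estimator: solve ones / k = 1 - (1 - eps)^s for s.

    The page's own bits are part of the vector, so the count includes itself.
    """
    ones = bin(vector).count("1")
    if ones == k:
        return math.inf  # saturated vector: eps was too large for this page
    return math.log(1 - ones / k) / math.log(1 - eps)


edges = [(0, 1), (1, 2), (3, 2)]
vectors = propagate(random_vectors(4, k=64, eps=0.1), edges, rounds=2)
print([estimate_supporters(v, 64, 0.1) for v in vectors])
```

A saturated vector carries no information about the count, which is one reason to repeat the pass with a smaller ε for pages that have many supporters.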

Table 1. Performance of this article's classifier
Columns: UK2012 (True Positive, False Positive, F1); UK2013 (True Positive, False Positive, F1)
Rows (metrics): Degree (D); D + PageRank (P); D + P + TrustRank; D + P + Truncated PR; D + P + Estimated Supporters; All attributes

Estimation can also be done with adaptive bit propagation, dividing ε by two at each iteration.

3 Classification
Precision: P = tp/(tp + fp) = #spam hosts classified as spam / #hosts classified as spam
Recall: R = tp/(tp + fn) = #spam hosts classified as spam / #spam hosts
False positive rate: Fp = #normal hosts classified as spam / #normal hosts
False negative rate: Fn = #spam hosts classified as normal / #spam hosts

Table 2. Criterion "F" (Web spam techniques classification)
                             Retrieved                              Not retrieved
Relevant (spam hosts)        tp (#spam hosts classified as spam)    fn
Nonrelevant (normal hosts)   fp                                     tn (#normal hosts not classified as spam)

Table 3. Performance using PageRank, supporters, and degree
Columns: Dataset; Experimental Result (True Positive, False Positive, F-Measure); Previous F-Measure from Table IV
Rows (datasets): UK (pages, hosts); UK (pages, hosts)
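The classification measures can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name `spam_metrics` and the example counts are made up, the F-measure is taken to be the usual harmonic mean of precision and recall, and the counts are assumed to be large enough that no denominator is zero.

```python
def spam_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure and error rates from counts.

    tp: spam hosts classified as spam     fp: normal hosts classified as spam
    fn: spam hosts classified as normal   tn: normal hosts classified as normal
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall (the usual F1 definition).
        "f_measure": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Made-up counts for illustration only.
print(spam_metrics(tp=8, fp=2, fn=2, tn=88))
```

With these definitions the false negative rate is always one minus the recall, so reporting both is redundant but can make tables easier to read.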

Figure 7. Best iteration to find a suitable distance.

4 Conclusions
The technique used for link analysis, PageRank, assigns to every node in the Web graph a numerical score between 0 and 1, known as its PageRank. With the help of this paper, website owners and webmasters can decide which SEO practices are worthwhile and will give a good return on investment. Finally, the use of regularization methods that exploit the topology of the graph and the locality hypothesis [Davison 2000b] is promising, as it has been shown that those methods are useful for general Web classification tasks [Zhang et al. 2006; Angelova and Weikum 2006; Qi and Davison 2006] and that they can be used to improve the accuracy of Web spam detection systems [Castillo et al. 2007].

REFERENCES
[1] Alexa Inc., last accessed on May 17, 2011.
[2] Antoniol, G. and Guéhéneuc, Y. G., "Feature Identification: An Epidemiological Metaphor", IEEE Transactions on Software Engineering, vol. 32, no. 9, 2006.
[3] Binkley D, Gold G, Harman M, Li Z, Mahdavi K (2008) An empirical study of the relationship between the concepts expressed in source code and dependence. J Syst Software 81.
[4] Cornelissen B, Zaidman A, van Deursen A, Moonen L, Koschke R (2009) A systematic survey of program comprehension through dynamic analysis. IEEE Trans Software Eng (TSE) 35(5).
[5] De Alwis B, Murphy GC (2008) Answering conceptual queries with Ferret. 30th International Conference on Software Engineering (ICSE 08), Leipzig, Germany.
[6] De Lucia, A., Fasano, F., Oliveto, R., and Tortora, G., "Recovering Traceability Links in Software Artefact Management Systems", ACM Transactions on Software Engineering and Methodology.
[7] Egyed, A., Binder, G., and Grunbacher, P., "STRADA: A Tool for Scenario-Based Feature-to-Code Trace Detection and Analysis", in Proc. of IEEE/ACM 29th International Conference on Software Engineering (ICSE'07), 2007.
[8] Eaddy M, Aho AV, Antoniol G, Guéhéneuc YG (2008a) CERBERUS: tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. 16th IEEE International Conference on Program Comprehension (ICPC 08), Amsterdam, The Netherlands.
[9] Eaddy M, Zimmermann T, Sherwood K, Garg V, Murphy G, Nagappan N, Aho AV (2008b) Do crosscutting concerns cause defects? IEEE Trans Software Eng 34(4).
[10] Gay G, Haiduc S, Marcus M, Menzies T (2009) On the use of relevance feedback in IR-based concept location. 25th IEEE International Conference on Software Maintenance (ICSM 09), Edmonton, Canada.
[11] Grant S, Cordy JR, Skillicorn DB (2008) Automated concept location using independent component analysis. 15th Working Conference on Reverse Engineering (WCRE 08), Antwerp, Belgium.
[12] Hayes, J. H., Dekhtyar, A., and Sundaram, S. K., "Advancing candidate link generation for requirements tracing: the study of methods", IEEE Transactions on Software Engineering, vol. 32, no. 1, January.
[13] Hill E, Pollock L, Vijay-Shanker K (2009) Automatically capturing source code context of NL-queries for software maintenance and reuse. 31st IEEE/ACM International Conference on Software Engineering (ICSE 09), Vancouver, British Columbia, Canada.
[14] Kothari, J., Denton, T., Mancoridis, S., and Shokoufandeh, A., "On Computing the Canonical Features of Software Systems", in 13th IEEE Working Conference on Reverse Engineering (WCRE'06), Benevento, Italy.
[15] Kuhn, A., Ducasse, S., and Gîrba, T., "Semantic Clustering: Identifying Topics in Source Code", Information and Software Technology, vol. 49, no. 3, March 2006.
[16] Lawrance J, Bellamy R, Burnett M (2007) Scents in programs: does information foraging theory apply to program maintenance? IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC 07), IEEE.
[17] Liu D, Marcus A, Poshyvanyk D, Rajlich V (2007) Feature location via information retrieval based filtering of a single scenario execution trace. 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 07), Atlanta, Georgia.
[18] Li Z (2009) Identifying high-level dependence structures using slice-based dependence analysis. King's College London, University of London. Ph.D
[19] Lukins S, Kraft N, Etzkorn L (2008) Source code retrieval for bug location using latent dirichlet allocation. 15th Working Conference on Reverse Engineering (WCRE 08), Antwerp, Belgium.
[20] Poshyvanyk, D., Guéhéneuc, G. Y., Marcus, A., Antoniol, G., and Rajlich, V., "Feature Location using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval", IEEE Transactions on Software Engineering, vol. 33, no. 6, June 2007.
[21] Rajlich, V., "Changing the Paradigm of Software Engineering", in Communications of ACM, August, 2006.
[22] Salah, M., Mancoridis, S., Antoniol, G., and Di Penta, M., "Scenario-driven dynamic analysis for comprehending large software systems", in Proc. of 10th European Conference on Software Maintenance and Reengineering (CSMR'06).
[23] Shepherd, D., Fry, Z., Gibson, E., Pollock, L., and Vijay-Shanker, K., "Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns", in Proc. of International Conference on Aspect Oriented Software Development (AOSD'07), 2007.
[24] Simmons, S., Edwards, D., Wilde, N., Homan, J., and Groble, M., "Industrial tools for the feature location problem: an exploratory study", Journal of Software Maintenance: Research and Practice, vol. 18, no. 6, 2006.
[25] WordStreamTools, on May 10, 2011.
[26] Zhao, W., Zhang, L., Liu, Y., Sun, J., and Yang, F., "SNIAFL: Towards a Static Non-interactive Approach to Feature Location", ACM Transactions on Software Engineering and Methodologies, vol. 15, no. 2, 2006.