IDENTIFICATION OF THE DYNAMICS OF THE GOOGLE'S RANKING ALGORITHM. A. Khaki Sedigh, Mehdi Roudaki



Control Division, Department of Electrical Engineering, K.N. Toosi University of Technology, P. O. Box: 16315-1355, Tehran, Iran
sedigh@eetd.kntu.ac.ir, roudaki@iranseo.com

Abstract: Among the search engines, Google is one of the most powerful. It uses an accurate ranking algorithm to order web pages in search results. In this paper, it is shown that a simple linear model can approximately model the dynamics governing the behaviour of Google. Least Squares is used for the system identification procedure. Identification results are provided to show the effectiveness of the identified system. Copyright © 2003 IFAC

Keywords: Google Search Engine, Ranking Algorithm, System Identification, Least Squares.

1. INTRODUCTION

Search engines are tools that help web users find the information or web sites they are looking for. Today most people start at a search engine on the web. The popularity of a search engine depends on the accuracy of its search results: the more accurate the results are, the more popular the search engine is. Search engines have become increasingly important on the web, and their ranking procedure is a fundamental characteristic that matters to web sites. E-commerce is much attuned to the ranking issue, because higher ranking translates directly to more sales. A top ranking in a major search engine often will generate more targeted traffic than an expensive banner advertising campaign. Plus, a good search engine position is free: anyone can do it (Marchini, 1999).

Today, Google is one of the most important search engines. It is one of the most important places to be well listed. The resulting traffic can be staggering (Jimworld, 2002). Google uses an accurate ranking algorithm to order web pages in its search results. Its ranking algorithm is unknown and is not available to the public. In order to compare the web pages and order them according to their relevance to the searched term, Google considers many parameters for web pages, which are also unknown to Google users. In this paper, the dynamics of the ranking algorithm of Google is considered as a black box.
Its inputs are the parameters collected from the web pages according to the searched term, and its outputs are the ranks of those web pages in Google's search results. Simulation results are provided in the paper to show the capability of the identified model to predict Google's behaviour in its ranking algorithm.

2. LEAST SQUARES SYSTEM IDENTIFICATION

Least Squares is a well-known and established technique for the identification of the dynamical behaviour of linear dynamical plants. It is assumed that the computed variable y is given as (Astrom and Wittenmark, 1995):

$$ y = \theta_1 \varphi_1(x) + \theta_2 \varphi_2(x) + \dots + \theta_n \varphi_n(x) \tag{1} $$

where $\varphi_1, \varphi_2, \dots, \varphi_n$ are known parameters and $\theta_1, \theta_2, \dots, \theta_n$ are the unknown parameters. Observation pairs $\{(x_i, y_i),\ i = 1, 2, \dots, N\}$ are obtained from the experiments performed, and the parameters are determined to minimize the following cost function:

$$ J(\theta) = \frac{1}{2} \sum_{i=1}^{N} \varepsilon_i^2 \tag{2} $$

where

$$ \varepsilon_i = y_i - \hat{y}_i \tag{3} $$

and $\hat{y}_i$ is the value computed from equation (1) at $x_i$. In a compact form, the solution to the least squares problem is given by

$$ \hat{\theta} = (\Phi^T \Phi)^{-1} \Phi^T y \tag{4} $$

where

$$ \Phi = \begin{bmatrix} \varphi^T(x_1) \\ \varphi^T(x_2) \\ \vdots \\ \varphi^T(x_N) \end{bmatrix} \tag{5} $$

is assumed full rank and also,

$$ \varphi^T(x_i) = [\varphi_1(x_i)\ \ \varphi_2(x_i)\ \dots\ \varphi_n(x_i)] \tag{6} $$

$$ y = [y_1\ \ y_2\ \dots\ y_N]^T \tag{7} $$

3. IDENTIFICATION FRAMEWORK

Google's search results are not fixed and the ranks of web pages always change. Changes in search results are caused by factors such as the indexing of new web pages by Google, the efforts of webmasters to optimize their web sites for higher rankings, the dynamic property of Google's algorithm and probably a few bugs in Google's ranking algorithm. In this paper, terms are searched in different time cycles, then the stored search results are compared in order to select web pages whose ranks mostly vary between 1 and 100. To guarantee this condition, 87 is the worst considered rank. This kind of data collection is meant to capture the steady state of the ranks, not the transient ranks of some web pages. The next step is to select the number of observations. By observing the sum square error given by equation (2) for many different cases, the number of the web pages was finally chosen as 6. In selecting the web pages for the identification process, the following points are considered:
- All search result web pages are ignored if their URLs (uniform resource locators) have changed.
- The software used can calculate the parameters of the web pages.
- Most of the parameters of some pages are zero.
- For some web pages, none of the parameters is zero.
- Some sequential web pages have the same PageRank (this seems to be essential because it helps to clarify the influence of the other parameters; PageRank is Google's criterion for measuring the importance of web pages).
- The number of data sets (N) is bigger than the number of unknown parameters (n).
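The estimate of equation (4), together with the requirement that N exceeds the number of unknown parameters, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`least_squares`, `solve_linear_system`, `predict`) and the toy data are assumptions made for illustration, and the normal equations are solved by plain Gaussian elimination rather than by any particular numerical library.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve_linear_system(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal matrix is singular: regressors not full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares(phi: Matrix, y: Vector) -> Vector:
    """Return the estimate of equation (4) for regressor rows phi and outputs y."""
    n_obs, n_par = len(phi), len(phi[0])
    if n_obs <= n_par:
        raise ValueError("need more observations (N) than parameters (n)")
    # Normal equations: (Phi^T Phi) theta = Phi^T y.
    gram = [[sum(phi[k][i] * phi[k][j] for k in range(n_obs))
             for j in range(n_par)] for i in range(n_par)]
    rhs = [sum(phi[k][i] * y[k] for k in range(n_obs)) for i in range(n_par)]
    return solve_linear_system(gram, rhs)


def predict(phi_row: Vector, theta: Vector) -> float:
    """Evaluate the linear model of equation (1) for one regressor row."""
    return sum(p * t for p, t in zip(phi_row, theta))


# Made-up toy data (not from the paper): 4 observations, 2 regressors,
# generated without noise from theta = [2, 3].
PHI = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
Y = [2.0, 3.0, 5.0, 7.0]
THETA = least_squares(PHI, Y)
```

In this toy case the data are noise-free, so `THETA` recovers the generating parameters and `predict` reproduces the outputs. With real rank data the fit would only be approximate, and the residuals of equation (3) would be nonzero.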
3.1 Selection of the Input Parameters

For identification purposes, an important step is the selection of the number of input parameters. Many resources state that Google uses more than 100 parameters. Wei (2002) points out that the main factors in the Google search engine ranking algorithm are:
- Anchor text (the text of a link or hyperlink) from Yahoo and Dmoz
- Anchor text
- PageRank
- Keyword proximity (the placement of words on a web page in relation to each other)
and finally states that anchor text is the most important factor in Google's ranking algorithm. Generally there are many reports about Google's parameters, but so far there is no confirmed data regarding them. Therefore, in this paper, based on the experience of the authors, these input parameters are considered:
- PageRank
- Keyword frequency (how often a word appears in a web page) in the web page's visible text.
- Keyword density (the number of words appearing on a web page compared to the total number of words appearing on that web page) in the web page's title.
- Keyword density in the web page's text.
- Keyword density in the web page's linked text.

- Keyword density in the web page's ALT tags (the ALT tag is a label describing an image).
- Keyword prominence (how early in a web page's text a word appears) in the web page's text.
And these places are considered in Similar Pages (when a user selects the Similar Pages link for a particular result, Google automatically scouts the web for pages that are related to this result):
- The title of Similar Pages.
- Text below the title (this text is an excerpt from the returned result page with the query terms bolded) of Similar Pages.
- The category of Similar Pages in ODP (Open Directory Project).
- The URL of Similar Pages.
To calculate the parameters which are related to the Similar Pages, a binary form is considered:
- If the searched term is in it, it is considered as 1.
- If the searched term is not in it, it is considered as 0.
This kind of valuation method is shown in table 1.

Table 1 A binary form is considered to calculate the parameters which are related to the Similar Pages. It is assumed that the searched term is search engine optimization. "Q" means the searched term is available exactly. Q means at least one word of the searched term is available. NC is the abbreviation of Not Considered.

Case | "Q" | Q
www.searchenginewatch.com | NC | 1
www.pandia.com/searchworld/ | NC | 1
www.toka.com/optimization/ | NC | 1
www.rankwrite.com | NC | 0
THE TITLE AND THE TEXT BELOW THE TITLE: search engine optimization | 1 | 1
THE TITLE AND THE TEXT BELOW THE TITLE: search engine placement and optimization | 0 | 1
THE TITLE AND THE TEXT BELOW THE TITLE: optimization | 0 | 1
CATEGORY: Computers>Internet> >Search Engine Optimization Firms>S | NC | 1
CATEGORY: Computers>Internet> >Promotion>News | NC | 0
CATEGORY: Computers>Internet> >Free Search Engine Submitting | NC | 1

After collecting and normalizing the input data, equation (4) is solved and the ranks are calculated by equation (1). Fig. 1 shows the observed ranks, the calculated ranks are shown in Fig. 2, the observed and calculated ranks are compared in Fig. 3 and Fig. 4 shows the errors in the estimation of ranks.

Fig. 1. Observed ranks.
Fig. 2. Calculated ranks.
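The keyword measures and the binary valuation of the Similar Pages fields described above can be sketched as follows. This is only an illustration under stated assumptions, not the software the authors used: the function names are hypothetical, a multi-word searched term is counted as a whole phrase for frequency and density, and the "at least one word" flag uses substring matching so that URLs such as www.searchenginewatch.com are scored as in Table 1.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_frequency(text: str, term: str) -> int:
    """Count non-overlapping occurrences of the whole term as a word sequence.

    Assumption: the paper defines frequency per word; here a multi-word
    term is treated as one phrase.
    """
    words, key = tokenize(text), tokenize(term)
    if not key:
        return 0
    count, i = 0, 0
    while i + len(key) <= len(words):
        if words[i:i + len(key)] == key:
            count += 1
            i += len(key)
        else:
            i += 1
    return count


def keyword_density(text: str, term: str) -> float:
    """Fraction of the words in the text that belong to occurrences of the term."""
    words, key = tokenize(text), tokenize(term)
    if not words:
        return 0.0
    return keyword_frequency(text, term) * len(key) / len(words)


def exact_flag(field: str, term: str) -> int:
    """1 if the searched term appears exactly in the field, else 0."""
    return 1 if keyword_frequency(field, term) > 0 else 0


def any_word_flag(field: str, term: str) -> int:
    """1 if at least one word of the term occurs in the field, else 0.

    Substring matching is an assumption chosen so that words fused inside
    a URL (for example "search" in "searchenginewatch") still count.
    """
    lowered = field.lower()
    return 1 if any(word in lowered for word in tokenize(term)) else 0


TERM = "search engine optimization"
```

For example, with `TERM` as above, `any_word_flag("www.rankwrite.com", TERM)` gives 0 and `exact_flag("search engine placement and optimization", TERM)` gives 0, which agrees with the corresponding rows of Table 1.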

Fig. 3. Observed and calculated ranks are compared.
Fig. 4. Errors in the estimation of the ranks.

3.2 Analysis of the Identified System

To test the accuracy of the identified system, the parameters of 5 web pages are calculated and then their ranks are estimated. These 5 web pages are not used in the identification calculations. Table 2 shows the result of this analysis.

Table 2 Analysis of the accuracy of the identified system

Ranks in Google's search results | Calculated ranks | Difference
1-34 35 35 51 16 37 3 5 70 65 5 8 53 9 90 65 5

The ranks of the web pages which are used in the system identification are as follows:

$$ [\ldots] \le \text{rank} \le 87 \tag{8} $$

As shown in table 2, the identified system can predict the ranks which are between [...] and 87 with an acceptable error, but cannot predict the ranks which are out of this region.

4. CONCLUSION

Least squares identification is employed to identify the dynamic behaviour governing the ranking algorithm of the Google search engine. It is shown that an accurate identification of Google's ranking algorithm is not possible, because Google does not show all of the web pages which are related to the searched term and its parameters are not known. However, an approximate linear model is found to describe the ranking dynamics. It is shown that Google's dynamics can be partially modelled using a linear model. Identification results using real data are presented to show the main points of the paper.

REFERENCES

Astrom, K.J. and B. Wittenmark (1995). Adaptive Control, 41-89, Addison Wesley, New York.
Jimworld. (2002). Search engine forums at Jimworld, http://www.searchengineforums.com.
Marchini, F. (1999). Secrets to achieving a top 10 position, First Place Software Inc.
Wei, P. (2002). Google search engine ranking algorithm analysis, Pwqsoft Inc., http://www.pwqsoft.com/search-engineranking.htm.