Web Spam, Propaganda and Trust
Panagiotis T. Metaxas
Wellesley College
Wellesley, MA 02481, USA

Joseph DeStefano
College of the Holy Cross
Worcester, MA 01610, USA

ABSTRACT

Web spamming, the practice of introducing artificial text and links into web pages to affect the results of searches, has been recognized as a major problem for search engines. It is also a serious problem for users because they are not aware of it and they tend to confuse trusting the search engine with trusting the results of a search [16]. The parallels between web spamming on the internet and propaganda in the real world suggest that we can use anti-propaganda techniques to educate users and develop tools to help them evaluate the reliability of the information they find online. In this paper, we first analyze the effects that web spam has on the evolution of the search engines and their relationship to propagandistic techniques in society. Then, we examine the neighborhoods of untrustworthy sites, finding that a dense biconnected component (BCC) containing the site provides a reasonable trust neighborhood that has parallels in social network theory. The fact that spammers employ propagandistic techniques enables us to design a heuristic that follows anti-propagandistic practices in order to recognize a spamming network. In society, recognition of an untrustworthy message (in the opinion of a particular person or other social entity) is a reason for questioning the entities that recommend the message. Entities that are found to strongly support more untrustworthy messages become untrustworthy themselves. So, social distrust is propagated backwards for a number of steps. Our heuristic simulates this behavior on the trust neighborhood of a spammer. In our experiments, we examined trust neighborhoods of web sites, both trustworthy and not.
Our findings suggest that spamming networks can be reliably recognized from their relationship to a single untrustworthy starting point by backward propagation of distrust. Further, nodes involved in a spamming network can be divided into two groups: those that have content similar to the starting site (a.k.a. "link farms"), and those that have dissimilar content (a.k.a. "mutual admiration societies"). Our tool explores thousands of nodes within minutes and could be deployed at the browser level. This makes it possible to resolve, in favor of the end user, the moral question of who should be making the decision to weed out spammers.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.m [Information Storage and Retrieval]: Miscellaneous

General Terms
Algorithms, Experimentation, Social Networks, Propaganda, Trust

Keywords
search, Web graph, link structure, PageRank, HITS, Web spam

Copyright is held by the author/owner(s). WWW2005, May 10-14, 2005, Chiba, Japan.

1. INTRODUCTION

The web has changed the way we inform and get informed. Every organization has a web site, and people are increasingly comfortable accessing it for information on any question they may have. The exploding size of the web necessitated the development of search engines and web directories. Most people with online access use a search engine to get informed and make decisions that may have medical, financial, cultural, political, security or other important implications [10, 37, 23, 29]. Moreover, 85% of the time, people do not look past the first ten results returned by the search engine [35]. Given this, it is not surprising that anyone with a web presence struggles for a place in the top ten positions of relevant web search results. The importance of top-10 placement has given birth to a new industry, which claims to sell know-how for prominent placement in search results and includes companies, publications, and even conferences.
Some of them are willing to bend the truth in order to fool the search engines and their customers, by creating web pages containing web spam. Web spamming is the practice of manipulating web pages in order to cause search engines to rank some web pages higher than they would without any manipulation.¹ The motive is usually commercial, but can also be political or religious. The creators of web spam are often specialized companies selling their expertise as a service, but can also be the web masters of the companies and organizations that would be their customers.

¹ We should mention here that there is no complete agreement on the definition of web spam among authors, which leads to some confusion. Moreover, people unfamiliar with web spam often mistake the term for email spam. A more descriptive name for it would be "search engine ranking manipulation" or "adversarial information retrieval."

Spammers attack search engines through text and link manipulations [22, 18]:
Text spam: This includes excessively repeating text and/or adding irrelevant text on the page, which causes incorrect calculation of page relevance, and adding misleading meta-keywords or irrelevant anchor text, which causes incorrect application of rank heuristics.

Link spam: This technique aims to change the perceived structure of the webgraph in order to cause incorrect calculation of page reputation. Examples include the so-called link farms, mutual admiration societies, page awards, domain flooding (a plethora of domains that redirect to a target site), etc.

Both kinds of spam aim to boost the ranking of spammed web pages. Sometimes cloaking is included as a third spamming technique [22, 19]. Cloaking aims to serve different pages to search engine robots and to web browsers (users). These pages could be created statically or dynamically. Static pages, for example, may employ hidden links and/or hidden text with colors or small font sizes noticeable by a crawler but not by a human. Dynamic pages might change content on the fly depending on the visitor, fake the clickstream or query stream, submit millions of pages to add-url forms of search engines, etc. We consider the false links and text themselves to be the spam, while, strictly speaking, cloaking is a tool that helps spammers hide their attacks.

Since anyone can be an author on the web, these practices have naturally created a question of information reliability. An audience used to trusting the written word of newspapers and books is unable, unprepared or unwilling to think critically about the information obtained from the web. In a recent study [16] we found that while college students regard the web as a primary source of information, many do not check more than a single source, and have trouble recognizing trustworthy sources online. In particular, two out of three students are consistently unable to differentiate between facts and advertising claims, even infomercials.
At the same time, they have considerable confidence in their abilities to distinguish trustworthy sites from non-trustworthy ones, especially when they feel technically competent. We have no reason to believe that the general public will perform any better than well-educated students. In fact, a recent analysis of internet-related fraud by a major Wall Street law firm [10] put the blame squarely on the investors for the success of stock fraud cases.

One of the reasons behind the users' difficulty in distinguishing trustworthy from untrustworthy information is the success that both search engines and spammers have enjoyed in the last decade. Users have come to trust search engines as a means of finding information, and spammers have managed to get them to transfer that trust to the results of the search. There is clearly a need for education of the users, so that people develop a healthy suspicion of unverified search results. Beyond that, though, there is a need for browser-level tools that will help the user move from suspicion to decision in determining which sites to trust.

From their side, the search engines have struggled to deliver spam-free results and have developed sophisticated search result ranking strategies. Two such ranking strategies that have received major attention are the PageRank [6, 2] and HITS [26] algorithms. Achieving high PageRank has become a sort of obsession for many companies' IT departments, and the raison d'être of spamming companies. Some estimates indicate that at least 8% of all indexed pages are spam [12], while experts consider web spamming the single most difficult challenge web searching is facing today [22].

In this paper we first examine the reasons web spamming has been so successful and its relationship to social propaganda. Then, we develop heuristics that are able to recognize web neighborhoods, especially untrustworthy ones.
We present experimental results that show considerable success in recognizing spamming neighborhoods. Finally, we discuss what we believe should be a frame for the long-term approach to web spam.

2. THE WEBGRAPH AS A SOCIAL NETWORK

The web is typically represented by a directed graph [8]. The nodes in the webgraph are the pages (or sites) that reside on servers on the internet. Arcs correspond to hyperlinks that appear on web pages (or sites). The theory of social networks [38] also uses directed graphs to represent relationships between social entities. The nodes (called "actors") correspond to social entities (e.g., people, institutions, ideas). Arcs (called "ties") correspond to social relations between the entities they connect (e.g., has influence on, knows, trusts).

This connection is more than a similarity in descriptions. The web itself is a social creation, and both PageRank and HITS are socially inspired algorithms [6, 26]. Socially inspired systems are subject to socially inspired attacks, however. Not surprisingly then, the theory of propaganda [28] can provide intuition into the dynamics of the web.

For example, PageRank is based on the assumption that the reputation of an entity (a web site in this case) can be measured as a function of both the number and reputation of other entities recommending it. A link to a web page is counted as a vote of confidence in this web site, and in turn, the reputation of a page is divided among those it is recommending [6]. Since HTML does not provide for positive and negative links, all links are taken as positive. This is not always true, but is considered a reasonable assumption. More importantly, there is also the implicit assumption that hyperlink voting is taking place independently, without prior agreement or central control.
Spammers, like social propagandists, form groups of sites that are able to gather a large number of such votes of confidence by design, thus breaking the assumption of independence in a hyperlink. Search engines consider such moves spam and would like to restrict them, but no algorithm can automatically recognize spamming sites based on graph isomorphism [5].

3. EVOLUTION OF THE SEARCH ENGINES

In the early 1990s, when the web numbered just a few million servers, the first-generation search engines ranked search results using classic information retrieval techniques: the more rare words two documents share, the more similar they are considered to be [34, 21]. A search query Q is simply a short document, and the results of a search for Q are ranked according to their (normalized) similarity to the query, which is treated as the value of the page. The first attack on this tf.idf ranking, as it is known, came from within the search engines. Around 1995, search
engines started selling search keywords to advertisers as a way of generating revenue: if a search query contained a sold keyword, the results would include targeted advertisement and a higher ranking for the link to the sponsor's web site. This is the first time we have a socially inspired ranking, which follows marketing practices of the real world. Mixing search results with paid advertisement raised serious ethical questions, but also showed the way to financial profit to spammers, who started their own attacks by creating pages containing many rare keywords to obtain a higher ranking score. In terms of propaganda theory, the spammers employed a variation of the technique of glittering generalities to confuse the first-generation search engines [28, 47]. The propagandist associates one or more suggestive words without evidence to alter the conceived value of a person or idea. To avoid spammers (and public embarrassment from the keyword selling practice), search engines would keep their exact ranking algorithm secret. Secrecy is no defense, however, since secret rules can be figured out by experimentation and reverse engineering (e.g., [33, 30]).

Second-generation search engines started employing more sophisticated ranking techniques in an effort to nullify the effects of glittering generalities. One of the more successful ones was based on the "link voting principle": each web site s has value equal to its popularity, which is influenced by the set B_s of sites pointing to site s. Lycos became the champion of this ranking technique and had its own popularity skyrocket around 1996 [31]. In doing so, it also distanced itself from the ethical questions introduced by combining advertising with ranking. Unfortunately, this ranking method did not succeed in stopping spammers either.
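The link voting principle can be illustrated with a minimal sketch. It assumes, purely for illustration, that a site's popularity is simply the size of B_s (the text only says popularity is "influenced by" B_s); the function name `link_popularity` and all `.example` domains are hypothetical.

```python
def link_popularity(out_links):
    """Map each site s to |B_s|, the number of distinct other sites linking to s."""
    backers = {}
    for site, targets in out_links.items():
        backers.setdefault(site, set())
        for target in targets:
            if target != site:  # a site cannot vote for itself
                backers.setdefault(target, set()).add(site)
    return {site: len(sources) for site, sources in backers.items()}


# Two independent sites recommend honest.example ...
web = {
    "a.example": ["honest.example"],
    "b.example": ["honest.example"],
}
# ... while five sites controlled by one owner all point at promoted.example.
farm = {f"farm{i}.example": ["promoted.example"] for i in range(5)}

scores = link_popularity({**web, **farm})
print(scores["honest.example"], scores["promoted.example"])  # prints: 2 5
```

Because every in-link counts the same under this assumption, whoever controls more linking sites wins, regardless of whether the votes were cast independently.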
Spammers started creating clusters of interconnected web sites that had identical or similar contents to the site they were promoting, which subsequently became known as "link farms" (LF). The link voting principle was socially inspired, so spammers used the well-known propagandistic method of bandwagon to circumvent it [28, 105]. Using this technique, the propagandist promotes the impression of a high degree of recommendation by interlinking many internally controlled sites that will eventually all share high ranking.

The introduction of PageRank in 1998 was a major development for search engines, because it seemed to provide a more sophisticated anti-spamming solution to the bandwagon technique. Under PageRank, not every link contributes equally to the reputation of a page. Instead, links from highly reputable pages contribute much more than links from other sites. That way, the site networks developed by spammers would not much influence their PageRank, and Google became the search engine of choice. A page p has value equal to its reputation R(p), which is calculated as the sum of fractions of the reputations of the set B_p of pages pointing to p.

HITS is another socially inspired ranking which has also received a lot of attention [26]. The HITS algorithm divides the sites related to a query between hubs and authorities. Hubs are sites that contain many links to authorities, while authorities are sites pointed to by the hubs. (This circular definition can be resolved.) PageRank and HITS marked the development of the third generation.²

² [7] considers the search engines in our 2nd and 3rd generation to be in the same group. We believe that both the ranking and attack methods put them in different categories.

Unfortunately, spammers have again found ways of circumventing PageRank. In PageRank, a page enjoys some absolute reputation, that is, its reputation is not restricted to some particular issue.
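As a rough formalization of the reputation rule described above (a sketch, not necessarily the notation of the original paper), write N_q for the number of out-links of page q, a symbol introduced here for illustration, and assume that q's reputation is divided equally among the pages it links to:

```latex
R(p) = \sum_{q \in B_p} \frac{R(q)}{N_q}
```

Deployed variants of PageRank typically add a damping term to this recurrence; it is omitted here because the text does not discuss it. Nothing in the sum depends on the topic of p or q, which is the sense in which the reputation is absolute.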
So, spammers develop sites with expertise on irrelevant subjects, and they justifiably acquire high ranking on their expert sites. Then they interlink their networked sites with the expert sites, creating what is called a "mutual admiration society" (MAS), which causes all sites to share a higher PageRank and fools the search engine. This is the well-known propagandistic technique of testimonials, where well-known people (entertainers, public figures, etc.) offer their opinion on issues about which they are not experts [28, 74]. HITS has also been shown to be highly spammable by this technique [25] because its effectiveness depends on the accuracy of the initial neighborhood calculation.

The table below summarizes our findings for the first three generations of search engines and the correspondence between web spam and social propaganda.

SE        Ranking             Spamming                   Propaganda
1st Gen   Doc similarity      keyword stuffing           glittering generalities
2nd Gen   + Site popularity   + link farms               + bandwagon
3rd Gen   + Page reputation   + mutual admiration soc.   + testimonials

Web search corporations are reportedly busy developing the engines of the next generation [7]. The new search engines hope to be able to recognize the need behind the query of the user. Given the success the spammers have enjoyed so far, one wonders how they will spam the fourth-generation engines. Is it possible to create a ranking that is not spammable? Put another way, can the web as a social space be free of propaganda? Seen in this light, it appears that we are trying to create in cyberspace what human societies have not succeeded in creating in their social space. However, as in society, we can learn to live successfully with propaganda, given appropriate education and technology.

4. EXPLORING WEB NEIGHBORHOODS

Since spammers employ propagandistic techniques, as we have argued above, it makes sense to design anti-propagandistic methods for defending against them. These methods need to be user-guided.
Propaganda, after all, does not always have a negative connotation [11]. Advertisement is a form of propaganda that we have all learned to live with. The art of persuasion is objectionable mainly when it is used to promote a message that is untrustworthy in the receiver's opinion. When such an untrustworthy message is detected, it becomes a reason for us to reconsider the messenger. Messengers who strongly support an untrustworthy message become untrustworthy themselves. This process is selectively repeated for a few steps, propagating the distrust of the original message back to those who show support for it. The results of this process become part of the user's belief system and are used to filter future information.

Propagation of distrust contrasts with the propagation of trust in that it progresses backwards through the graph, i.e., a page's reliability decreases if it links to untrustworthy sites. Current algorithms for ranking pages propagate trust
[Figure omitted: only its node labels survived extraction. Each label combines a numeric value with a site URL, for example 0.326_http://kesda.nii.co.nz/.]
_http:// _http:// _http:// 0.0_http:// _http:// _http:// _http:// _http:// _http:// _http://d5.dir.scd.yahoo.com/ 0.253_http://herbal-nutrition.net/ 0.202_http:// _http://manchester.hotels-booking-server.co.uk/ 0.228_http:// _http:// 0.0_http:// _http:// _http:// _http:// _http:// _http://pub44.bravenet.com/ 0.04_http:// _http://k.st/ 0.31_http:// 0.06_http:// _http:// _http:// 0.2_http:// _http:// _http:// _http:// _http:// _http://xu.frontside.com/ 0.044_http:// _http:// _http:// _http:// _http:// _http:// _http:// _http:// _http:// _http:// _http://research.internetfilter.com/ 0.185_http://007dating.com/ 0.189_http://trial-swing.netfirms.com/ 0.216_http://freedating.a1finder.com/ 0.169_http://onetoonedating.tripod.com/ 0.206_http:// _http://freedatingsingles.tripod.com/ 0.175_http://gotodating.datehunters.com/ 0.171_http://adultfriendfinder.a1finder.com/ 0.306_http:// _http://mega-dating.net/ 0.306_http://matchmakingdirectory.com/ 0.141_http:// _http:// _http:// _http://kiss-personals.4t.com/ 0.128_http://adult_personals_plus.tripod.com/ 0.124_http:// _http:// _http:// 0.0_http:// _http:// 0.23_http:// _http://english.hk.yahoo.com/ 0.114_http:// _http:// _http://eastcoastmuscle.com/ 0.266_http:// / 0.066_http:// _http:// _http:// _http:// _http:// 0.0_http:// _http:// _http://asia.dir.yahoo.com/ 0.057_http:// _http:// _http:// _http:// _http://guide.walla.co.il/ 0.0_http://zoek.zonnet.nl/ 0.0_http://porcino.org/ 0.183_http:// _http:// _http://i-directory.org/ 0.143_http:// _http:// _http:// _http:// 0.12_http://dir.eroticsurf.com/ 0.0_http:// _http:// _http:// 0.18_http:// _http:// _http:// _http:// _http:// _http:// _http:// _http:// _http:// _http:// 0.19_http:// _http:// _http:// _http:// _http:// 0.0_http://navegandaluz.com/ 0.0_http:// _http:// 0.0_http:// _http:// _http:// 0.0_http:// _http://balivision.com/ 0.118_http://freewebhosting.hostdepartment.com/ 0.0_http:// 0.0_http:// _http:// 0.0_http://boards3.melodysoft.com/ 0.334_http:// forwards, i.e., a 
page's rank is increased if a trustworthy site links to it. The heuristic in this paper focuses solely on distrust, but in future work we plan to investigate the combination of the two.

Following the social process above, we design an algorithm that follows anti-propagandistic practices in order to recognize a spamming network. Our algorithm takes as input a page that the user has determined to be untrustworthy. This page could have come to the user through web search results or via the recommendation of some trusted associate (e.g., a society that the user belongs to).

In the first stage, our algorithm follows, for a few steps, the back links of the authority site that contains this page. Thus, we create a portion of the web graph that supports the starting site (its "trust graph"). In the process we also sample the contents of the sites in this subgraph to determine their similarity to the starting site's contents.

In the second stage, the trust graph's structure and contents are analyzed. The subgraph that heavily supports the starting site becomes suspicious for spamming. In our experiments we found that the biconnected component (BCC) of the graph that includes the starting site s is such an appropriate subgraph. We divide the sites in the BCC into two groups: those that have contents similar to s, and those that do not. The former are considered members of a link farm (LF), while the latter are considered members of a mutual admiration society (MAS). The members of the link farm are discredited, and the members of the MAS are downgraded (although this result could perhaps be user-selectable, for a less conservative filtering).

5. EXPERIMENTAL RESULTS

We define the h-trust neighborhood of some site S as the largest biconnected component containing S of the graph composed of the sites that are no more than h links away from S. In our experiments, we examined 3-trust neighborhoods of web sites, both trustworthy and not.
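The definition of the h-trust neighborhood can be illustrated with a minimal sketch. It assumes a hypothetical callable `backlinks(site)` that returns the sites linking to `site`, treats links as undirected when computing biconnected components, and uses a standard edge-stack depth-first search; the function names are illustrative and are not taken from the tool described in this paper.

```python
from collections import deque


def backward_neighborhood(start, backlinks, h):
    """Collect sites at most h backlink steps from start, plus the links seen.

    `backlinks` is a hypothetical callable: site -> iterable of linking sites.
    Returns (sites, edges) with edges stored as undirected frozenset pairs.
    """
    depth = {start: 0}
    edges = set()
    queue = deque([start])
    while queue:
        site = queue.popleft()
        if depth[site] == h:          # do not expand beyond h steps
            continue
        for src in backlinks(site):
            if src == site:
                continue
            edges.add(frozenset((src, site)))
            if src not in depth:
                depth[src] = depth[site] + 1
                queue.append(src)
    return set(depth), edges


def biconnected_components(adj):
    """Return the biconnected components of an undirected graph as node sets.

    `adj` maps each node to the set of its neighbours. The DFS is iterative,
    so large neighborhoods do not hit Python's recursion limit.
    """
    disc, low, comps = {}, {}, []
    for root in adj:
        if root in disc:
            continue
        disc[root] = low[root] = len(disc)
        edge_stack = []
        stack = [(None, root, iter(adj[root]))]
        while stack:
            parent, node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if nxt == parent:
                    continue
                if nxt not in disc:                    # tree edge: descend
                    disc[nxt] = low[nxt] = len(disc)
                    edge_stack.append((node, nxt))
                    stack.append((node, nxt, iter(adj[nxt])))
                    advanced = True
                    break
                if disc[nxt] < disc[node]:             # back edge to an ancestor
                    edge_stack.append((node, nxt))
                    low[node] = min(low[node], disc[nxt])
            if advanced:
                continue
            stack.pop()
            if parent is None:
                continue
            low[parent] = min(low[parent], low[node])
            if low[node] >= disc[parent]:              # parent closes a component
                comp = set()
                while True:
                    u, v = edge_stack.pop()
                    comp.update((u, v))
                    if (u, v) == (parent, node):
                        break
                comps.append(comp)
    return comps


def trust_neighborhood(start, backlinks, h=3):
    """Largest biconnected component containing start within h backlink steps."""
    sites, edges = backward_neighborhood(start, backlinks, h)
    adj = {s: set() for s in sites}
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    containing = [c for c in biconnected_components(adj) if start in c]
    return max(containing, key=len, default={start})
```

In this sketch a site with no backlinks yields a neighborhood containing only itself, and a single link counts as a (trivial) two-site component; whether such trivial components should be reported is a design choice the definition above leaves open.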
We already know that the neighborhoods of spammers and non-spammers cannot be distinguished simply on graph-theoretic terms [14, 5]. They can, however, be recognized from an untrustworthy starting point by backward propagation of distrust.

To evaluate the trustworthiness of each site we had an evaluator look at the sites of the BCC. Each site was then classified as Trustworthy, Untrustworthy, or Nondetermined. The last category includes a variety of sites for which the evaluator could not make a clear determination due to the language used in the site, the subject matter, or the fact that a Blog or Directory cannot fall simply into one of the U/T categories.

It is perhaps valuable to reiterate here that we consider trustworthiness to be a personal decision, not an absolute quality of a site. One person's gospel is another's political propaganda, and our goal is to find tools that help individuals make more informed decisions about the quality of the information they find on the web.

In selecting the theme to evaluate, we considered a number of commercial, political, medical and financial issues. For the purpose of this paper we focused on a subject that, according to the evaluator's opinion, was clearly untrustworthy. In particular, the evaluator decided that he does not believe that a product for $39.95 would significantly increase muscle mass without increased exercise, decrease fat without change in diet or habits, enhance sexual performance, increase the good cholesterol while decreasing the bad, regrow hair, decrease blood pressure, remove wrinkles, and increase memory retention. In our experiments, we examined the neighborhoods of six such sites, as well as two sites judged to be trustworthy, labeled below as U-1 to U-6 and T-1 to T-2, respectively.

Figure 1: The BCC of target site U-1.

Figure 1 shows the BCC of U-1, a successful spammer.
The experiment revealed a neighborhood of 1380 sites, 266 of which were connected with 593 edges into a BCC forming the 3-trust neighborhood of the target site. The density and the size of the BCC are comparable to some of the larger BCCs of sites that openly promote reciprocal linking.

The algorithm we outlined in the previous section would be very difficult to implement completely and efficiently at the browser side. We convert it to a heuristic that can be implemented on an average workstation and produce results within minutes. The goal of implementing a quick algorithm introduces a number of simplifications, explained below.

To gather the back links of a site we use the Google API [15]. Still, some sites can have thousands of back links while others have only a few. For this experiment, we limited the number of backlinks for each site to 30. We call this parameter the backlink fan.

Determining the similarity of a site to the starting site was done by sampling a few pages of each site. We have noticed from other experiments that when one samples 5-10 pages per site one can get an excellent similarity measure. To speed things up in this experiment, however, we only sampled two pages per site. Even such a small sample, though, produced more than acceptable results. Similarity was determined using the df.idf ranking on the universe of the sites explored.

To decide on the cutoff point between LFs and MASs, we created a site that contained a few random news pages from Reuters. We call this site the divider site. Sites more similar to the target site than the divider site is are categorized as LF sites; the others are grouped into MAS sites.
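The divider-site cutoff can be sketched as follows. This is only an illustration under stated assumptions: it substitutes a generic smoothed TF-IDF weighting with cosine similarity for the ranking used in the experiments, it takes each site's sampled pages as an already tokenized word list, and the names `tfidf_vectors`, `cosine` and `split_lf_mas` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict site -> list of tokens sampled from that site.

    Returns dict site -> {term: weight}, using a smoothed idf computed over
    the universe of explored sites (an assumed weighting, not the paper's).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for site, tokens in docs.items():
        tf = Counter(tokens)
        vectors[site] = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def split_lf_mas(target, divider, docs):
    """Split sites into link farm (LF) and mutual admiration society (MAS).

    A site is placed in LF when it is more similar to the target than the
    divider site is; otherwise it is placed in MAS.
    """
    vectors = tfidf_vectors(docs)
    cutoff = cosine(vectors[target], vectors[divider])
    lf, mas = [], []
    for site in docs:
        if site in (target, divider):
            continue
        score = cosine(vectors[target], vectors[site])
        (lf if score > cutoff else mas).append(site)
    return lf, mas
```

Using the divider's own similarity as the threshold avoids choosing an absolute cutoff value: the threshold adapts to the vocabulary of whatever universe of sites was explored.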
Finally, to further reduce the effect of the explosive nature of the web, we introduced the concept of stop sites. A stop site is one that the user believes should not be included in the trust graph either because the trustworthiness of such a site is known or because it cannot be defined. In the first group we placed educational institutions as determined by their URL. In the latter we placed a few well-known Directories and Blog sites.

We recognize that each of the decisions above can be strengthened, thus strengthening the results of our approach. In particular, increasing the fan parameter above 30 will recognize more sites in the neighborhood. Using a larger universe in calculating similarity and increasing the sampling of pages per site will give a better approximation of the LF and MAS groups. Our rather conservative approach provided a solid proof of concept for our hypothesis, while remaining fast enough for browser-side implementation.

We ran a breadth-first search on the backlinks for each target site, looking at the 3-trust neighborhood with a fan of 30 sites. We categorized the sites in the neighborhood into members of a link farm (LF) or members of a mutual admiration society (MAS) based on their similarity to the target site. The sites were then evaluated for trustworthiness. Due to the effort involved, only a randomly chosen 10% of the MAS sites were evaluated. All of the LF sites were evaluated, however.

As you can see from the results below, there were almost no trustworthy sites in the 3-trust neighborhoods of the untrustworthy ones. As one might expect, a trustworthy site is unlikely to deliberately link to an untrustworthy one, or even to one that associates with one.
Not surprisingly, the statement is not as strong for the trustworthy sites, since untrustworthy sites are free to link to whomever they choose (although thanks to PageRank, spammers are unlikely to want to link to too many sites outside their spamming network in order to avoid leaking rank [5]). In the table below, T(LF) and T(MAS) are the percentages of evaluated LF and MAS sites judged Trustworthy, and U(LF) and U(MAS) are the percentages judged Untrustworthy.

Target   T(LF)   T(MAS)   U(LF)   U(MAS)
U-1      0%      0%       96%     88%
U-2      0%      0%       100%    65%
U-3      0%      0%       95%     100%
U-4      0%      0%       88%     83%
U-5      0%      0%       100%    73%
U-6      11%     9%       89%     57%
T-1      86%     70%      14%     0%
T-2      64%     64%      33%     13%

Our experiments showed that the quality of the starting site was a very good predictor for the quality of the BCC sites. While most of the results above show no accidental or erroneous linking of a trustworthy site to an untrustworthy one, we found such evidence in one experiment.

6. RELATED WORK

Web spamming has received a lot of attention lately [1, 3, 4, 5, 12, 13, 19, 21, 22, 24, 27, 29, 30, 33]. The first papers to raise the issue were [30, 22]. The spammers' success was noted in [13, 12, 10, 2, 4, 16, 23]. Characteristics of spamming sites based on diversion from power laws are presented in [12]. Current tricks employed by spammers are detailed in [18]. An analysis of the popular PageRank method employed by many search engines today, and ways to maximize it in a spamming network, is described in [5]. A modification to PageRank that takes into account the opinions of human editors, employees of a search engine, is presented in [19]. A comprehensive treatment of social networks is presented in [38]. The connection between web spammers and social propagandists, and how the evolution of search engines can be understood as a response to spammers, is presented in [32]. Propagation methods for trust and distrust are discussed in [17]. Some work on personalized web search is presented in [20, 25]. The effect that search engines have on page popularity was discussed in [9].

7. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper we have argued that web spam is to cyberworld what propaganda is to society.
As far as we know, this is the first time this relationship has been noted. As evidence of this analogy and its importance, we have shown that the evolution of search engines can be simply understood as the search engines' response to defend against spam. New search engines are not invented every few years, as it is sometimes reported; they are developed when researchers have a good answer to spam. More importantly, we have shown that this relationship can guide us towards developing heuristics that recognize spammers. In particular, we have presented automatic ways of recognizing trust neighborhoods on the web based on the biconnected component around some starting site. Experimental results from a number of such instances show our algorithm's ability to recognize parts of a spamming network.

With such results, the question arises as to what one should do once one recognizes a spamming network. This is a question that has not attracted much attention in the past. The obvious approach is that a search engine would delete such networks from its indices [12] or might downgrade them by some prespecified amount [19], as has been reported in the past [36]. Both of these approaches, however, require a universal agreement on what constitutes spam. Such an agreement cannot exist; one person's spam may be another person's treasure. Should the search engines determine what is trustworthy and what is not? Willing or not, they are the de facto arbiters of what information users see. As in a well-known cartoon, the kid responds to the old man who has been looking all his life for the meaning of life: "If it is not on Google or eBay, it does not exist."

We believe that it is the users' right and responsibility to decide what is acceptable for them. Their browser, their window to cyberworld, should enhance their ability to make this decision. User education is fundamental: people should know how search engines work and why, and how information appears on the web.
But they should also have a browser that can help them determine the validity and trustworthiness of information. The tool we described in an earlier section is a first step in this direction. Ultimately, it would be used along with a set of trust certificates that contains the portable trust preferences of the user, a set of preferences that the user can accumulate over time. A combination of search engines capable of providing indexed content and structure, including identified neighborhoods, with a browser capable of filtering those neighborhoods through the user's trust preferences,
would provide a new level of reliability to the user's information gathering.

8. ACKNOWLEDGEMENTS

The authors would like to thank Mirena Chausheva, Meredith Beaton-Lacoste and Scott Dynes for their valuable contributions.

9. REFERENCES

[1] B. Amento, L. Terveen, and W. Hill. Does authority mean quality? Predicting expert quality ratings of web documents. In Proceedings of the Twenty-Third Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Transactions on Internet Technology, 1(1):2-43, June.
[3] K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society of Information Science, 51(12).
[4] K. Bharat, B.-W. Chang, M. R. Henzinger, and M. Ruhl. Who links to whom: Mining linkage between web sites. In Proceedings of the 2001 IEEE International Conference on Data Mining. IEEE Computer Society.
[5] M. Bianchini, M. Gori, and F. Scarselli. PageRank and web communities. In Web Intelligence Conference 2003, Oct.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
[7] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3-10.
[8] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking. North-Holland Publishing Co.
[9] J. Cho and S. Roy. Impact of search engines on page popularity. In WWW 2004, May.
[10] T. S. Corey. Catching on-line traders in a web of lies: The perils of internet stock fraud. Ford Marrin Esposito, Witmeyer & Glesser, LLP, May.
[11] G. Cybenko, A. Giani, and P. Thompson. Cognitive hacking: A battle for the mind. Computer, 35(8):50-56.
[12] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics. In WebDB2004, June.
[13] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the twelfth international conference on World Wide Web. ACM Press.
[14] G. W. Flake, S. Lawrence, C. L. Giles, and F. Coetzee. Self-organization of the web and identification of communities. IEEE Computer, 35(3):66-71.
[15] Google Inc. The Google API.
[16] L. Graham and P. T. Metaxas. "Of course it's true; I saw it on the Internet!": Critical thinking in the Internet era. Commun. ACM, 46(5):70-75.
[17] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In WWW 2004, May.
[18] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. Technical Report, Stanford University.
[19] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In VLDB 2004, Aug.
[20] T. H. Haveliwala. Topic-sensitive PageRank. In Proceedings of the eleventh international conference on World Wide Web. ACM Press.
[21] M. R. Henzinger. Hyperlink analysis for the web. IEEE Internet Computing, 5(1):45-50.
[22] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11-22.
[23] M. Hindman, K. Tsioutsiouliklis, and J. Johnson. Googlearchy: How a few heavily-linked sites dominate politics on the web. In Annual Meeting of the Midwest Political Science Association, April.
[24] L. Introna and H. Nissenbaum. Defining the web: The politics of search engines. Computer, 33(1):54-62.
[25] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the twelfth international conference on World Wide Web. ACM Press.
[26] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5).
[27] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16).
[28] A. Lee and E. B. Lee (eds.). The Fine Art of Propaganda. The Institute for Propaganda Analysis. Harcourt, Brace and Co.
[29] C. A. Lynch. When documents deceive: Trust and provenance as new factors for information retrieval in a tangled web. J. Am. Soc. Inf. Sci. Technol., 52(1):12-17.
[30] M. Marchiori. The quest for correct information on the web: Hyper search engines. Comput. Netw. ISDN Syst., 29(8-13).
[31] M. L. Mauldin. Lycos: Design choices in an internet search service. IEEE Expert, January-February, pages 8-11.
[32] P. T. Metaxas. Web spam: An application of propaganda theory. Technical Report CSD-TR, Wellesley College.
[33] G. Pringle, L. Allison, and D. L. Dowe. What is a tall poppy among web pages? In Proceedings of the seventh international conference on World Wide Web 7. Elsevier Science Publishers B. V., 1998.
[34] G. Salton. Dynamic document processing. Commun. ACM, 15(7).
[35] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6-12.
[36] M. Totty and M. Mangalindan. As Google becomes web's gatekeeper, sites fight to get in. In Wall Street Journal CCXLI(39), February.
[37] A. Vedder. Medical data, new information technologies and the need for normative principles other than privacy rules. In Law and Medicine. M. Freeman and A. Lewis (Eds.), (Series Current Legal Issues). Oxford University Press.
[38] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
------------- Scope Of Services At Dataflurry Prospectus Prospective Client Overview Overview Of Services Offered By Dataflurry To Increase Targeted Search Engine Traffic Overview Of Methodologies Used
Successful Search Engine Marketing
Axandra Proven Methods For Successful Search Engine Marketing How to: Invest 1 for your web site success! ü Ü Get and maintain top 10 rankings on Google, Yahoo, MSN and other major search engines. Ü Get
SEO is one of three types of three main web marketing tools: PPC, SEO and Affiliate/Socail.
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
Online Reputation Management Services
Online Reputation Management Services Potential customers change purchase decisions when they see bad reviews, posts and comments online which can spread in various channels such as in search engine results
http://www.khumbulaconsulting.co.za/wp-content/uploads/2016/03/khumbulaconsulting-seo-certificate.pdf Domain
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
101 Basics to Search Engine Optimization
101 Basics to Search Engine Optimization (A Guide on How to Utilize Search Engine Optimization for Your Website) By SelfSEO http://www.selfseo.com For more helpful SEO tips, free webmaster tools and internet
http://www.silverstoneguesthouse.co.za/wp-content/uploads/2015/11/silverstone-seo-certificate.pdf Domain
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:http://www.ijoer.
RESEARCH ARTICLE SURVEY ON PAGERANK ALGORITHMS USING WEB-LINK STRUCTURE SOWMYA.M 1, V.S.SREELAXMI 2, MUNESHWARA M.S 3, ANIL G.N 4 Department of CSE, BMS Institute of Technology, Avalahalli, Yelahanka,
Digital media glossary
A Ad banner A graphic message or other media used as an advertisement. Ad impression An ad which is served to a user s browser. Ad impression ratio Click-throughs divided by ad impressions. B Banner A
http://www.boundlesssound.co.za/wp-content/uploads/2016/02/boundlesssound-seo-certificate.pdf Domain
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
http://firstsourcemoney.org/wp-content/uploads/2015/05/firstsourceholdings-seo-certificate.pdf Domain
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
Top 40 Email Marketing Terms You Should Know
1601 Trapelo Road Phone 781-472-8100 Suite 246 Fax 781-472-8101 Waltham, MA 02451 www.constantcontact.com Top 40 Email Marketing Terms You Should Know If you have ever felt out of your depth in a discussion
Search engine marketing
Search engine marketing 1 Online marketing planning PHASE 1 Current marketing situation analysis PHASE 2 Defining Strategy Setting Web Site Objective PHASE 3 Operational action programmes PHASE 4 Control
SEO Analysis Guide CreatorSEO easy to use SEO tools
CreatorSEO Analysis Guide Updated: July 2010 Introduction This guide has been developed by CreatorSEO to help our clients manage their SEO campaigns. This guide will be updated regularly as the Search
Increasing Traffic to Your Website Through Search Engine Optimization (SEO) Techniques
Increasing Traffic to Your Website Through Search Engine Optimization (SEO) Techniques Small businesses that want to learn how to attract more customers to their website through marketing strategies such
An Overview of Spam Blocking Techniques
An Overview of Spam Blocking Techniques Recent analyst estimates indicate that over 60 percent of the world s email is unsolicited email, or spam. Spam is no longer just a simple annoyance. Spam has now
enhanced landing page groups and meetings template guidelines
enhanced landing page groups and meetings template guidelines table of contents groups and meetings templates 03 groups and meetings template specifications 04 web best practices 05 writing for enhanced
http://www.panstrat.co.za/wp-content/uploads/2015/05/panstrat-seo-certificate.pdf Domain
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
http://www.panstrat.co.za/wp-content/uploads/2015/05/panstrat-seo-certificate.pdf Domain
SEO Search Engine Optimization ~ Certificate ~ The most advance & independent SEO from the only web design company who has achieved 1st position on google SA. Template version: 2nd of April 2015 For Client
graphical Systems for Website Design
2005 Linux Web Host. All rights reserved. The content of this manual is furnished under license and may be used or copied only in accordance with this license. No part of this publication may be reproduced,
