IR - STP07, Lecture 3: Web Crawling
Martin Hassel (xmartin@dsv.su.se)
DSV - Department of Computer and Systems Sciences
Stockholm University and KTH - Royal Institute of Technology

Web Crawling
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
(In parts based on material by Bing Liu & Filippo Menczer.)

Many Names
- Crawler
- Spider
- Robot (or bot)
- Web agent
- Wanderer, worm, ...
- And famous instances: googlebot, scooter, slurp, msnbot, ...

Motivation for Crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Vertical (specialized) search engines, e.g. news, shopping, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors and partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming and phishing
- Can you think of some others?

Basic Crawler
- This is a sequential crawler
- Seeds can be any list of starting URLs
- The order of page visits is determined by the frontier data structure
- The stop criterion can be anything

Graph Traversal: Breadth-First Search
- Implemented with a QUEUE (FIFO)
- Finds pages along shortest paths
- If we start with good pages, this keeps us close to the topic
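A minimal sketch of the sequential crawler just described, with a FIFO frontier giving breadth-first visit order (the seed URL and the regex-based link extraction are illustrative assumptions, not part of the lecture):

```python
import re
import urllib.request
from collections import deque

def crawl(seeds, max_pages=100):
    """Sequential crawler: FIFO frontier -> breadth-first visit order."""
    frontier = deque(seeds)            # the frontier data structure
    visited = set(seeds)               # lookup table: never fetch a page twice
    while frontier and max_pages > 0:  # stop criterion: page budget
        url = frontier.popleft()       # FIFO; pop from the end for DFS
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                   # robust fetcher: skip failed downloads
        max_pages -= 1
        # Naive link extraction; a real crawler would use an HTML parser.
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in visited:
                visited.add(link)
                frontier.append(link)

crawl(["http://www.dsv.su.se/"])
```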

Depth-First Search
- Implemented with a STACK (LIFO)
- Wanders away ("lost in cyberspace")
- What about dynamic pages (URLs)?

Implementation Issues
- Don't want to fetch the same page twice!
  - Keep a lookup table (hash) of visited pages
  - What if a page has not been visited but is already in the frontier?
- The frontier grows very fast! May need to prioritize for large crawls
- The fetcher must be robust!
  - Don't crash if a download fails; have a timeout mechanism
  - Detect and break redirection loops
- Determine the file type to skip unwanted files
  - Can try using extensions, but not reliable
  - Can issue HEAD HTTP commands to get Content-Type (MIME) headers, at the overhead of extra Internet requests

Parsing
- HTML has the structure of a Document Object Model (DOM) tree
- Unfortunately, actual HTML is often incorrect in a strict syntactic sense
- Crawlers, like browsers, must be robust/forgiving
- Must pay attention to HTML entities and Unicode in text
- What to do with a growing number of other formats? Flash, SVG, RSS, AJAX, ...

Relative vs. Absolute URLs
- The crawler must translate relative URLs into absolute URLs
- Need to obtain the base URL from the HTTP header, or an HTML meta tag, or else default to the current page path
- Examples:
  - Base: http://www.cnn.com/linkto/
  - Relative URL: intl.html -> Absolute URL: http://www.cnn.com/linkto/intl.html
  - Relative URL: /US/ -> Absolute URL: http://www.cnn.com/US/

URL Canonicalization
- All of these:
  - http://www.cnn.com/tech
  - http://www.cnn.com/tech/
  - http://www.cnn.com:80/tech/
  - http://www.cnn.com/bogus/../tech/
- are really equivalent to this canonical form: http://www.cnn.com/tech
- Why is this a problem? To avoid duplication, the crawler must transform all URLs into canonical form!

Spider Traps
- Misleading sites: an indefinite number of pages dynamically generated by CGI scripts
- Paths of arbitrary depth created using soft directory links and path-rewriting features in the HTTP server
- Only heuristic defensive measures:
  - Check URL length; assume a spider trap above some threshold, for example 128 characters
  - Watch for sites with a very large number of URLs
  - Eliminate URLs with non-textual data types
  - May disable crawling of dynamic pages (if they can be detected)
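URL resolution and several of the canonicalization steps above can be sketched with Python's standard urllib.parse; the exact set of normalization rules (which port, slash, and dot-segment cases to collapse) is an assumption here, since production crawlers differ:

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit

# Resolving relative URLs against a base URL, as a browser would:
base = "http://www.cnn.com/linkto/"
assert urljoin(base, "intl.html") == "http://www.cnn.com/linkto/intl.html"
assert urljoin(base, "/US/") == "http://www.cnn.com/US/"

def canonicalize(url):
    """Map syntactic variants of a URL onto one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme, netloc = scheme.lower(), netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                # drop the default port
    path = posixpath.normpath(path or "/")  # remove /bogus/../ dot segments
    if path == ".":                         # normpath("") returns "."
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))  # drop fragment

# All four variants from the slide collapse to the same canonical URL:
for u in ("http://www.cnn.com/tech", "http://www.cnn.com/tech/",
          "http://www.cnn.com:80/tech/", "http://www.cnn.com/bogus/../tech/"):
    assert canonicalize(u) == "http://www.cnn.com/tech"
```

A URL-length check against spider traps fits naturally here too, e.g. rejecting any canonicalized URL longer than the 128-character threshold mentioned above.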

Universal Crawlers
- Support universal search engines
- Large-scale: the huge cost (network bandwidth) of the crawl is justified by the many queries from lots and lots of users
- Incremental updates to the existing index and other data repositories

Large-Scale Universal Crawlers
- Two major issues:
  1. Performance: need to scale up to billions of pages
  2. Policy: need to trade off coverage, freshness, and bias (e.g. toward important pages)

Policy Issues
- Coverage: new pages get added all the time; can the crawler find every page?
- Freshness: pages change over time, get removed, etc.; how frequently can a crawler revisit?
- Trade-off! Focus on the most important pages (crawler bias)? Importance is subjective

Maintaining a Fresh Index
- Universal crawlers are never done
- High variance in the rate and amount of page changes
- HTTP headers (last-modified, expires) are notoriously unreliable
- Solution? Estimate the probability that a previously visited page has changed since the last visit
- However, most changes to the web are due to the addition of new pages, not the revision of existing ones

Coverage
- There is an abundance of pages on the Web; if we cover too much, the index will get stale
- For PageRank, pages with very low prestige are largely useless
- What is the goal?
  - General search engines: pages with high prestige
  - News portals: pages that change often
  - Vertical portals: pages on some topic
- What are appropriate priority measures in these cases? Approximations?

Preferential Crawlers
- Assume we can estimate an importance measure I(p) for each page p
- Want to visit pages in order of decreasing I(p)
- Maintain the frontier as a priority queue sorted by I(p)
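A priority-queue frontier ordered by I(p) is straightforward with Python's heapq. In the sketch below, I(p) blends a static prestige score with the change probability 1 - exp(-lambda * dt) suggested under "Maintaining a Fresh Index"; this particular blend is an illustrative assumption, not a scheme from the lecture:

```python
import heapq
import math
import time

class PriorityFrontier:
    """Frontier kept as a priority queue, popping the highest I(p) first."""

    def __init__(self):
        self._heap = []

    def push(self, url, imp):
        # heapq is a min-heap, so store -I(p) to pop the best page first.
        heapq.heappush(self._heap, (-imp, url))

    def pop(self):
        neg_imp, url = heapq.heappop(self._heap)
        return url, -neg_imp

def importance(prestige, change_rate, last_visit):
    """Illustrative I(p): prestige weighted by estimated staleness.

    change_rate (lambda) would be estimated from changes observed on
    earlier visits; 1 - exp(-lambda * dt) is then the probability that
    the page has changed since we last saw it, assuming changes arrive
    roughly as a Poisson process.
    """
    dt = time.time() - last_visit
    return prestige * (1.0 - math.exp(-change_rate * dt))

frontier = PriorityFrontier()
frontier.push("http://www.cnn.com/", importance(0.9, 1e-5, time.time() - 86400))
frontier.push("http://example.org/", importance(0.2, 1e-6, time.time() - 3600))
print(frontier.pop())  # the high-prestige, likely-stale page comes out first
```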

Possible Criteria
- Selective bias toward some pages, e.g. most relevant/topical, closest to seeds, most popular/largest PageRank, unknown servers, highest rate/amount of change, etc.
- Focused crawlers: supervised learning, i.e. a classifier trained on labeled examples
- Topical crawlers: best-first search based on similarity(topic, parent)
- Adaptive crawlers: reinforcement learning, evolutionary algorithms/artificial life

Examples
- Breadth-First: exhaustively visit all links in the order encountered
- Best-N-First: priority queue sorted by similarity, explore the top N at a time; variants: DOM context, hub scores
- PageRank: priority queue sorted by keywords and PageRank

More Examples
- SharkSearch: priority queue sorted by a combination of similarity, anchor text, similarity of the parent, etc.
- InfoSpiders: adaptive distributed algorithm using an evolving population of learning agents

Focused Crawlers
- Use a classifier (for instance Naïve Bayes) trained on example seed pages in the desired topic
- Score each incoming page and determine whether to go deeper
- Can have multiple topics with as many classifiers, with scores appropriately combined (Chakrabarti et al. 1999)
- Can attempt to find topical hubs periodically, and add these to the frontier
- Alternative classifier algorithms to Naïve Bayes, e.g. SVMs and neural nets, have reportedly performed better (Pant & Srinivasan 2005)

Naïve Best-First
- Simplest topical crawler: the frontier is a priority queue based on the text similarity between the topic and the parent page
- Variants include:
  - Giving more importance to certain HTML markup in the parent page
  - Extending the text representation of the parent page with anchor text from "grandparent" pages (SharkSearch)
  - Exploiting topical locality (co-citation)
  - Limiting the link context to less than the entire page
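A naïve best-first crawler needs a concrete similarity(topic, page) function. A minimal sketch using term-frequency cosine similarity, with extra weight on anchor text in the SharkSearch spirit; the tokenizer and the 0.7/0.3 weighting are assumptions, not values from the lecture:

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(v, w):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(v[t] * w[t] for t in v if t in w)
    norm = math.sqrt(sum(c * c for c in v.values())) * \
           math.sqrt(sum(c * c for c in w.values()))
    return dot / norm if norm else 0.0

def score_link(topic, parent_text, anchor_text):
    """Priority for an outlink: similarity(topic, parent page), with
    anchor text given extra importance (assumed 0.7/0.3 split)."""
    topic_v = term_vector(topic)
    return 0.7 * cosine(topic_v, term_vector(parent_text)) + \
           0.3 * cosine(topic_v, term_vector(anchor_text))

print(score_link("web crawler frontier",
                 "A crawler maintains a frontier of URLs to visit.",
                 "crawler basics"))
```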

Link Context Based on Text Neighborhood
- Consider a fixed-size window, e.g. 50 words around the anchor
- Can weigh links based on their distance from topic keywords within the document (InfoSpiders, Clever)
- Anchor text deserves extra importance

Exploration vs. Exploitation
- Best-N-First (or BFSN): rather than re-sorting the frontier every time you add links, be lazy and sort only every N pages visited
- Empirically, being less greedy helps crawler performance significantly: escape local topical traps by exploring more

Crawler Ethics and Conflicts
- Crawlers can cause trouble, even unwillingly, if not properly designed to be polite and ethical
- For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  - The server administrator and users will be upset
  - The crawler developer/admin IP address may be blacklisted

Crawler Etiquette
- Identify yourself
  - Use the User-Agent HTTP header to identify the crawler
  - Use the From HTTP header to specify the crawler developer's email
  - Do not disguise the crawler as a browser by using a browser's User-Agent string
- Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem

More Crawler Etiquette
- Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g. redirection loops and spider traps
- Spread the load; do not overwhelm a server
- No more than some maximum number of requests to any single server per unit time, say < 1/second

And Even More...
- Honor the Robot Exclusion Protocol
- A server can specify which parts of its document tree any crawler is or is not allowed to crawl by a file named robots.txt placed in the HTTP root directory, e.g. http://www.dsv.su.se/robots.txt
- A crawler should always check, parse, and obey this file before sending any requests to a server
- More info at: http://www.robotstxt.org/wc/exclusion.html
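Most of the etiquette above reduces to a few lines of code. A minimal sketch using Python's standard urllib.robotparser; the User-Agent string, contact address, and the one-request-per-second fallback delay are illustrative assumptions:

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleCrawler/0.1"       # identify yourself (assumed name)
CONTACT = "crawler-admin@example.org"   # developer contact (assumed address)

_robots = {}  # per-server cache of parsed robots.txt files

def polite_fetch(url):
    """Fetch url only if robots.txt allows it, honoring any Crawl-delay."""
    root = "/".join(url.split("/", 3)[:3])  # scheme://host[:port]
    rp = _robots.get(root)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
        rp.read()                           # fetch and parse robots.txt once
        _robots[root] = rp
    if not rp.can_fetch(USER_AGENT, url):
        return None                         # disallowed: obey and skip
    # Honor an explicit Crawl-delay; otherwise stay below 1 request/second.
    time.sleep(rp.crawl_delay(USER_AGENT) or 1.0)
    req = urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT, "From": CONTACT})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        print(f"{url}: HTTP {exc.code}")    # use the error code to diagnose
        return None

polite_fetch("http://www.dsv.su.se/")
```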

Example: http://www.apple.com/robots.txt

  # robots.txt for http://www.apple.com/
  User-agent: *

All crawlers can go anywhere!

And Two More...

  # Robots.txt file for http://www.microsoft.com
  User-agent: *
  Disallow: /canada/library/mnp/2/aspx/
  Disallow: /communities/bin.aspx
  Disallow: /communities/eventdetails.mspx
  Disallow: /communities/blogs/portalresults.mspx
  Disallow: /communities/rss.aspx
  Disallow: /downloads/browse.aspx
  Disallow: /downloads/info.aspx
  Disallow: /france/formation/centres/planning.asp
  Disallow: /france/mnp_utility.mspx
  Disallow: /germany/library/images/mnp/
  Disallow: /germany/mnp_utility.mspx
  Disallow: /info/customerror.htm
  Disallow: /info/smart404.asp
  # etc.

All crawlers are not allowed in these paths.

  # Robots.txt for http://www.springer.com (fragment)
  User-agent: Googlebot
  Disallow: /chl/*
  Disallow: /uk/*
  Disallow: /italy/*
  Disallow: /france/*

  User-agent: slurp
  Crawl-delay: 2

  User-agent: MSNBot
  Crawl-delay: 2

  User-agent: scooter

  # all others
  User-agent: *
  Disallow: /

The Google crawler is allowed everywhere except the listed paths; Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down; AltaVista (scooter) has no limits; everyone else, keep off!

Misuse
- Some crawlers disguise themselves:
  - Using a false User-Agent
  - Randomizing the access frequency to look like a human with a browser
  - Example: click fraud for ads

The Server Side
- Servers can disguise themselves, too
- Cloaking: present different content based on the User-Agent, e.g. stuff keywords into the version of a page shown to the search engine crawler
- Search engines do not look kindly on this type of spamdexing, and remove from their index sites that perform such abuse

Questions?