Basic crawler operation. Crawler (AKA Spider) (AKA Robot) (AKA Bot) Basic Crawler. Downloading Web Pages. Crawling picture
- Vincent Powers
- 7 years ago
Transcription
Crawler (AKA Spider) (AKA Robot) (AKA Bot)
Some of these slides are based on the Stanford IR course slides.

Basic crawler operation
- Begin with known "seed" URLs
- Fetch and parse them
- Extract the URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat

Basic Crawler
- Loop over the frontier: removeURL( ), getPage( ), getLinksFromPage( )
- Will this terminate?

Downloading Web Pages
- Suppose our robot wants to download a page
- The robot sends an HTTP request to the web server; the server reads the file (Webdata/index.html) from its file system and returns it in an HTTP response
- Got it! Now, let's parse it and continue on
- Actually it is more complicated: the server may reply with a redirection

Crawling picture
- Seed pages
- URLs crawled and parsed
- URLs frontier
- Unseen Web
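The basic loop above can be sketched in a few lines of Python. This is only an illustrative sketch: the `crawl` function, the `get_links` callback (standing in for fetching and parsing a page), the `max_pages` cap, and the toy in-memory "web" are all assumptions made for the example, not part of the slides. It also hints at the answer to "Will this terminate?": without the `seen` set, the cycle a -> b -> a would keep the loop running forever.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl; get_links(url) stands in for fetch + parse."""
    frontier = deque(seeds)   # queue of URLs waiting to be fetched
    seen = set(seeds)         # every URL ever placed on the frontier
    crawled = []              # URLs in the order they were fetched
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()          # removeURL()
        crawled.append(url)
        for link in get_links(url):       # getPage() + getLinksFromPage()
            if link not in seen:          # skip URLs that were already queued
                seen.add(link)
                frontier.append(link)
    return crawled


# Hypothetical toy "web" with a cycle (a -> b -> a), used instead of real HTTP.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```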
Structure of the Web
- What does this mean for seed choice?

Simple picture: complications
- Web crawling isn't feasible with one machine
  - All of the above steps distributed
- Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Politeness: don't hit a server too often

What any crawler must do
- Be polite: respect implicit and explicit politeness considerations
  - Only crawl allowed pages
  - Respect robots.txt (more on this shortly)
- Be robust: be immune to spider traps and other malicious behavior from web servers

What any crawler should do
- Be capable of distributed operation: designed to run on multiple distributed machines
- Be scalable: designed to increase the crawl rate by adding more machines
- Performance/efficiency: permit full use of available processing and network resources
- Fetch pages of "higher quality" first
- Continuous operation: continue fetching fresh copies of a previously fetched page
- Extensible: adapt to new data formats, protocols

Updated crawling picture
- Seed pages
- URLs crawled and parsed
- Crawling thread
- URL frontier
- Unseen Web
Processing steps in crawling
- Pick a URL from the frontier (Which one?)
- Fetch the document at the URL
- Parse the URL
  - Extract links from it to other docs (URLs)
- Check if URL has content already seen
  - If not, add to indexes
- For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)

Basic crawl architecture
- Figure: the crawl pipeline, fed by the URL frontier

URL frontier
- Can include multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy

DNS (Domain Name Server)
- A lookup service on the Internet
  - Given a URL, retrieve its IP address
  - Service provided by a distributed set of servers; thus, lookup latencies can be high (even seconds)
- Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions
  - DNS caching
  - Batch DNS resolver: collects requests and sends them out together
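The DNS caching idea can be illustrated with a small wrapper around a lookup function. This is a minimal sketch under stated assumptions: the `CachingResolver` class and its `lookup` parameter are hypothetical names, entries never expire (a real cache would honor record lifetimes), and the demo uses a fake lookup function so that it runs without network access. The default lookup, `socket.gethostbyname`, is the blocking standard-library call.

```python
import socket


class CachingResolver:
    """Remember host -> IP answers so repeated lookups skip the slow DNS call."""

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup   # blocking lookup function (assumed interface)
        self._cache = {}        # host -> IP address string

    def resolve(self, host):
        if host not in self._cache:
            self._cache[host] = self._lookup(host)   # slow path: real lookup
        return self._cache[host]                      # fast path: cached answer


# Demo with a fake lookup that records how often it is called.
calls = []


def fake_lookup(host):
    calls.append(host)
    return "192.0.2.1"   # address from the documentation-only range


resolver = CachingResolver(lookup=fake_lookup)
first = resolver.resolve("example.org")
second = resolver.resolve("example.org")   # served from the cache
```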
Content seen?
- Duplication is widespread on the web
- If the page just fetched is already in the index, do not further process it
- This is verified using document fingerprints or shingles
  - Second part of this lecture

Explicit and implicit politeness
- Explicit politeness: specifications from webmasters on what portions of a site can be crawled
  - robots.txt
- Implicit politeness: even with no specification, avoid hitting any site too often

Robots.txt: Robots Exclusion Protocol
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994
- Website announces its request on what can(not) be crawled
  - For a server, create a file /robots.txt
  - This file specifies access restrictions
- Put the file robots.txt at the root directory of the server
- Crawlers are expected to follow these instructions
robots.txt Format
- A robots.txt file consists of several records
- Each record consists of a set of some crawler ids and a set of URLs these crawlers are not allowed to visit
  - User-agent lines: names of crawlers
  - Disallow lines: which URLs are not to be visited by these crawlers (agents)?

Examples

    User-agent: *
    Disallow: /

    User-agent: *
    Disallow:

    User-agent: Google
    Disallow:

    User-agent: *
    Disallow: /

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /junk/

    User-agent: BadBot
    Disallow: /

Robots Meta Tag
- A Web-page author can also publish directions for crawlers
- These are expressed by the meta tag with name "robots", inside the HTML file
- Format: <meta name="robots" content="options"/>
- Options:
  - index or noindex: index or do not index this file
  - follow or nofollow: follow or do not follow the links of this file

An Example

    <html>
    <head>
    <meta name="robots" content="noindex,follow">
    <title>...</title>
    </head>
    <body>

- How should a crawler act when it visits this page?

Revisit Meta Tag
- Web page authors may want Web applications to have an up-to-date copy of their page
- Using the revisit meta tag, page authors can give crawlers some idea of how often the page is being updated
- For example: <meta name="revisit-after" content="10 days" />

Stronger Restrictions
- It is possible for a (non-polite) crawler to ignore the restrictions imposed by robots.txt and robots meta directions
- Therefore, if one wants to ensure that automatic robots do not visit her resources, she has to use other mechanisms
  - For example, password protections
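A polite crawler can check these rules with Python's standard `urllib.robotparser` module. The sketch below feeds two of the example records to `RobotFileParser.parse` directly instead of downloading a file; the crawler name `GoodBot` and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two of the example records from the slides, combined into one file.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())   # parse in-memory lines; no network access

# BadBot is excluded from the whole site.
badbot_home = parser.can_fetch("BadBot", "http://example.com/index.html")
# Any other crawler falls under the "*" record.
goodbot_home = parser.can_fetch("GoodBot", "http://example.com/index.html")
goodbot_tmp = parser.can_fetch("GoodBot", "http://example.com/tmp/x.html")
print(badbot_home, goodbot_home, goodbot_tmp)
```

In a real crawler one would typically call `set_url` and `read` on the parser to fetch the site's own robots.txt, and cache the parsed result per host.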
Interesting Questions
- Can you be sued for copyright infringement if you don't follow the robots.txt protocol?
- Can you be sued for copyright infringement if you follow the robots.txt protocol?
- Can the site owner determine who is actually crawling their site?
- Can the site owner protect against unauthorized access by specific robots?

Parsing: URL normalization
- When a fetched document is parsed, some of the extracted links are relative URLs
- E.g., a page may contain a relative link to /wiki/wikipedia:general_disclaimer, which denotes the same resource as the absolute URL obtained by resolving it against the page's own URL
- During parsing, must normalize (expand) such relative URLs

Duplicate URL elimination
- For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
- For a continuous crawl, see details of frontier implementation

Distributing the crawler
- Run multiple crawl threads, under different processes, potentially at different nodes
  - Geographically distributed nodes
- Partition hosts being crawled into nodes
  - Hash used for partition
- How do these nodes communicate and share URLs?

Communication between nodes
- Output of the URL filter at each node is sent to the Dup URL Eliminator of the appropriate node
- Figure: a host splitter sends URLs to other nodes and receives URLs from other nodes before they reach the URL frontier
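Both URL normalization and hash-based host partitioning can be sketched with the Python standard library. The helper names `normalize` and `node_for`, the `example.org` base URL, and the choice of SHA-256 as the partitioning hash are assumptions for this illustration; the slides only say that a hash is used.

```python
import hashlib
from urllib.parse import urljoin, urlsplit


def normalize(base_url, link):
    """Expand a possibly relative link against the URL of the page it came from."""
    return urljoin(base_url, link)


def node_for(url, num_nodes):
    """Map a URL to a crawler node by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


absolute = normalize("http://example.org/wiki/Main_Page",
                     "/wiki/wikipedia:general_disclaimer")
print(absolute)   # http://example.org/wiki/wikipedia:general_disclaimer

# All URLs of one host land on the same node, so that node alone can
# enforce politeness for the host and eliminate its duplicate URLs.
same_node = node_for("http://example.org/a", 4) == node_for("http://example.org/b/c", 4)
```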
URL frontier: two main considerations
- Politeness: do not hit a web server too frequently
- Freshness: crawl some pages more often than others
  - E.g., pages (such as news sites) whose content changes often
- These goals may conflict with each other. (E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site.)

Politeness challenges
- Even if we restrict only one thread to fetch from a host, it can hit that host repeatedly
- Common heuristic: insert a time gap between successive requests to a host that is >> the time for the most recent fetch from that host

URL frontier: Mercator scheme
- Figure, top to bottom: URLs -> Prioritizer -> K front queues -> Biased front queue selector / Back queue router -> B back queues (single host on each) -> Back queue selector -> crawl thread requesting URL

Mercator URL frontier
- URLs flow in from the top into the frontier
- Front queues manage prioritization
- Back queues enforce politeness
- Each queue is FIFO

Front queues
- Prioritizer assigns to each URL an integer priority between 1 and K
  - Appends URL to corresponding queue
- Heuristics for assigning priority
  - Refresh rate sampled from previous crawls
  - Application-specific (e.g., "crawl news sites more often")
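The front-queue side of the scheme can be sketched as K FIFO queues plus a randomized, biased selector. This is a rough illustration, not Mercator's actual code: the `FrontQueues` class is hypothetical, the weighting (queue number used directly as its weight) is an arbitrary choice, and treating larger numbers as higher priority is an assumption, since the slides do not say which end of the range 1..K is "high".

```python
import random
from collections import deque


class FrontQueues:
    """K FIFO queues, one per priority level 1..K."""

    def __init__(self, k, rng=None):
        self.queues = [deque() for _ in range(k)]   # index 0 holds priority 1
        self.rng = rng or random.Random()

    def add(self, url, priority):
        """Prioritizer output: append the URL to the queue for its priority."""
        self.queues[priority - 1].append(url)

    def pull(self):
        """Biased selector: pick a non-empty queue, favoring larger priorities."""
        candidates = [i for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        weights = [i + 1 for i in candidates]       # assumed bias: weight = priority
        chosen = self.rng.choices(candidates, weights=weights, k=1)[0]
        return self.queues[chosen].popleft()        # FIFO within a queue


front = FrontQueues(3, random.Random(0))
front.add("http://example.org/old", 1)
front.add("http://example.org/news1", 3)
front.add("http://example.org/news2", 3)
pulled = [front.pull() for _ in range(3)]
```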
Biased front queue selector
- When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
- This choice can be round robin biased to queues of higher priority, or some more sophisticated variant
  - Can be randomized

Back queues
- Figure: the biased front queue selector / back queue router feeds back queues 1 to B; the back queue selector uses a heap

Back queue invariants
- Each back queue is kept non-empty while the crawl is in progress
- Each back queue only contains URLs from a single host
  - Maintain a table from hosts to back queues

Back queue heap
- One entry for each back queue
- The entry is the earliest time at which the host corresponding to the back queue can be hit again
- This earliest time is determined from
  - Last access to that host
  - Any time buffer heuristic we choose

Back queue processing
- A crawler thread seeking a URL to crawl:
  - Extracts the root of the heap
  - Fetches URL at head of corresponding back queue q (look up from table)
  - Checks if queue q is now empty; if so, pulls a URL v from front queues
    - If there's already a back queue for v's host, append v to that back queue and pull another URL from front queues, repeat
    - Else add v to q
  - When q is non-empty, create heap entry for it

Number of back queues B
- Keep all threads busy while respecting politeness
- Mercator recommendation: three times as many back queues as crawler threads
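The interplay between per-host back queues and the heap of "earliest next hit" times can be sketched as follows. This is a simplified illustration rather than the Mercator implementation: the `BackQueues` class is hypothetical, there is no fixed number B of queues and no refill from front queues (a queue is simply created when a host first appears), the politeness gap is a constant, and time is passed in explicitly so the example runs deterministically.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class BackQueues:
    """Per-host FIFO queues plus a heap keyed by earliest allowed fetch time."""

    def __init__(self, gap_seconds):
        self.gap = gap_seconds
        self.queues = {}     # host -> deque of URLs (the host-to-queue table)
        self.ready_at = {}   # host -> earliest time the host may be hit again
        self.heap = []       # (earliest_time, host), one entry per non-empty queue

    def add(self, url, now):
        host = urlsplit(url).hostname
        if host not in self.queues:
            self.queues[host] = deque()
            earliest = max(now, self.ready_at.get(host, now))
            heapq.heappush(self.heap, (earliest, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be hit at time `now`, or None."""
        if not self.heap or self.heap[0][0] > now:
            return None                          # every queued host must still wait
        _, host = heapq.heappop(self.heap)       # root of the heap
        queue = self.queues[host]
        url = queue.popleft()                    # head of that host's back queue
        self.ready_at[host] = now + self.gap     # politeness gap after this fetch
        if queue:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            del self.queues[host]                # simplified: no refill from front queues
        return url


back = BackQueues(gap_seconds=10)
for u in ["http://a.com/1", "http://a.com/2", "http://b.com/1"]:
    back.add(u, now=0)
schedule = [back.next_url(0), back.next_url(0), back.next_url(5), back.next_url(10)]
```

In the run above, the second URL of a.com is held back at time 5 and only released at time 10, once the gap has elapsed.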
Duplicate documents
- The web is full of duplicated content
- Strict duplicate detection = exact match
  - Not as common
- But many, many cases of near duplicates
  - E.g., last-modified date is the only difference between two copies of a page

Duplicate/Near-Duplicate Detection
- Duplication: exact match can be detected with fingerprints (= hash of content)
- Near-duplication: approximate match
  - Compute syntactic similarity with an edit-distance measure (too expensive!)
  - Use a similarity threshold to detect near-duplicates
    - E.g., similarity > 80% => documents are "near duplicates"
    - Not transitive, though sometimes used transitively

Detecting approximate duplicates
- Create a sketch for each page that is much smaller than the page
  - Extract a set of features for each page
  - Compare pages by comparing sets of features
- In the following, assume we have converted the page into a sequence of tokens
  - Eliminate punctuation, HTML markup, etc.

Shingling
- Given document D, a w-shingle is a contiguous subsequence of w tokens
- The w-shingling S(D,w) of D is the set of all w-shingles in D
- Example: D = (a rose is a rose is a rose)
  - The 4-shingles in order of occurrence: a rose is a / rose is a rose / is a rose is / a rose is a / rose is a rose
  - S(D,4) = {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}

Resemblance
- The resemblance of docs A and B is defined as the Jaccard coefficient of their shingle sets: $r_w(A,B) = \frac{|S(A,w) \cap S(B,w)|}{|S(A,w) \cup S(B,w)|}$
- In general, $0 \le r_w(A,B) \le 1$
  - Note $r_w(A,A) = 1$
  - But $r_w(A,B) = 1$ does not mean A and B are identical!
- (What is a good value for w?)
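Shingling and resemblance are short enough to write out directly. The function names `shingles` and `resemblance` below are illustrative choices; the code reproduces the "a rose is a rose is a rose" example and also shows a pair of different documents whose resemblance is still 1.

```python
def shingles(tokens, w):
    """Return the set of all contiguous w-token subsequences (w-shingles)."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a_tokens, b_tokens, w):
    """Jaccard coefficient of the two documents' w-shingle sets."""
    sa, sb = shingles(a_tokens, w), shingles(b_tokens, w)
    if not sa and not sb:
        return 1.0                      # two empty shingle sets: treat as identical
    return len(sa & sb) / len(sa | sb)


D = "a rose is a rose is a rose".split()
S = shingles(D, 4)                      # three distinct 4-shingles

# A longer, different document with exactly the same shingle set:
longer = "a rose is a rose is a rose is a rose".split()
r_same_set = resemblance(D, longer, 4)  # resemblance 1 although D != longer

shorter = "a rose is a rose".split()
r_partial = resemblance(D, shorter, 4)  # 2 shared shingles out of 3 in the union
```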
Sketches
- The set of all shingles is large
  - Bigger than the original document
  - We cannot compute resemblance directly on the full sets!
- We create a document sketch by sampling only a few shingles
- Requirement: sketch resemblance should be a good estimate of document resemblance
- Notation: from now on, w will be a fixed value, so we omit w from $r_w$ and $S(A,w)$
  - $r(A,B) = r_w(A,B)$ and $S(A) = S(A,w)$

Comparing Documents
- Basic structure of the document comparison process: Doc A -> shingle set A, Doc B -> shingle set B, then Jaccard on the two sets
- Note that sketches of documents that have already been seen are stored by the crawler

Choosing a sketch
- Random sampling does not work!
  - Suppose we have identical documents A, B, each with n shingles
  - Let $m_A$ be a shingle from A, chosen uniformly at random; similarly $m_B$
- For k = 1: $E[\,|\{m_A\} \cap \{m_B\}|\,] = 1/n$
  - But $r(A,B) = 1$
  - So the sketch overlap is an underestimate

Choosing a sketch
- Assume the set of all possible shingles is totally ordered
- Define:
  - $m_A$ is the "smallest" shingle in A
  - $m_B$ is the "smallest" shingle in B
  - $r'(A,B) = |\{m_A\} \cap \{m_B\}|$
- For identical documents A and B: $r'(A,B) = 1 = r(A,B)$

Problems
- With a sketch of size 1, there is still a high probability of error
  - Especially if we choose the minimum shingle, e.g., in alphabetical order
- The size of each shingle is still large
  - E.g., if w = 7, about 40-50 bytes per shingle
  - 200 shingles = 8K-10K

Sketch of a document
- Create a "sketch vector" (of size ~200) for each document
  - Documents that share at least t (say 80%) corresponding vector elements are deemed near duplicates
- For doc D, $sketch_D[i]$ is as follows:
  - Let f map all shingles in the universe to the integer range $0..2^m - 1$ (e.g., f = fingerprinting)
  - Let $\pi_i$ be a random permutation on $0..2^m - 1$
  - $sketch_D[i] = \min \{\pi_i(f(s)) \mid s \text{ is a shingle in } D\}$
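The sketch-vector construction can be approximated without materializing random permutations by using one salted hash function per position, which stands in for the pair (fingerprint f, permutation number i). That substitution, the use of SHA-256, and the helper names `fingerprint`, `sketch`, and `estimate` are assumptions of this illustration, not what the slides prescribe.

```python
import hashlib


def fingerprint(shingle, i):
    """64-bit hash of a shingle, salted with the position index i."""
    data = f"{i}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def sketch(shingle_set, size=200):
    """sketch[i] is the minimum salted fingerprint over all shingles."""
    return [min(fingerprint(s, i) for s in shingle_set) for i in range(size)]


def estimate(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Hypothetical shingle sets with true Jaccard coefficient 200 / 400 = 0.5.
set_a = {(str(n),) for n in range(0, 300)}
set_b = {(str(n),) for n in range(100, 400)}
set_c = {(str(n),) for n in range(1000, 1100)}   # disjoint from set_a

est_ab = estimate(sketch(set_a), sketch(set_b))  # should be close to 0.5
est_aa = estimate(sketch(set_a), sketch(set_a))  # identical sets agree everywhere
est_ac = estimate(sketch(set_a), sketch(set_c))  # disjoint sets should not agree
```

With 200 positions the estimate has a standard deviation of roughly 0.035 around the true value 0.5, which is why a threshold such as 80% works reasonably in practice.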
Computing Sketch[i] for Doc1
- Start from Document 1
- Compute shingles of Doc1
- Take fingerprints f(shingles)
- Permute on the number line with $\pi_i$
- Pick the min value

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
- Let A be the min value for Document 1 and B the min value for Document 2: are these equal?
- Test for 200 random permutations: $\pi_1, \pi_2, \ldots, \pi_{200}$

However
- A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
  - Why?
- Claim: this happens with probability Size_of_intersection / Size_of_union

Set Similarity of sets Ci, Cj
- $\mathrm{Jaccard}(C_i,C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$
- View sets as columns of a matrix A; one row for each element in the universe. $a_{ij} = 1$ indicates presence of item i in set j
- Example: for two columns C1, C2 with 2 rows in which both are 1 and 5 rows in which at least one is 1, Jaccard(C1,C2) = 2/5 = 0.4

Key Observation
- For columns Ci, Cj, there are four types of rows:

      Type  Ci  Cj
      A     1   1
      B     1   0
      C     0   1
      D     0   0

- Overload notation: A = # of rows of type A (likewise B, C, D)
- Claim: $\mathrm{Jaccard}(C_i,C_j) = \frac{A}{A+B+C}$

Min Hashing
- Randomly permute rows
- Hash h(Ci) = index of first row with 1 in column Ci
- Surprising property: $P[h(C_i) = h(C_j)] = \mathrm{Jaccard}(C_i,C_j)$
- Why?
  - Both are A/(A+B+C)
  - Look down columns Ci, Cj until the first non-type-D row
  - h(Ci) = h(Cj) iff that row is a type A row
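For a small matrix the "surprising property" can be checked exactly by enumerating every row permutation. The two columns below are hypothetical, chosen only so that A = 2, B = 1, C = 2, D = 1 and hence the Jaccard coefficient is 2/5; the function names are likewise illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical columns over a universe of six rows.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]


def jaccard(ci, cj):
    """A / (A + B + C): rows where both are 1 over rows where at least one is 1."""
    both = sum(1 for x, y in zip(ci, cj) if x and y)
    either = sum(1 for x, y in zip(ci, cj) if x or y)
    return Fraction(both, either)


def min_hash(col, order):
    """Index of the first row with a 1, where order[k] is the row at position k."""
    return next(k for k, row in enumerate(order) if col[row])


def collision_probability(ci, cj):
    """Exact P[h(ci) = h(cj)] over all row permutations (small inputs only)."""
    orders = list(permutations(range(len(ci))))
    hits = sum(min_hash(ci, o) == min_hash(cj, o) for o in orders)
    return Fraction(hits, len(orders))


j = jaccard(C1, C2)
p = collision_probability(C1, C2)
print(j, p)
```

Both values come out as the same fraction because, under a uniformly random permutation, the first non-type-D row is equally likely to be any of the A + B + C such rows, and the hashes agree exactly when it is a type A row.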
Similar Documents
- Remember: we test with 200 random permutations
- Each single test matches with probability Jaccard(S(D1),S(D2)); so, if all 200 sketch elements are required to match, the probability that D1 and D2 are declared similar is $\mathrm{Jaccard}(S(D_1),S(D_2))^{200}$

Computing Random Permutations: Solution
- Compute 200 different fingerprint(s) for each shingle
  - For example, use the MD5 hashing function over shingle%i, where 1 <= i <= 200
- For each fingerprint function, choose the lexicographically first shingle in each document
- Near-duplication likelihood is computed by checking how many of the 200 smallest shingles are common to both documents

Solution (cont.)
- Observe that this gives us:
  - An efficient way to compute random orders of shingles
  - A significant reduction in the size of shingles stored (40 bits is usually enough to keep estimates reasonably accurate)
WEB SEARCH BASICS, CRAWLING AND INDEXING. Slides by Manning, Raghavan, Schutze
WEB SEARCH BASICS, CRAWLING AND INDEXING 1 Brief (non technical) history Early keyword based engines ca. 1995 1997 Altavista, Excite, Infoseek, Inktomi, Lycos Paid search ranking: Goto (morphed into Overture.com
More informationSo today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
More information20 Web crawling and indexes
DRAFT! April 1, 2009 Cambridge University Press. Feedback welcome. 443 20 Web crawling and indexes 20.1 Overview WEB CRAWLER SPIDER Web crawling is the process by which we gather pages from the Web, in
More informationCorso di Biblioteche Digitali
Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto
More informationCommunicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots.
Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots.txt file format A component of the ACAP Technical Framework Implementation
More informationWebsite Audit Reports
Website Audit Reports Here are our Website Audit Reports Packages designed to help your business succeed further. Hover over the question marks to get a quick description. You may also download this as
More informationWebsite Standards Association. Business Website Search Engine Optimization
Website Standards Association Business Website Search Engine Optimization Copyright 2008 Website Standards Association Page 1 1. FOREWORD...3 2. PURPOSE AND SCOPE...4 2.1. PURPOSE...4 2.2. SCOPE...4 2.3.
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationComputer Networks - CS132/EECS148 - Spring 2013 ------------------------------------------------------------------------------
Computer Networks - CS132/EECS148 - Spring 2013 Instructor: Karim El Defrawy Assignment 2 Deadline : April 25 th 9:30pm (hard and soft copies required) ------------------------------------------------------------------------------
More informationContent Filtering Client Policy & Reporting Administrator s Guide
Content Filtering Client Policy & Reporting Administrator s Guide Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make better use of your system. CAUTION: A CAUTION
More informationSRC Technical Note. Syntactic Clustering of the Web
SRC Technical Note 1997-015 July 25, 1997 Syntactic Clustering of the Web Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig Systems Research Center 130 Lytton Avenue Palo Alto, CA 94301
More informationLecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at
Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at distributing load b. QUESTION: What is the context? i. How
More informationCluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.
Architectures Cluster Computing Job Parallelism Request Parallelism 2 2010 VMware Inc. All rights reserved Replication Stateless vs. Stateful! Fault tolerance High availability despite failures If one
More informationINTERNET DOMAIN NAME SYSTEM
INTERNET DOMAIN NAME SYSTEM http://www.tutorialspoint.com/internet_technologies/internet_domain_name_system.htm Copyright tutorialspoint.com Overview When DNS was not into existence, one had to download
More informationYandex: Webmaster Tools Overview and Guidelines
Yandex: Webmaster Tools Overview and Guidelines Agenda Introduction Register Features and Tools 2 Introduction What is Yandex Yandex is the leading search engine in Russia. It has nearly 60% market share
More information1. Comments on reviews a. Need to avoid just summarizing web page asks you for:
1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of
More informationEmail Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
More informationIRLbot: Scaling to 6 Billion Pages and Beyond
IRLbot: Scaling to 6 Billion Pages and Beyond Presented by Xiaoming Wang Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov Internet Research Lab Computer Science Department Texas A&M University
More information80+ Things Every Marketer Needs to Know About Their Website
80+ Things Every Marketer Needs to Know About Their Website A Marketer s Guide to Improving Website Performance No website can avoid clutter building up over time. Website clutter buildup can negatively
More informationSearch Engine Optimization - From Automatic Repetitive Steps To Subtle Site Development
Narkevičius. Search engine optimization. 3 Search Engine Optimization - From Automatic Repetitive Steps To Subtle Site Development Robertas Narkevičius a Vilnius Business College, Kalvariju street 125,
More informationWeb Data Extraction: 1 o Semestre 2007/2008
Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008
More informationDr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)
HIDDEN WEB EXTRACTOR DYNAMIC WAY TO UNCOVER THE DEEP WEB DR. ANURADHA YMCA,CSE, YMCA University Faridabad, Haryana 121006,India anuangra@yahoo.com http://www.ymcaust.ac.in BABITA AHUJA MRCE, IT, MDU University
More informationSearch Engine Optimization (SEO): Improving Website Ranking
Search Engine Optimization (SEO): Improving Website Ranking Chandrani Nath #1, Dr. Laxmi Ahuja *2 # 1 *2 Amity University, Noida Abstract: - As web popularity increases day by day, millions of people use
More informationNaming vs. Locating Entities
Naming vs. Locating Entities Till now: resources with fixed locations (hierarchical, caching,...) Problem: some entity may change its location frequently Simple solution: record aliases for the new address
More informationCommunications and Networking
Communications and Networking History and Background telephone system local area networks Internet architecture: what the pieces are and how they fit together names and addresses: what's your name and
More informationScalable Internet Services and Load Balancing
Scalable Services and Load Balancing Kai Shen Services brings ubiquitous connection based applications/services accessible to online users through Applications can be designed and launched quickly and
More informationSecure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda
Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda 1. Introductions for new members (5 minutes) 2. Name of group 3. Current
More informationCoveo Platform 7.0. Oracle Knowledge Connector Guide
Coveo Platform 7.0 Oracle Knowledge Connector Guide Notice The content in this document represents the current view of Coveo as of the date of publication. Because Coveo continually responds to changing
More informationOverlay Networks. Slides adopted from Prof. Böszörményi, Distributed Systems, Summer 2004.
Overlay Networks An overlay is a logical network on top of the physical network Routing Overlays The simplest kind of overlay Virtual Private Networks (VPN), supported by the routers If no router support
More informationChapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationFrontera: open source, large scale web crawling framework. Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com
Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com Sziasztok résztvevők! Born in Yekaterinburg, RU 5 years at Yandex, search quality
More informationOptimizing Web Servers Using Page Rank Prefetching for Clustered Accesses.
Optimizing Web Servers Using Page Rank Prefetching for Clustered Accesses. by VICTOR Y SAFRONOV A thesis submitted to the Graduate School-New Brunswick Rutgers, The State University of New Jersey in partial
More informationSEO WEBSITE MIGRATION CHECKLIST
SEO WEBSITE MIGRATION CHECKLIST A COMPREHENSIVE GUIDE FOR A SUCCESSFUL WEBSITE RE-LAUNCH A SITE MIGRATION IS A SIGNIFICANT PROJECT FROM A SEARCH MARKETING PERSPECTIVE AND AS SUCH SHOULD BE CAREFULLY CONSIDERED
More information30 Website Audit Report. 6 Website Audit Report. 18 Website Audit Report. 12 Website Audit Report. Package Name 3
TalkRite Communications, LLC Keene, NH (603) 499-4600 Winchendon, MA (978) 213-4200 info@talkrite.com Website Audit Report TRC Website Audit Report Packages are designed to help your business succeed further.
More informationEE 7376: Introduction to Computer Networks. Homework #3: Network Security, Email, Web, DNS, and Network Management. Maximum Points: 60
EE 7376: Introduction to Computer Networks Homework #3: Network Security, Email, Web, DNS, and Network Management Maximum Points: 60 1. Network security attacks that have to do with eavesdropping on, or
More informationConcepts of digital forensics
Chapter 3 Concepts of digital forensics Digital forensics is a branch of forensic science concerned with the use of digital information (produced, stored and transmitted by computers) as source of evidence
More informationBaidu: Webmaster Tools Overview and Guidelines
Baidu: Webmaster Tools Overview and Guidelines Agenda Introduction Register Data Submission Domain Transfer Monitor Web Analytics Mobile 2 Introduction What is Baidu Baidu is the leading search engine
More informationWebSphere Commerce V7 Feature Pack 3
WebSphere Commerce V7 Feature Pack 3 Search engine optimization 2011 IBM Corporation This presentation provides an overview of the search engine optimization enhancements in WebSphere Commerce Version
More informationCleaning Encrypted Traffic
Optenet Documentation Cleaning Encrypted Traffic Troubleshooting Guide iii Version History Doc Version Product Date Summary of Changes V6 OST-6.4.300 01/02/2015 English editing Optenet Documentation
More informationLife of a Packet CS 640, 2015-01-22
Life of a Packet CS 640, 2015-01-22 Outline Recap: building blocks Application to application communication Process to process communication Host to host communication Announcements Syllabus Should have
More informationWEB DEVELOPMENT IA & IB (893 & 894)
DESCRIPTION Web Development is a course designed to guide students in a project-based environment in the development of up-to-date concepts and skills that are used in the development of today s websites.
More informationIntroduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.
Guest lecturer: David Hovemeyer November 15, 2004 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds
More informationInternet Protocol Address
SFWR 4C03: Computer Networks & Computer Security Jan 17-21, 2005 Lecturer: Kartik Krishnan Lecture 7-9 Internet Protocol Address Addressing is a critical component of the internet abstraction. To give
More informationDifferences at a glance
Differences at a glance In the past, you might ve used the consumer (such as Microsoft Office 2013) version of Microsoft Excel outside of work. Now that you re using Google Apps for Work, you ll find many
More informationThe Ultimate Guide to Magento SEO Part 1: Basic website setup
The Ultimate Guide to Magento SEO Part 1: Basic website setup Jason Millward http://www.jasonmillward.com jason@jasonmillward.com Published November 2014 All rights reserved. No part of this publication
More informationBrief (non-technical) history
Sanda Harabagiu Lecture 10: Web search basics Brief (non-technical) history Early keyword-based engines ca. 1995-1997 Altavista, Excite, Infoseek, Inktomi, Lycos Paid searchranking: Goto (morphed into
More informationScalable Internet Services and Load Balancing
Scalable Services and Load Balancing Kai Shen Services brings ubiquitous connection based applications/services accessible to online users through Applications can be designed and launched quickly and
More informationD. SamKnows Methodology 20 Each deployed Whitebox performs the following tests: Primary measure(s)
v. Test Node Selection Having a geographically diverse set of test nodes would be of little use if the Whiteboxes running the test did not have a suitable mechanism to determine which node was the best
More informationarchitecture: what the pieces are and how they fit together names and addresses: what's your name and number?
Communications and networking history and background telephone system local area networks Internet architecture: what the pieces are and how they fit together names and addresses: what's your name and
More informationMark E. Pruzansky MD. Local SEO Action Plan for. About your Local SEO Action Plan. Technical SEO. 301 Redirects. XML Sitemap. Robots.
Local SEO Action Plan for Mark E. Pruzansky MD Action Plan generated on 5 May 2013 About your Local SEO Action Plan This report contains a number of recommendations for correcting the issues and taking
More informationYou can probably work with decimal. binary numbers needed by the. Working with binary numbers is time- consuming & error-prone.
IP Addressing & Subnetting Made Easy Working with IP Addresses Introduction You can probably work with decimal numbers much easier than with the binary numbers needed by the computer. Working with binary
More informationUnderstanding Slow Start
Chapter 1 Load Balancing 57 Understanding Slow Start When you configure a NetScaler to use a metric-based LB method such as Least Connections, Least Response Time, Least Bandwidth, Least Packets, or Custom
More information(Refer Slide Time: 02:17)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #06 IP Subnetting and Addressing (Not audible: (00:46)) Now,
More informationDNS Basics. DNS Basics
DNS Basics 1 A quick introduction to the Domain Name System (DNS). Shows the basic purpose of DNS, hierarchy of domain names, and an example of how the DNS protocol is used. There are many details of DNS
More informationHypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
More informationQlik REST Connector Installation and User Guide
Qlik REST Connector Installation and User Guide Qlik REST Connector Version 1.0 Newton, Massachusetts, November 2015 Authored by QlikTech International AB Copyright QlikTech International AB 2015, All
More informationFile Management. Chapter 12
Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution
More informationCOMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012
Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about
More informationHTTP. Internet Engineering. Fall 2015. Bahador Bakhshi CE & IT Department, Amirkabir University of Technology
HTTP Internet Engineering Fall 2015 Bahador Bakhshi CE & IT Department, Amirkabir University of Technology Questions Q1) How do web server and client browser talk to each other? Q1.1) What is the common
More informationOptimization of Distributed Crawler under Hadoop
MATEC Web of Conferences 22, 0202 9 ( 2015) DOI: 10.1051/ matecconf/ 2015220202 9 C Owned by the authors, published by EDP Sciences, 2015 Optimization of Distributed Crawler under Hadoop Xiaochen Zhang*
More informationSEO Tutorial PDF for Beginners
CONTENT Page 1. SEO Tutorial 1: SEO Introduction... 2 2. SEO Tutorial 2: On-Page Optimization. 3-4 3. SEO Tutorial 3: On-Page Optimization. 5-6 4. SEO Tutorial 3.1: Directory Submission List. 7-16 5. SEO
More informationGPFS and Remote Shell
GPFS and Remote Shell Yuri Volobuev GPFS Development Ver. 1.1, January 2015. Abstract The use of a remote shell command (e.g. ssh) by GPFS is one of the most frequently misunderstood aspects of GPFS administration,
More informationSimple Solution for a Location Service. Naming vs. Locating Entities. Forwarding Pointers (2) Forwarding Pointers (1)
Naming vs. Locating Entities Till now: resources with fixed locations (hierarchical, caching,...) Problem: some entity may change its location frequently Simple solution: record aliases for the new address
2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system
1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don't multiply the variables
Chapter 12 File Management. Roadmap
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access
Index. AdWords, 182 AJAX Cart, 129 Attribution, 174
Index A AdWords, 182 AJAX Cart, 129 Attribution, 174 B BigQuery, Big Data Analysis create reports, 238 GA-BigQuery integration, 238 GA data, 241 hierarchy structure, 238 query language (see also Data selection,
Chapter 12 File Management
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access
2 TCP-like Design. Answer
Homework 3 1 DNS Suppose you have a Host C, a local name server L, and authoritative name servers A root, A com, and A google.com, where the naming convention A x means that the name server knows about
I. Middleboxes No Longer Considered Harmful II. A Layered Naming Architecture for the Internet
I. Middleboxes No Longer Considered Harmful II. A Layered Naming Architecture for the Internet Seminar in Distributed Computing Louis Woods / 14.11.2007 Intermediaries NATs (NAPTs), firewalls and other
Lightweight DNS for Multipurpose and Multifunctional Devices
IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.12, December 2013 71 Lightweight DNS for Multipurpose and Multifunctional Devices Yogesh A. Wagh 1, Prashant A. Dhangar
How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations
CorreLog SQL Table Monitor Adapter Users Manual http://www.correlog.com mailto:info@correlog.com CorreLog, SQL Table Monitor Users Manual Copyright 2008-2015, CorreLog, Inc. All rights reserved. No part
Crawl Proxy Installation and Configuration Guide
Crawl Proxy Installation and Configuration Guide Google Enterprise EMEA Google Search Appliance is able to natively crawl secure content coming from multiple sources using for instance the following main
SEO BEST PRACTICES IPRODUCTION. SEO Best Practices. November 2013 V3. 1 iproduction
IPRODUCTION SEO Best Practices November 2013 V3 1 iproduction Contents 1 Introduction... 4 2 Sitemap and Robots.txt... 4 2.1 What is a Sitemap?... 4 2.2 HTML Sitemaps... 4 2.3 XML Sitemaps... 5 2.4 Robots.txt
The Domain Name System (DNS) Jason Hermance Nerces Kazandjian Long-Quan Nguyen
The Domain Name System (DNS) Jason Hermance Nerces Kazandjian Long-Quan Nguyen Introduction Machines find 32-bit IP addresses just peachy. Some Computer Science majors don't seem to mind either Normal
Lecture 15. IP address space managed by Internet Assigned Numbers Authority (IANA)
Lecture 15 IP Address Each host and router on the Internet has an IP address, which consist of a combination of network number and host number. The combination is unique; no two machines have the same
Geo Targeting Server location, country-targeting, language declarations & hreflang
SEO Audit Checklist - TECHNICAL - Accessibility & Crawling Indexing DNS Make sure your domain name server is configured properly 404s Proper header responses and minimal reported errors Redirects Use 301s
Webapps Vulnerability Report
Tuesday, May 1, 2012 Webapps Vulnerability Report Introduction This report provides detailed information of every vulnerability that was found and successfully exploited by CORE Impact Professional during
UNMASKCONTENT: THE CASE STUDY
DIGITONTO LLC. UNMASKCONTENT: THE CASE STUDY The mystery UnmaskContent.com v1.0 Contents I. CASE 1: Malware Alert... 2 a. Scenario... 2 b. Data Collection... 2 c. Data Aggregation... 3 d. Data Enumeration...
Integrity Checking and Monitoring of Files on the CASTOR Disk Servers
Integrity Checking and Monitoring of Files on the CASTOR Disk Servers Author: Hallgeir Lien CERN openlab 17/8/2011 Contents CONTENTS 1 Introduction 4 1.1 Background...........................................
www.apacheviewer.com Apache Logs Viewer Manual
Apache Logs Viewer Manual Table of Contents 1. Introduction... 3 2. Installation... 3 3. Using Apache Logs Viewer... 4 3.1 Log Files... 4 3.1.1 Open Access Log File... 5 3.1.2 Open Remote Access Log File
Security vulnerabilities in the Internet and possible solutions
Security vulnerabilities in the Internet and possible solutions 1. Introduction The foundation of today's Internet is the TCP/IP protocol suite. Since the time when these specifications were finished in
LARGE-SCALE INTERNET MEASUREMENTS FOR DIAGNOSTICS AND PUBLIC POLICY. Henning Schulzrinne (+ Walter Johnston & James Miller) FCC & Columbia University
1 LARGE-SCALE INTERNET MEASUREMENTS FOR DIAGNOSTICS AND PUBLIC POLICY Henning Schulzrinne (+ Walter Johnston & James Miller) FCC & Columbia University 2 Overview Quick overview What does MBA measure? Can
Internet Content Distribution
Internet Content Distribution Chapter 4: Content Distribution Networks (TUD Student Use Only) Chapter Outline Basics of content distribution networks (CDN) Why CDN? How do they work? Client redirection
Partitioning under the hood in MySQL 5.5
Partitioning under the hood in MySQL 5.5 Mattias Jonsson, Partitioning developer Mikael Ronström, Partitioning author Who are we? Mikael is a founder of the technology behind NDB
Troubleshooting / FAQ
Troubleshooting / FAQ Routers / Firewalls I can't connect to my server from outside of my internal network. The server's IP is 10.0.1.23, but I can't use that IP from a friend's computer. How do I get
Exam 1 Review Questions
CSE 473 Introduction to Computer Networks Exam 1 Review Questions Jon Turner 10/2013 1. A user in St. Louis, connected to the internet via a 20 Mb/s (b=bits) connection retrieves a 250 KB (B=bytes) web
VISA SECURITY ALERT December 2015 KUHOOK POINT OF SALE MALWARE. Summary. Distribution and Installation
VISA SECURITY ALERT December 2015 KUHOOK POINT OF SALE MALWARE Distribution: Merchants, Acquirers Who should read this: Information security, incident response, cyber intelligence staff Summary Kuhook
Pigeonhole Principle Solutions
Pigeonhole Principle Solutions 1. Show that if we take n + 1 numbers from the set {1, 2,..., 2n}, then some pair of numbers will have no factors in common. Solution: Note that consecutive numbers (such
Chapter-1 : Introduction 1 CHAPTER - 1. Introduction
Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet
Pizza SEO: Effective Web. Effective Web Audit. Effective Web Audit. Copyright 2007+ Pizza SEO Ltd. info@pizzaseo.com http://pizzaseo.
1 Table of Contents 1 (X)HTML Code / CSS Code 1.1 Valid code 1.2 Layout 1.3 CSS & JavaScript 1.4 TITLE element 1.5 META Description element 1.6 Structure of pages 2 Structure of URL addresses 2.1 Friendly
Sitemap. Component for Joomla! This manual documents version 3.15.x of the Joomla! extension. http://www.aimy-extensions.com/joomla/sitemap.
Sitemap Component for Joomla! This manual documents version 3.15.x of the Joomla! extension. http://www.aimy-extensions.com/joomla/sitemap.html Contents 1 Introduction 3 2 Sitemap Features 3 3 Technical
kalmstrom.com Business Solutions
HelpDesk OSP User Manual Content 1 INTRODUCTION... 3 2 REQUIREMENTS... 4 3 THE SHAREPOINT SITE... 4 4 THE HELPDESK OSP TICKET... 5 5 INSTALLATION OF HELPDESK OSP... 7 5.1 INTRODUCTION... 7 5.2 PROCESS...
Fiber Channel Over Ethernet (FCoE)
Fiber Channel Over Ethernet (FCoE) Using Intel Ethernet Switch Family White Paper November, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR
CS514: Intermediate Course in Computer Systems
CS514: Intermediate Course in Computer Systems Lecture 7: Sept. 19, 2003 Load Balancing Options Sources Lots of graphics and product description courtesy F5 website (www.f5.com) I believe F5 is market leader
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING
International Journal on Web Service Computing (IJWSC), Vol.3, No.3, September 2012 AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING Md. Faizan
PRODUCTION PLANNING AND SCHEDULING Part 1
PRODUCTION PLANNING AND SCHEDULING Part Andrew Kusiak 9 Seamans Center Iowa City, Iowa - 7 Tel: 9-9 Fax: 9-669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Forecasting Planning Hierarchy
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
WEBSITE PENETRATION VIA SEARCH
WEBSITE PENETRATION VIA SEARCH Azam Zia Muhammad Ayaz Email: azazi022@student.liu.se, muhay664@student.liu.se Supervisor: Juha Takkinen, juhta@ida.liu.se Project Report for Information Security Course
Coveo Platform 7.0. Microsoft Dynamics CRM Connector Guide
Coveo Platform 7.0 Microsoft Dynamics CRM Connector Guide Notice The content in this document represents the current view of Coveo as of the date of publication. Because Coveo continually responds to changing