Basic crawler operation. Crawler (AKA Spider) (AKA Robot) (AKA Bot) Basic Crawler. Downloading Web Pages. Crawling picture

Size: px
Start display at page:

Download "Basic crawler operation. Crawler (AKA Spider) (AKA Robot) (AKA Bot) Basic Crawler. Downloading Web Pages. Crawling picture"

Transcription

1 Basic crawler operation Sec Crawler (AKA Spider) (AKA Robot) (AKA Bot) Begin with known seed s and parse them Extract s they point to Place the extracted s on a queue each on the queue and repeat Some of these slides are based on Stanford IR Course slides at Basic Crawler Downloading Web Pages Frontier Will this terminate? remove( ) getpage( ) Suppose our robot wants to download: getlinksfrompage( ) 3 4 HTTP Request Web Server HTTP Response Webdata/index.html File System Got it! Now, lets parse it and continue on Actually more complicated; Server replies with redirection 5 Web s crawled and parsed Seed pages Crawling picture s frontier Unseen Web Sec

2 Structure of the Web Simple picture complications Sec What does this mean for seed choice? 7 Web crawling isn t feasible with one machine All of the above steps distributed Malicious pages Spam pages Spider traps dynamically generated Even non-malicious pages pose challenges Latency/bandwidth to remote servers vary Webmasters stipulations How deep should you crawl a site s hierarchy? Site mirrors and duplicate pages Politeness don t hit a server too often 8 What any crawler must do Be Polite: Respect implicit and explicit politeness considerations Only crawl allowed pages Sec What any crawler should do Be capable of distributed operation: designed to run on multiple distributed machines Sec Respect.txt (more on this shortly) Be Robust: Be immune to spider traps and other malicious behavior from web servers Be scalable: designed to increase the crawl rate by adding more machines Performance/efficiency: permit full use of available processing and network resources 9 10 What any crawler should do Sec Updated crawling picture Sec pages of higher quality first Continuous operation: Continue fetching fresh copies of a previously fetched page Extensible: Adapt to new data formats, protocols 11 s crawled and parsed Seed Pages Crawling thread frontier Unseen Web 12 2

3 Processing steps in crawling Basic crawl architecture Pick a from the frontier the document at the the Extract links from it to other docs (s) Check if has content already seen If not, add to indexes For each extracted Ensure it passes certain tests Which one? E.g., only crawl.edu, obey.txt, etc. Check if it is already in the frontier (duplicate ination) 13 Frontier 14 Basic crawl architecture frontier Sec Can include multiple pages from the same host Must avoid trying to fetch them all at the same time Must try to keep all crawling threads busy Frontier Basic crawl architecture (Domain Name Server) Sec A lookup service on the internet Given a, retrieve its IP address Service provided by a distributed of servers thus, lookup latencies can be high (even seconds) Common OS implementations of lookup are blocking: only one outstanding request at a time Solutions caching Frontier Batch resolver collects requests and sends them 17 out together 18 3

4 Basic crawl architecture lication is widespread on the web If the page just fetched is already in the index, do not further process it This is verified using document fingerprints or shingles Second part of this lecture Frontier Basic crawl architecture Explicit and implicit politeness Sec Explicit politeness: specifications from webmasters on what portions of site can be crawled.txt Implicit politeness: even with no specification, avoid hitting any site too often Frontier Robots.txt Robots Exclusion Protocol Protocol for giving spiders ( ) limited access to a website, originally from Website announces its request on what can(not) be crawled For a server, create a file /.txt This file specifies access restrictions Put the file.txt at the root directory of the server Crawlers are expected to follow these instructions

5 .txt Format A.txt file consists of several records Each record consists of a of some crawler id s and a of s these crawlers are not allowed to visit User-agent lines: Names of crawlers Disallowed lines: Which s are not to be visited by these crawlers (agents)? User-agent: * Disallow: / User-agent: * Disallow: User-agent: Google Disallow: User-agent: * Disallow: / Examples User-agent: * Disallow: /cgi-bin/ Disallow: /tmp/ Disallow: /junk/ User-agent: BadBot Disallow: / Robots Meta Tag Robots Meta Tag A Web-page author can also publish directions for crawlers These are expressed by the meta tag with name, inside the HTML file Format: <meta name="" content="options"/> Options: index or noindex: index or do not index this file follow or nofollow: follow or do not follow the links of this file 27 <html> <head> <meta name="" content="noindex,follow"> <title>...</title> </head> <body> An Example: How should a crawler act when it visits this page? 28 Revisit Meta Tag Stronger Restrictions Web page authors may want Web applications to have an up-to-date copy of their page Using the revisit meta tag, page authors can give crawlers some idea of how often the page is being updated For example: <meta name="revisit-after" content="10 days" /> 29 It is possible for a (non-polite) crawler to ignore the restrictions imposed by.txt and meta directions Therefore, if one wants to ensure that automatic do not visit her resources, she has to use other mechanisms For example, password protections 30 5

6 Interesting Questions Basic crawl architecture Can you be sued for copyright infringement if you don t follow the.txt protocol? Can you be sued for copyright infringement if you follow the.txt protocol? Can the site owner determine who is actually crawling their site? Can the site owner protect against unauthorized access by specific? Frontier Parsing: normalization licate ination When a fetched document is parsed, some of the extracted links are relative s E.g., has a relative link to /wiki/wikipedia:general_disclaimer which is the same as the absolute aimer For a non-continuous (one-shot) crawl, test to see if an extracted+ed has already been passed to the frontier For a continuous crawl see details of frontier implementation During parsing, must normalize (expand) such relative s Distributing the crawler Communication between nodes Run multiple crawl threads, under different processes potentially at different nodes Output of the at each node is sent to the Eliminator of the appropriate node Geographically distributed nodes Partition hosts being crawled into nodes To other nodes Hash used for partition How do these nodes communicate and share s? 35 Frontier Host splitter From other nodes 36 6

7 frontier: two main considerations Politeness challenges Politeness: do not hit a web server too frequently Freshness: crawl some pages more often than others E.g., pages (such as News sites) whose content changes often These goals may conflict each other. (E.g., simple priority queue fails many links out of a page go to its own site, creating a burst of accesses to that site.) 37 Even if we restrict only one thread to fetch from a host, can hit it repeatedly Common heuristic: insert time gap between successive requests to a host that is >> time for most recent fetch from that host 38 frontier: Mercator scheme Mercator frontier s Prioritizer K front queues Biased front queue selector Back queue router B back queues Single host on each Back queue selector s flow in from the top into the frontier Front queues manage prioritization Back queues enforce politeness Each queue is FIFO Crawl thread requesting Front queues Front queues Prioritizer 1 K Prioritizer assigns to an integer priority between 1 and K Appends to corresponding queue Biased front queue selector Back queue router 41 Heuristics for assigning priority Refresh rate sampled from previous crawls Application-specific (e.g., crawl news sites more often ) 42 7

8 Biased front queue selector Back queues When a back queue requests a (in a sequence to be described): picks a front Biased front queue selector Back queue router queue from which to pull a This choice can be round robin biased to 1 B queues of higher priority, or some more sophisticated variant Can be randomized 43 Back queue selector Heap 44 Back queue invariants Back queue heap Each back queue is kept non-empty while the crawl is in progress Each back queue only contains s from a single host Maintain a table from hosts to back queues One entry for each back queue The entry is the earliest time at which the host corresponding to the back queue can be hit again This earliest time is determined from Last access to that host Any time buffer heuristic we choose Back queue processing A crawler thread seeking a to crawl: Extracts the root of the heap es at head of corresponding back queue q (look up from table) Checks if queue q is now empty if so, pulls a v from front queues If there s already a back queue for v s host, append v to q and pull another from front queues, repeat Else add v to q When q is non-empty, create heap entry for it 47 Number of back queues B Keep all threads busy while respecting politeness Mercator recommendation: three times as many back queues as crawler threads 48 8

9 licate documents Near licate ument Detection The web is full of duplicated content Strict duplicate detection = exact match Not as common But many, many cases of near duplicates E.g., Last modified date the only difference between two copies of a page 49 licate/near-licate Detection lication: Exact match can be detected with fingerprints (= Hash of content) Near-lication: Approximate match Compute syntactic similarity with an editdistance measure Use similarity threshold to detect near-duplicates E.g., Similarity > 80% => uments are near duplicates Too Expensive!! Not transitive though sometimes used transitively Detecting approximate duplicates Create a sketch for each page that is much smaller than the page Extract a of features for each page Compare pages by comparing s of features In the following, assume: we have converted page into a sequence of tokens Eliminate punctuation, HTML markup, etc 52 Shingling Given document D, a w-shingle is a contiguous subsequence of w tokens The w-shingling S(D,w) of D, is the of all w- shingles in D Example: D=(a rose is a rose is a rose) a rose is a rose is a rose is a rose is a rose is a rose is a rose Resemblance The resemblance of docs A and B is defined S(D,4) = {(a,rose,is,a),(rose,is,a,rose), (is,a,rose,is)} as: In general, 0 r w (A,B) 1 Note r w (A,A) = 1 But r w (A,B)=1 does not mean A and B are identical! (What is a good value for w?) Remember this? 9

10 Sketches Comparing uments Set of all shingles is large Bigger than the original document We cannot compute resemblance! We create a document sketch by sampling only a few shingles Requirement Sketch resemblance should be a good estimate of document resemblance Notation: From now on, w will be a fixed value, so we omit w from r w and S(A,w) r(a,b)=r w (A,B) and S(A) = S(A,w) 55 Basic structure of document comparison process Note that sketches of documents that have already been seen are stored by crawler Shingle A Jaccard Shingle B 56 Choosing a sketch Random sampling does not work! Suppose we have identical documents A, B each with n shingles Let m A be a shingle from A, chosen uniformly at random; similarly m B For k=1: E[{m A } {m B } ] = 1/n But r(a,b) = 1 So the sketch overlap is an underestimate Choosing a sketch Assume the of all possible shingles is totally ordered Define: m A is the smallest shingle in A m B is the smallest shingle in B r (A,B) = {m A } {m B } For identical documents A & B r (A,B) = 1 = r(a,b) Problems With a sketch of size 1, there is still a high probability of error Especially if we choose minimum shingle, e.g., in alphabetical order Sketch of a document Create a sketch vector (of size ~200) for each document uments that share t (say 80%) corresponding vector elements are deemed near duplicates For doc D, sketchd[ i ] is as follows: Size of each shingle is still large e.g., if w = 7, about bytes 200 shingles = 8K-10K 59 Let f map all shingles in the universe to 0.. (e.g., f = fingerprinting) Let i be a random permutation on 0.. sketchd[i] = min { (f(s)) s shingle in D} 10

11 Computing Sketch[i] for 1 ument 1 Test if 1.Sketch[i] = 2.Sketch[i] ument 1 ument 2 Repeat 200 times! Compute shingles of 1 Take fingerprints f(shingles) Permute on the number line with i Pick the min value A B Are these equal? Test for 200 random permutations: 1, 2, 200 However ument 1 ument 2 A A = B iff the shingle with the MIN value in the union of 1 and 2 is common to both (i.e., lies in the intersection) Why? Claim: This happens with probability Size_of_intersection / Size_of_union B Set Similarity of s Ci, Cj View s as columns of a matrix A; one row for each element in the universe. aij = 1 indicates presence of item i in j Example Ci C j Jaccard(C i,c j) C C C 1 C 2 i Jaccard(C 1,C 2 ) = 2/5 = j Key Observation Min Hashing For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0 Overload notation: A = # of rows of type A Claim A Jaccard(C i,c j) A B C Randomly permute rows Hash h(ci) = index of first row with 1 in column Ci Surprising Property Why? P h(c ) h(c ) Jaccard C, C i Both are A/(A+B+C) Look down columns Ci, Cj until first non-type-d row h(ci) = h(cj) iff type A row j i j 11

12 Similar uments Remember: We test with 200 random permutations So, probability that D1 and D2 are declared similar is Jaccard(S(D1),S(D2)) 200 Computing Random Permutations Compute 200 different fingerprint(s) for each shingle For example, use MD5 hashing function over shingle%i, where 1<=i<=200 For each fingerprint function, choose lexicographically first shingle in each document Near duplication likelihood is computed by checking how many of the 200 smallest shingles are common to both documents Solution (cont) Observe that this gives us: An efficient way to compute random orders of shingles Significant reduction in size of shingles stored (40 bits is usually enough to keep estimates reasonably accurate) 69 12

WEB SEARCH BASICS, CRAWLING AND INDEXING. Slides by Manning, Raghavan, Schutze

WEB SEARCH BASICS, CRAWLING AND INDEXING. Slides by Manning, Raghavan, Schutze WEB SEARCH BASICS, CRAWLING AND INDEXING 1 Brief (non technical) history Early keyword based engines ca. 1995 1997 Altavista, Excite, Infoseek, Inktomi, Lycos Paid search ranking: Goto (morphed into Overture.com

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

20 Web crawling and indexes

20 Web crawling and indexes DRAFT! April 1, 2009 Cambridge University Press. Feedback welcome. 443 20 Web crawling and indexes 20.1 Overview WEB CRAWLER SPIDER Web crawling is the process by which we gather pages from the Web, in

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots.

Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots. Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots.txt file format A component of the ACAP Technical Framework Implementation

More information

Website Audit Reports

Website Audit Reports Website Audit Reports Here are our Website Audit Reports Packages designed to help your business succeed further. Hover over the question marks to get a quick description. You may also download this as

More information

Website Standards Association. Business Website Search Engine Optimization

Website Standards Association. Business Website Search Engine Optimization Website Standards Association Business Website Search Engine Optimization Copyright 2008 Website Standards Association Page 1 1. FOREWORD...3 2. PURPOSE AND SCOPE...4 2.1. PURPOSE...4 2.2. SCOPE...4 2.3.

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

Computer Networks - CS132/EECS148 - Spring 2013 ------------------------------------------------------------------------------

Computer Networks - CS132/EECS148 - Spring 2013 ------------------------------------------------------------------------------ Computer Networks - CS132/EECS148 - Spring 2013 Instructor: Karim El Defrawy Assignment 2 Deadline : April 25 th 9:30pm (hard and soft copies required) ------------------------------------------------------------------------------

More information

Content Filtering Client Policy & Reporting Administrator s Guide

Content Filtering Client Policy & Reporting Administrator s Guide Content Filtering Client Policy & Reporting Administrator s Guide Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make better use of your system. CAUTION: A CAUTION

More information

SRC Technical Note. Syntactic Clustering of the Web

SRC Technical Note. Syntactic Clustering of the Web SRC Technical Note 1997-015 July 25, 1997 Syntactic Clustering of the Web Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig Systems Research Center 130 Lytton Avenue Palo Alto, CA 94301

More information

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at distributing load b. QUESTION: What is the context? i. How

More information

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful. Architectures Cluster Computing Job Parallelism Request Parallelism 2 2010 VMware Inc. All rights reserved Replication Stateless vs. Stateful! Fault tolerance High availability despite failures If one

More information

INTERNET DOMAIN NAME SYSTEM

INTERNET DOMAIN NAME SYSTEM INTERNET DOMAIN NAME SYSTEM http://www.tutorialspoint.com/internet_technologies/internet_domain_name_system.htm Copyright tutorialspoint.com Overview When DNS was not into existence, one had to download

More information

Yandex: Webmaster Tools Overview and Guidelines

Yandex: Webmaster Tools Overview and Guidelines Yandex: Webmaster Tools Overview and Guidelines Agenda Introduction Register Features and Tools 2 Introduction What is Yandex Yandex is the leading search engine in Russia. It has nearly 60% market share

More information

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

1. Comments on reviews a. Need to avoid just summarizing web page asks you for: 1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

IRLbot: Scaling to 6 Billion Pages and Beyond

IRLbot: Scaling to 6 Billion Pages and Beyond IRLbot: Scaling to 6 Billion Pages and Beyond Presented by Xiaoming Wang Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov Internet Research Lab Computer Science Department Texas A&M University

More information

80+ Things Every Marketer Needs to Know About Their Website

80+ Things Every Marketer Needs to Know About Their Website 80+ Things Every Marketer Needs to Know About Their Website A Marketer s Guide to Improving Website Performance No website can avoid clutter building up over time. Website clutter buildup can negatively

More information

Search Engine Optimization - From Automatic Repetitive Steps To Subtle Site Development

Search Engine Optimization - From Automatic Repetitive Steps To Subtle Site Development Narkevičius. Search engine optimization. 3 Search Engine Optimization - From Automatic Repetitive Steps To Subtle Site Development Robertas Narkevičius a Vilnius Business College, Kalvariju street 125,

More information

Web Data Extraction: 1 o Semestre 2007/2008

Web Data Extraction: 1 o Semestre 2007/2008 Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008

More information

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE) HIDDEN WEB EXTRACTOR DYNAMIC WAY TO UNCOVER THE DEEP WEB DR. ANURADHA YMCA,CSE, YMCA University Faridabad, Haryana 121006,India anuangra@yahoo.com http://www.ymcaust.ac.in BABITA AHUJA MRCE, IT, MDU University

More information

Search Engine Optimization (SEO): Improving Website Ranking

Search Engine Optimization (SEO): Improving Website Ranking Search Engine Optimization (SEO): Improving Website Ranking Chandrani Nath #1, Dr. Laxmi Ahuja *2 # 1 *2 Amity University, Noida Abstract: - As web popularity increases day by day, millions of people use

More information

Naming vs. Locating Entities

Naming vs. Locating Entities Naming vs. Locating Entities Till now: resources with fixed locations (hierarchical, caching,...) Problem: some entity may change its location frequently Simple solution: record aliases for the new address

More information

Communications and Networking

Communications and Networking Communications and Networking History and Background telephone system local area networks Internet architecture: what the pieces are and how they fit together names and addresses: what's your name and

More information

Scalable Internet Services and Load Balancing

Scalable Internet Services and Load Balancing Scalable Services and Load Balancing Kai Shen Services brings ubiquitous connection based applications/services accessible to online users through Applications can be designed and launched quickly and

More information

Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda

Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda Secure Web Application Coding Team Introductory Meeting December 1, 2005 1:00 2:00PM Bits & Pieces Room, Sansom West Room 306 Agenda 1. Introductions for new members (5 minutes) 2. Name of group 3. Current

More information

Coveo Platform 7.0. Oracle Knowledge Connector Guide

Coveo Platform 7.0. Oracle Knowledge Connector Guide Coveo Platform 7.0 Oracle Knowledge Connector Guide Notice The content in this document represents the current view of Coveo as of the date of publication. Because Coveo continually responds to changing

More information

Overlay Networks. Slides adopted from Prof. Böszörményi, Distributed Systems, Summer 2004.

Overlay Networks. Slides adopted from Prof. Böszörményi, Distributed Systems, Summer 2004. Overlay Networks An overlay is a logical network on top of the physical network Routing Overlays The simplest kind of overlay Virtual Private Networks (VPN), supported by the routers If no router support

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Frontera: open source, large scale web crawling framework. Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com

Frontera: open source, large scale web crawling framework. Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com Sziasztok résztvevők! Born in Yekaterinburg, RU 5 years at Yandex, search quality

More information

Optimizing Web Servers Using Page Rank Prefetching for Clustered Accesses.

Optimizing Web Servers Using Page Rank Prefetching for Clustered Accesses. Optimizing Web Servers Using Page Rank Prefetching for Clustered Accesses. by VICTOR Y SAFRONOV A thesis submitted to the Graduate School-New Brunswick Rutgers, The State University of New Jersey in partial

More information

SEO WEBSITE MIGRATION CHECKLIST

SEO WEBSITE MIGRATION CHECKLIST SEO WEBSITE MIGRATION CHECKLIST A COMPREHENSIVE GUIDE FOR A SUCCESSFUL WEBSITE RE-LAUNCH A SITE MIGRATION IS A SIGNIFICANT PROJECT FROM A SEARCH MARKETING PERSPECTIVE AND AS SUCH SHOULD BE CAREFULLY CONSIDERED

More information

30 Website Audit Report. 6 Website Audit Report. 18 Website Audit Report. 12 Website Audit Report. Package Name 3

30 Website Audit Report. 6 Website Audit Report. 18 Website Audit Report. 12 Website Audit Report. Package Name 3 TalkRite Communications, LLC Keene, NH (603) 499-4600 Winchendon, MA (978) 213-4200 info@talkrite.com Website Audit Report TRC Website Audit Report Packages are designed to help your business succeed further.

More information

EE 7376: Introduction to Computer Networks. Homework #3: Network Security, Email, Web, DNS, and Network Management. Maximum Points: 60

EE 7376: Introduction to Computer Networks. Homework #3: Network Security, Email, Web, DNS, and Network Management. Maximum Points: 60 EE 7376: Introduction to Computer Networks Homework #3: Network Security, Email, Web, DNS, and Network Management Maximum Points: 60 1. Network security attacks that have to do with eavesdropping on, or

More information

Concepts of digital forensics

Concepts of digital forensics Chapter 3 Concepts of digital forensics Digital forensics is a branch of forensic science concerned with the use of digital information (produced, stored and transmitted by computers) as source of evidence

More information

Baidu: Webmaster Tools Overview and Guidelines

Baidu: Webmaster Tools Overview and Guidelines Baidu: Webmaster Tools Overview and Guidelines Agenda Introduction Register Data Submission Domain Transfer Monitor Web Analytics Mobile 2 Introduction What is Baidu Baidu is the leading search engine

More information

WebSphere Commerce V7 Feature Pack 3

WebSphere Commerce V7 Feature Pack 3 WebSphere Commerce V7 Feature Pack 3 Search engine optimization 2011 IBM Corporation This presentation provides an overview of the search engine optimization enhancements in WebSphere Commerce Version

More information

Cleaning Encrypted Traffic

Cleaning Encrypted Traffic Optenet Documentation Cleaning Encrypted Traffic Troubleshooting Guide iii Version History Doc Version Product Date Summary of Changes V6 OST-6.4.300 01/02/2015 English editing Optenet Documentation

More information

Life of a Packet CS 640, 2015-01-22

Life of a Packet CS 640, 2015-01-22 Life of a Packet CS 640, 2015-01-22 Outline Recap: building blocks Application to application communication Process to process communication Host to host communication Announcements Syllabus Should have

More information

WEB DEVELOPMENT IA & IB (893 & 894)

WEB DEVELOPMENT IA & IB (893 & 894) DESCRIPTION Web Development is a course designed to guide students in a project-based environment in the development of up-to-date concepts and skills that are used in the development of today s websites.

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer. Guest lecturer: David Hovemeyer November 15, 2004 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds

More information

Internet Protocol Address

Internet Protocol Address SFWR 4C03: Computer Networks & Computer Security Jan 17-21, 2005 Lecturer: Kartik Krishnan Lecture 7-9 Internet Protocol Address Addressing is a critical component of the internet abstraction. To give

More information

Differences at a glance

Differences at a glance Differences at a glance In the past, you might ve used the consumer (such as Microsoft Office 2013) version of Microsoft Excel outside of work. Now that you re using Google Apps for Work, you ll find many

More information

The Ultimate Guide to Magento SEO Part 1: Basic website setup

The Ultimate Guide to Magento SEO Part 1: Basic website setup The Ultimate Guide to Magento SEO Part 1: Basic website setup Jason Millward http://www.jasonmillward.com jason@jasonmillward.com Published November 2014 All rights reserved. No part of this publication

More information

Brief (non-technical) history

Brief (non-technical) history Sanda Harabagiu Lecture 10: Web search basics Brief (non-technical) history Early keyword-based engines ca. 1995-1997 Altavista, Excite, Infoseek, Inktomi, Lycos Paid searchranking: Goto (morphed into

More information

Scalable Internet Services and Load Balancing

Scalable Internet Services and Load Balancing Scalable Services and Load Balancing Kai Shen Services brings ubiquitous connection based applications/services accessible to online users through Applications can be designed and launched quickly and

More information

D. SamKnows Methodology 20 Each deployed Whitebox performs the following tests: Primary measure(s)

D. SamKnows Methodology 20 Each deployed Whitebox performs the following tests: Primary measure(s) v. Test Node Selection Having a geographically diverse set of test nodes would be of little use if the Whiteboxes running the test did not have a suitable mechanism to determine which node was the best

More information

architecture: what the pieces are and how they fit together names and addresses: what's your name and number?

architecture: what the pieces are and how they fit together names and addresses: what's your name and number? Communications and networking history and background telephone system local area networks Internet architecture: what the pieces are and how they fit together names and addresses: what's your name and

More information

Mark E. Pruzansky MD. Local SEO Action Plan for. About your Local SEO Action Plan. Technical SEO. 301 Redirects. XML Sitemap. Robots.

Mark E. Pruzansky MD. Local SEO Action Plan for. About your Local SEO Action Plan. Technical SEO. 301 Redirects. XML Sitemap. Robots. Local SEO Action Plan for Mark E. Pruzansky MD Action Plan generated on 5 May 2013 About your Local SEO Action Plan This report contains a number of recommendations for correcting the issues and taking

More information

You can probably work with decimal. binary numbers needed by the. Working with binary numbers is time- consuming & error-prone.

You can probably work with decimal. binary numbers needed by the. Working with binary numbers is time- consuming & error-prone. IP Addressing & Subnetting Made Easy Working with IP Addresses Introduction You can probably work with decimal numbers much easier than with the binary numbers needed by the computer. Working with binary

More information

Understanding Slow Start

Understanding Slow Start Chapter 1 Load Balancing 57 Understanding Slow Start When you configure a NetScaler to use a metric-based LB method such as Least Connections, Least Response Time, Least Bandwidth, Least Packets, or Custom

More information

(Refer Slide Time: 02:17)

(Refer Slide Time: 02:17) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #06 IP Subnetting and Addressing (Not audible: (00:46)) Now,

More information

DNS Basics. DNS Basics

DNS Basics. DNS Basics DNS Basics 1 A quick introduction to the Domain Name System (DNS). Shows the basic purpose of DNS, hierarchy of domain names, and an example of how the DNS protocol is used. There are many details of DNS

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Qlik REST Connector Installation and User Guide

Qlik REST Connector Installation and User Guide Qlik REST Connector Installation and User Guide Qlik REST Connector Version 1.0 Newton, Massachusetts, November 2015 Authored by QlikTech International AB Copyright QlikTech International AB 2015, All

More information

File Management. Chapter 12

File Management. Chapter 12 Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution

More information

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012 Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about

More information

HTTP. Internet Engineering. Fall 2015. Bahador Bakhshi CE & IT Department, Amirkabir University of Technology

HTTP. Internet Engineering. Fall 2015. Bahador Bakhshi CE & IT Department, Amirkabir University of Technology HTTP Internet Engineering Fall 2015 Bahador Bakhshi CE & IT Department, Amirkabir University of Technology Questions Q1) How do web server and client browser talk to each other? Q1.1) What is the common

More information

Optimization of Distributed Crawler under Hadoop

Optimization of Distributed Crawler under Hadoop MATEC Web of Conferences 22, 0202 9 ( 2015) DOI: 10.1051/ matecconf/ 2015220202 9 C Owned by the authors, published by EDP Sciences, 2015 Optimization of Distributed Crawler under Hadoop Xiaochen Zhang*

More information

SEO Tutorial PDF for Beginners

SEO Tutorial PDF for Beginners CONTENT Page 1. SEO Tutorial 1: SEO Introduction... 2 2. SEO Tutorial 2: On-Page Optimization. 3-4 3. SEO Tutorial 3: On-Page Optimization. 5-6 4. SEO Tutorial 3.1: Directory Submission List. 7-16 5. SEO

More information

GPFS and Remote Shell

GPFS and Remote Shell GPFS and Remote Shell Yuri Volobuev GPFS Development Ver. 1.1, January 2015. Abstract The use of a remote shell command (e.g. ssh) by GPFS is one of the most frequently misunderstood aspects of GPFS administration,

More information

Simple Solution for a Location Service. Naming vs. Locating Entities. Forwarding Pointers (2) Forwarding Pointers (1)

Simple Solution for a Location Service. Naming vs. Locating Entities. Forwarding Pointers (2) Forwarding Pointers (1) Naming vs. Locating Entities Till now: resources with fixed locations (hierarchical, caching,...) Problem: some entity may change its location frequently Simple solution: record aliases for the new address

More information

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system 1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

More information

Chapter 12 File Management. Roadmap

Chapter 12 File Management. Roadmap Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access

More information

Index. AdWords, 182 AJAX Cart, 129 Attribution, 174

Index. AdWords, 182 AJAX Cart, 129 Attribution, 174 Index A AdWords, 182 AJAX Cart, 129 Attribution, 174 B BigQuery, Big Data Analysis create reports, 238 GA-BigQuery integration, 238 GA data, 241 hierarchy structure, 238 query language (see also Data selection,

More information

Chapter 12 File Management

Chapter 12 File Management Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access

More information

2 TCP-like Design. Answer

2 TCP-like Design. Answer Homework 3 1 DNS Suppose you have a Host C, a local name server L, and authoritative name servers A root, A com, and A google.com, where the naming convention A x means that the name server knows about

More information

I. Middleboxes No Longer Considered Harmful II. A Layered Naming Architecture for the Internet

I. Middleboxes No Longer Considered Harmful II. A Layered Naming Architecture for the Internet I. Middleboxes No Longer Considered Harmful II. A Layered Naming Architecture for the Internet Seminar in Distributed Computing Louis Woods / 14.11.2007 Intermediaries NATs (NAPTs), firewalls and other

More information

Lightweight DNS for Multipurpose and Multifunctional Devices

Lightweight DNS for Multipurpose and Multifunctional Devices IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.12, December 2013 71 Lightweight DNS for Multipurpose and Multifunctional Devices Yogesh A. Wagh 1, Prashant A. Dhangar

More information

How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations

How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations orrelog SQL Table Monitor Adapter Users Manual http://www.correlog.com mailto:info@correlog.com CorreLog, SQL Table Monitor Users Manual Copyright 2008-2015, CorreLog, Inc. All rights reserved. No part

More information

Crawl Proxy Installation and Configuration Guide

Crawl Proxy Installation and Configuration Guide Crawl Proxy Installation and Configuration Guide Google Enterprise EMEA Google Search Appliance is able to natively crawl secure content coming from multiple sources using for instance the following main

More information

SEO BEST PRACTICES IPRODUCTION. SEO Best Practices. November 2013 V3. 1 iproduction

SEO BEST PRACTICES IPRODUCTION. SEO Best Practices. November 2013 V3. 1 iproduction IPRODUCTION SEO Best Practices November 2013 V3 1 iproduction Contents 1 Introduction... 4 2 Sitemap and Robots.txt... 4 2.1 What is a Sitemap?... 4 2.2 HTML Sitemaps... 4 2.3 XML Sitemaps... 5 2.4 Robots.txt

More information

The Domain Name System (DNS) Jason Hermance Nerces Kazandjian Long-Quan Nguyen

The Domain Name System (DNS) Jason Hermance Nerces Kazandjian Long-Quan Nguyen The Domain Name System (DNS) Jason Hermance Nerces Kazandjian Long-Quan Nguyen Introduction Machines find 32-bit IP addresses just peachy. Some Computer Science majors don t seem to mind either Normal

More information

Lecture 15. IP address space managed by Internet Assigned Numbers Authority (IANA)

Lecture 15. IP address space managed by Internet Assigned Numbers Authority (IANA) Lecture 15 IP Address Each host and router on the Internet has an IP address, which consist of a combination of network number and host number. The combination is unique; no two machines have the same

More information

Geo Targeting Server location, country-targeting, language declarations & hreflang

Geo Targeting Server location, country-targeting, language declarations & hreflang SEO Audit Checklist - TECHNICAL - Accessibility & Crawling Indexing DNS Make sure your domain name server is configured properly 404s Proper header responses and minimal reported errors Redirects Use 301s

More information

Webapps Vulnerability Report

Webapps Vulnerability Report Tuesday, May 1, 2012 Webapps Vulnerability Report Introduction This report provides detailed information of every vulnerability that was found and successfully exploited by CORE Impact Professional during

More information

UNMASKCONTENT: THE CASE STUDY

UNMASKCONTENT: THE CASE STUDY DIGITONTO LLC. UNMASKCONTENT: THE CASE STUDY The mystery UnmaskContent.com v1.0 Contents I. CASE 1: Malware Alert... 2 a. Scenario... 2 b. Data Collection... 2 c. Data Aggregation... 3 d. Data Enumeration...

More information

Integrity Checking and Monitoring of Files on the CASTOR Disk Servers

Integrity Checking and Monitoring of Files on the CASTOR Disk Servers Integrity Checking and Monitoring of Files on the CASTOR Disk Servers Author: Hallgeir Lien CERN openlab 17/8/2011 Contents CONTENTS 1 Introduction 4 1.1 Background...........................................

More information

www.apacheviewer.com Apache Logs Viewer Manual

www.apacheviewer.com Apache Logs Viewer Manual Apache Logs Viewer Manual Table of Contents 1. Introduction... 3 2. Installation... 3 3. Using Apache Logs Viewer... 4 3.1 Log Files... 4 3.1.1 Open Access Log File... 5 3.1.2 Open Remote Access Log File

More information

Security vulnerabilities in the Internet and possible solutions

Security vulnerabilities in the Internet and possible solutions Security vulnerabilities in the Internet and possible solutions 1. Introduction The foundation of today's Internet is the TCP/IP protocol suite. Since the time when these specifications were finished in

More information

LARGE-SCALE INTERNET MEASUREMENTS FOR DIAGNOSTICS AND PUBLIC POLICY. Henning Schulzrinne (+ Walter Johnston & James Miller) FCC & Columbia University

LARGE-SCALE INTERNET MEASUREMENTS FOR DIAGNOSTICS AND PUBLIC POLICY. Henning Schulzrinne (+ Walter Johnston & James Miller) FCC & Columbia University 1 LARGE-SCALE INTERNET MEASUREMENTS FOR DIAGNOSTICS AND PUBLIC POLICY Henning Schulzrinne (+ Walter Johnston & James Miller) FCC & Columbia University 2 Overview Quick overview What does MBA measure? Can

More information

Internet Content Distribution

Internet Content Distribution Internet Content Distribution Chapter 4: Content Distribution Networks (TUD Student Use Only) Chapter Outline Basics of content distribution networks (CDN) Why CDN? How do they work? Client redirection

More information

Partitioning under the hood in MySQL 5.5

Partitioning under the hood in MySQL 5.5 Partitioning under the hood in MySQL 5.5 Mattias Jonsson, Partitioning developer Mikael Ronström, Partitioning author Who are we? Mikael is a founder of the technology behind NDB

More information

Troubleshooting / FAQ

Troubleshooting / FAQ Troubleshooting / FAQ Routers / Firewalls I can't connect to my server from outside of my internal network. The server's IP is 10.0.1.23, but I can't use that IP from a friend's computer. How do I get

More information

Exam 1 Review Questions

Exam 1 Review Questions CSE 473 Introduction to Computer Networks Exam 1 Review Questions Jon Turner 10/2013 1. A user in St. Louis, connected to the internet via a 20 Mb/s (b=bits) connection retrieves a 250 KB (B=bytes) web

More information

VISA SECURITY ALERT December 2015 KUHOOK POINT OF SALE MALWARE. Summary. Distribution and Installation

VISA SECURITY ALERT December 2015 KUHOOK POINT OF SALE MALWARE. Summary. Distribution and Installation VISA SECURITY ALERT December 2015 KUHOOK POINT OF SALE MALWARE Distribution: Merchants, Acquirers Who should read this: Information security, incident response, cyber intelligence staff Summary Kuhook

More information

Pigeonhole Principle Solutions

Pigeonhole Principle Solutions Pigeonhole Principle Solutions 1. Show that if we take n + 1 numbers from the set {1, 2,..., 2n}, then some pair of numbers will have no factors in common. Solution: Note that consecutive numbers (such

More information

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet

More information

Pizza SEO: Effective Web. Effective Web Audit. Effective Web Audit. Copyright 2007+ Pizza SEO Ltd. info@pizzaseo.com http://pizzaseo.

Pizza SEO: Effective Web. Effective Web Audit. Effective Web Audit. Copyright 2007+ Pizza SEO Ltd. info@pizzaseo.com http://pizzaseo. 1 Table of Contents 1 (X)HTML Code / CSS Code 1.1 Valid code 1.2 Layout 1.3 CSS & JavaScript 1.4 TITLE element 1.5 META Description element 1.6 Structure of pages 2 Structure of URL addresses 2.1 Friendly

More information

Sitemap. Component for Joomla! This manual documents version 3.15.x of the Joomla! extension. http://www.aimy-extensions.com/joomla/sitemap.

Sitemap. Component for Joomla! This manual documents version 3.15.x of the Joomla! extension. http://www.aimy-extensions.com/joomla/sitemap. Sitemap Component for Joomla! This manual documents version 3.15.x of the Joomla! extension. http://www.aimy-extensions.com/joomla/sitemap.html Contents 1 Introduction 3 2 Sitemap Features 3 3 Technical

More information

kalmstrom.com Business Solutions

kalmstrom.com Business Solutions HelpDesk OSP User Manual Content 1 INTRODUCTION... 3 2 REQUIREMENTS... 4 3 THE SHAREPOINT SITE... 4 4 THE HELPDESK OSP TICKET... 5 5 INSTALLATION OF HELPDESK OSP... 7 5.1 INTRODUCTION... 7 5.2 PROCESS...

More information

Fiber Channel Over Ethernet (FCoE)

Fiber Channel Over Ethernet (FCoE) Fiber Channel Over Ethernet (FCoE) Using Intel Ethernet Switch Family White Paper November, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR

More information

CS514: Intermediate Course in Computer Systems

CS514: Intermediate Course in Computer Systems : Intermediate Course in Computer Systems Lecture 7: Sept. 19, 2003 Load Balancing Options Sources Lots of graphics and product description courtesy F5 website (www.f5.com) I believe F5 is market leader

More information

AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING

AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING International Journal on Web Service Computing (IJWSC), Vol.3, No.3, September 2012 AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN SPECIFIC AND INCREMENTAL CRAWLING Md. Faizan

More information

PRODUCTION PLANNING AND SCHEDULING Part 1

PRODUCTION PLANNING AND SCHEDULING Part 1 PRODUCTION PLANNING AND SCHEDULING Part Andrew Kusiak 9 Seamans Center Iowa City, Iowa - 7 Tel: 9-9 Fax: 9-669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Forecasting Planning Hierarchy

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

WEBSITE PENETRATION VIA SEARCH

WEBSITE PENETRATION VIA SEARCH WEBSITE PENETRATION VIA SEARCH Azam Zia Muhammad Ayaz Email: azazi022@student.liu.se, muhay664@student.liu.se Supervisor: Juha Takkinen, juhta@ida.liu.se Project Report for Information Security Course

More information

Coveo Platform 7.0. Microsoft Dynamics CRM Connector Guide

Coveo Platform 7.0. Microsoft Dynamics CRM Connector Guide Coveo Platform 7.0 Microsoft Dynamics CRM Connector Guide Notice The content in this document represents the current view of Coveo as of the date of publication. Because Coveo continually responds to changing

More information