Building a Peer-to-Peer, domain specific web crawler

Tushar Bansal (cc.gatech.edu), Ling Liu (cc.gatech.edu)

ABSTRACT
The introduction of the crawler in the mid 90s opened the floodgates for research in various application domains. Many attempts to create an ideal crawler failed due to the explosive nature of the web. In this paper, we describe the building blocks of PeerCrawl, a Peer-to-Peer web crawler. The crawler can be used for generic crawling, is easily scalable, and can be run on a grid of day-to-day use computers. We also demonstrate and implement a novel scheme for coordinating peers so that they behave as a focused crawler. We cover the issues faced while building this crawler and the decisions taken to overcome them.

Keywords
Web Crawler/Spider, Peer-to-Peer, Bloom Filter, Threads

1. INTRODUCTION
The web has expanded beyond all expectations to give rise to a chaotic monster of information in varied forms of media. Many systems depend heavily on retrieving such information, for uses ranging from critical medical prescriptions to simple fun facts. The common thread amongst all these applications is a web crawler, whose job is to gather data connected via hyperlinks. Since the description of the anatomy of Google's search engine[1], many attempts have been made to explain the composition and working of a crawler. As most centralized web crawlers suffer from obvious disadvantages, fine-tuning their performance by adding more correlated components is only a short-term fix.

The problems encountered by generic crawlers have motivated research on various other types of web crawlers. SPHINX[2] was one of the first attempts at a domain-specific web crawler. The definition of a domain has ranged from a web domain and a topic (focused crawling[3]) to a type of document media (images, PDF, etc.). There has also been a lot of work on incorporating techniques from Artificial Intelligence[4,5] to improve the quality of crawling. Finally, the introduction of distributed systems has produced a new breed of high-performance crawlers. Shkapenyuk and Suel[6] and Boldi et al.[8] give detailed descriptions of distributed web crawlers, while Cho[7] gives an overview of parallel crawlers.

As web crawlers have been a point of interest for so many years, there is a push to generalize the architecture of a crawler. Mercator[10] did an excellent job of highlighting the problems faced during the construction of a crawler while proposing a generic crawler structure. With the aid of Mercator[10] and other systems like [6,11], we have not only various models for building a crawler but also a catalog of the common problems faced.

This paper focuses on applying the advantages of Peer-to-Peer (P2P) systems to web crawling. The functionality of PeerCrawl can be extended from a scalable, generic crawler to a web-domain-specific crawler. It can thus be used to study the structure of a particular website, and further to compute source-specific page rank[9] for efficient crawling. P2P systems have demonstrated their scalability and versatility through numerous applications. Their main advantage is that they can be deployed on a grid of normal day-to-day use computers, which makes them a lucrative platform for an exhaustive task like web crawling.

The rest of the paper is organized as follows. In Section 2 we look at the system architecture of a single peer. In Section 3 we study the coordination of the various peers. In Section 4 we point out the different issues involved in building a crawler.
Finally, we discuss implementation details and experimental results in Section 5.

2. SYSTEM ARCHITECTURE
We design our crawler based on Mercator[10], which gives a good overview of the essential components of a crawler. It also justifies the presence of each component against the different issues faced while creating a crawler, along with some performance measures. The main difference in our crawler, however, is the presence of a network layer used to communicate between the different peers. Let us look at the architecture of PeerCrawl in more detail.

Each peer in PeerCrawl is broken into many threads of control, which perform tasks ranging from fetching a document from a web server to maintaining bookkeeping information. All these threads are autonomous, but they may share data structures with other threads, leading to various synchronization issues. Note that all the threads and data structures described below belong to a single peer.

[Figure 1. Workflow of a single node in PeerCrawl]

2.1 Data Structures
First, let us discuss the data structures that the different threads operate on.

CrawlJobs: List of URLs which are fetched and processed by the different threads.

FetchBuffer: List of documents (web pages, PDFs, DOC files, etc.) that are processed and cached by some of the threads.

UrlInfo: Map of Bloom filters, one per domain, used to detect duplicate URLs that have already been crawled.
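As an illustration, the following JAVA sketch shows how UrlInfo could be realized: one Bloom filter per domain, queried and updated in a single step. The filter size, the number of probes, and the double-hashing scheme are assumptions made for this sketch, not PeerCrawl's actual parameters.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of UrlInfo: one Bloom filter per domain, used to detect
// duplicate URLs. Sizes and hashing are assumed values for illustration.
public class UrlInfo {
    private static final int BITS = 1 << 20;   // bits per domain filter (assumed)
    private static final int PROBES = 4;       // hash probes per URL (assumed)

    private final Map<String, BitSet> filters = new ConcurrentHashMap<>();

    /** Returns true if the URL was probably seen before; otherwise records it. */
    public synchronized boolean isDuplicate(String domain, String url) {
        BitSet filter = filters.computeIfAbsent(domain, d -> new BitSet(BITS));
        int h = url.hashCode();
        int step = (h >>> 16) | 1;              // odd step for double hashing
        boolean seen = true;
        for (int i = 0; i < PROBES; i++) {
            int bit = Math.abs((h + i * step) % BITS);
            if (!filter.get(bit)) {
                seen = false;                   // an unset bit: definitely new
                filter.set(bit);                // record the URL as crawled
            }
        }
        return seen;                            // all bits set: probably a duplicate
    }
}
```

As with any Bloom filter, a true result may occasionally be a false positive, causing a new URL to be skipped; there are no false negatives, so no URL is crawled twice.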
2.2 Threads
Now, let us look at the different types of threads that constitute the working engine of PeerCrawl.

Fetch_Thread: The main purpose of this thread is to fetch a document from a web server. It takes a URL from CrawlJobs and retrieves the document over a standard HTTP connection.

Process_Thread: This thread scans a document from FetchBuffer for URLs, using standard pattern recognition techniques for URLs. It also keeps track of duplicate URLs using the Bloom filters stored in UrlInfo.

Caching_Thread: Caches documents from FetchBuffer to secondary storage for later retrieval during user queries. The cached documents can be used to create an internet archive for specific domains.

There can be more than one of each of the above threads, depending on the processing power of the peer. Later on, we show some results that give a glimpse of how varying the number of threads affects performance.

Statistics_Thread: Maintains all the bookkeeping information for every peer. This includes the number of URLs crawled, added, and processed per second and various HTTP connection statistics, along with the dropped and duplicate URLs.

Backup_Thread: Periodically backs up critical data structures for recovery purposes.

Network_Connection_Thread: Uses the P2P network layer to detect any peers entering or leaving the network.

Dispatch_Thread: A peer broadcasts any URL that does not belong to its own crawl domain. The domain of a peer is dynamically updated depending on the number of peers in the network at any instant. The interaction between different peers is detailed in Section 3.

2.3 Workflow
With an idea of the different data structures and thread types, let us walk through the life cycle of a peer in PeerCrawl (a sketch of the fetch side follows the list below). It starts with a seed list of URLs, which are typically diverse URLs that give the crawler a better starting range.

0) The seed list of URLs is added to CrawlJobs.
1) A Fetch_Thread picks up a block of URLs.
2) The Fetch_Thread retrieves each document from the corresponding web server. It can use a local DNS cache to speed up connection setup.
3) If there are no HTTP connection errors or socket timeouts, the Fetch_Thread puts the document into FetchBuffer for processing.
4) A Process_Thread scans the document for links to other URLs. URLs are normalized and cleaned before being passed on to the next filter. We also limit URLs to a particular depth, so that we do not run into unnecessary links that loop (e.g., a page referring relatively to itself).
5) Once the document is fully processed, the Caching_Thread buffers it to secondary storage.
6) The Process_Thread now checks the Robots Exclusion standard[12] and other rules for filtering out unnecessary URLs.
7) If a URL is not mapped to this peer, it is broadcast to the other peers. Otherwise it is added to CrawlJobs, completing the cycle.

The Statistics_Thread, Network_Connection_Thread and Backup_Thread keep running in the background at different time intervals.
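To make steps 1-3 concrete, here is a minimal JAVA sketch of a Fetch_Thread draining CrawlJobs and feeding FetchBuffer, simplified to take one URL at a time rather than a block. The queue types, the timeout value, and the error handling are assumptions for illustration, not PeerCrawl's actual thread code.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.BlockingQueue;

// Illustrative Fetch_Thread: take a URL from CrawlJobs, fetch the document
// over a standard HTTP connection, hand it to FetchBuffer (steps 1-3).
public class FetchThread extends Thread {
    private final BlockingQueue<String> crawlJobs;    // CrawlJobs: URLs to fetch
    private final BlockingQueue<String> fetchBuffer;  // FetchBuffer: fetched documents

    public FetchThread(BlockingQueue<String> crawlJobs,
                       BlockingQueue<String> fetchBuffer) {
        this.crawlJobs = crawlJobs;
        this.fetchBuffer = fetchBuffer;
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            try {
                String url = crawlJobs.take();                    // step 1
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);                     // assumed timeout
                conn.setReadTimeout(5000);
                try (InputStream in = conn.getInputStream()) {    // step 2
                    String document = new String(in.readAllBytes());
                    fetchBuffer.put(document);                    // step 3
                }
            } catch (InterruptedException e) {
                return;      // peer is shutting down
            } catch (Exception e) {
                // HTTP connection error or socket timeout: drop the URL (step 3)
            }
        }
    }
}
```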
3. PEER COORDINATION
So far we have seen the working of a single node in PeerCrawl, along with its data structures and thread types. What differentiates PeerCrawl from its counterparts is the notion of a P2P network. We use an open-source Gnutella[13] implementation as the underlying P2P network. Previous work on this project[14] describes the working of the P2P layer in more detail. In this paper, we concentrate on using this layer to establish various forms of peer coordination.

During the first few iterations of the crawler, the primary motive was to crawl the entire web. The peers used a URL distribution function, as described in DSphere[15], which dynamically decides the floor and ceiling of the crawl domain of a peer. This is done by dividing the total range of IP addresses equally amongst the peers (a sketch of this mapping is given at the end of this section).

Although PeerCrawl was feasible for crawling the entire world wide web, the ingenuity lies in coordinating the peers to crawl a web domain or a set of domains. We propose at least two variations of this task, where we assume that the user supplies a set of domains to be crawled as a Task-List (distinct from the seed list).

One possible way of coordinating the peers is to assign each peer one domain from the Task-List. This results in ideal performance if:

    Number of peers <= size of Task-List

If the number of peers is greater than the size of the Task-List, some peers will sit idle. There is very little interaction between the peers, as each of them crawls non-overlapping domains. The root node can act as a center for handing out uncrawled domains.

Typically, in real-world scenarios, the Task-List will be small and the number of available peers large. The best way to exploit the scalable nature of the P2P network is then to convert the Task-List into one continuous domain, allowing a mapping similar to the one described in DSphere[15]. This even ensures that domains of uneven sizes get divided uniformly. However, it increases the communication between peers, along with the number of duplicate URLs. A hybrid version incorporating both schemes is also possible. The performance of these schemes can be largely influenced by the nature of the domain, i.e. its size, connectivity, depth, etc. This field of research is rather unexplored and could yield novel schemes for achieving maximum performance gain.
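The following JAVA sketch illustrates the DSphere-style IP-range mapping described above: the IPv4 address space is divided equally amongst the peers, and the Dispatch_Thread broadcasts any URL whose host falls outside the local peer's range. The arithmetic and method names are our assumptions, not DSphere's or PeerCrawl's exact code, and IPv6 is ignored for simplicity.

```java
import java.net.InetAddress;
import java.net.URL;

// Illustrative URL distribution function: a URL belongs to the peer whose
// [floor, ceiling] slice of the IPv4 space contains its host's address.
public class UrlDistribution {

    /** Maps a URL to a peer id in [0, numPeers). */
    public static int peerFor(String url, int numPeers) throws Exception {
        byte[] addr = InetAddress.getByName(new URL(url).getHost()).getAddress();
        long ip = 0;                              // IPv4 address as unsigned 32-bit
        for (byte b : addr) ip = (ip << 8) | (b & 0xFF);
        long rangeSize = (1L << 32) / numPeers;   // equal share of the IP range
        return (int) Math.min(ip / rangeSize, numPeers - 1); // last peer takes remainder
    }

    /** Dispatch_Thread broadcasts URLs for which this returns false. */
    public static boolean belongsToMe(String url, int myPeerId, int numPeers)
            throws Exception {
        return peerFor(url, numPeers) == myPeerId;
    }
}
```

Because the mapping depends only on the current number of peers, a peer's floor and ceiling can be recomputed whenever the Network_Connection_Thread observes a peer joining or leaving.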
4. ISSUES WITH CRAWLING
One of the standard problems faced by any data-intensive system is managing overflows of its data queues. Coupled with thread scheduling, this makes for a very interesting problem from the systems point of view. In any system, as the number of threads increases, the synchronization overhead becomes more prominent. Thus, there is a classic tradeoff between synchronizing shared data structures and keeping local copies of them. We try to justify a blend of both schemes.

4.1 Buffer Management
Synchronization is needed for the CrawlJobs and FetchBuffer data structures, which traditionally operate as FIFO queues. Later on, CrawlJobs could operate as a priority queue, with enhanced schemes for determining the importance of URLs (e.g. page ranking). In Section 5, we examine the limits of each of these data structures with respect to the various varying factors. Note that these data structures are also synchronized on by the Statistics_Thread and Backup_Thread.

The Process_Thread adds a block of URLs to the CrawlJobs queue, which a Fetch_Thread picks up in order to get the documents from the web. URLs are added in blocks because that is how URLs are found in a document. We let the Fetch_Thread make a local copy of its block, which removes the synchronization overhead of constantly spinning on CrawlJobs. There is an (empirically determined) limit for CrawlJobs beyond which a new Fetch_Thread is spawned to handle the overflowing URLs (see the sketch at the end of this subsection). Other design techniques include the local queues per processing and fetch thread used in Mercator[10].

The FetchBuffer works quite differently from CrawlJobs. The threads synchronizing over FetchBuffer are: the Fetch_Thread, which adds documents; the Process_Thread, which gets and processes documents; and the Caching_Thread, whose job is to buffer the processed pages to secondary storage. Since there is a fixed ordering amongst these threads, we keep one global copy of the data structure and maintain variables to manipulate the order. Similar to CrawlJobs, the FetchBuffer also has a limit beyond which a new Process_Thread is spawned to keep up with the faster incoming rate of the data queue.
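The JAVA sketch below illustrates both CrawlJobs decisions: block-wise local copies for the Fetch_Thread, and the spawn-on-overflow limit. The block size, the limit, and the spawn hook are assumed values for illustration, not PeerCrawl's empirically determined ones.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of CrawlJobs buffer management: producers add URL blocks, a
// Fetch_Thread drains a local block under one lock acquisition instead of
// spinning on the shared queue, and crossing a limit adds fetch capacity.
public class CrawlJobs {
    private static final int BLOCK_SIZE = 50;      // URLs per local copy (assumed)
    private static final int SPAWN_LIMIT = 10_000; // overflow threshold (assumed)

    private final List<String> queue = new ArrayList<>();

    /** Called by Process_Thread with all URLs extracted from one document. */
    public synchronized void addBlock(List<String> urls) {
        queue.addAll(urls);
        if (queue.size() > SPAWN_LIMIT) {
            spawnFetchThread();        // add fetch capacity on overflow
        }
    }

    /** Called by Fetch_Thread: copy out a block, then work without the lock. */
    public synchronized List<String> takeBlock() {
        int n = Math.min(BLOCK_SIZE, queue.size());
        List<String> block = new ArrayList<>(queue.subList(0, n));
        queue.subList(0, n).clear();   // remove the copied URLs from the queue
        return block;
    }

    private void spawnFetchThread() {
        // In PeerCrawl this would start one more Fetch_Thread draining this
        // queue; left abstract in this sketch.
    }
}
```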
4.2 Thread Scheduling
Many crawlers use custom scheduling for the various threads in their system to improve performance. We argue that if the system is implemented in JAVA, the best way to improve performance is to leave thread scheduling to the JAVA Virtual Machine (JVM). If we want to explicitly control the timing of some threads (like the Backup_Thread and Statistics_Thread in this system), we can use the thread routines provided by JAVA. However, we still have to decide on the number of threads running on a peer. We have tried various schemes and found that, amongst the combinations tried, the best performance is given by running Fetch_Threads and Process_Threads in pairs.

4.3 DNS Lookup
According to Mercator[10], the largest amount of time lost during the life cycle of a crawler is in DNS lookup. Although we have not stress-tested this issue, we note that JAVA caches DNS lookups internally, which removes the need for a separate user-level data structure for the same purpose.

4.4 URL Starvation
A universal problem of domain-restricted crawlers is that they run out of URLs if they are never allowed to venture to remote URLs. The solutions floating around range from improving the quality of the seed list (which fails, as it is very difficult to find a good-quality seed list) to finding similar documents using meta-crawl techniques[16]. We adopt the tunneling proposed by Bergmark[17]: the crawler ventures out of the web domain to find links pointing back to pages within the domain. We control this by maintaining a hop count, which can be set empirically (a sketch of this rule follows).
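A minimal JAVA sketch of this tunneling rule: off-domain links are still followed, but each consecutive off-domain hop increments a counter, and the chain is cut once the hop limit is exceeded. The limit, the record type, and the substring-based domain test are simplifying assumptions for illustration.

```java
// Illustrative tunneling filter in the spirit of Bergmark[17]: the hop
// count resets whenever the crawler tunnels back into the target domain.
public class Tunneling {
    private static final int MAX_HOPS = 3;   // empirically set hop count (assumed)

    /** A URL together with the number of consecutive off-domain hops leading to it. */
    public record CrawlUrl(String url, int offDomainHops) {}

    /** Returns the record to enqueue for a discovered link, or null to drop it. */
    public static CrawlUrl follow(CrawlUrl parent, String link, String domain) {
        if (link.contains(domain)) {
            return new CrawlUrl(link, 0);        // back inside the domain: reset
        }
        int hops = parent.offDomainHops() + 1;   // one more hop away from the domain
        return hops <= MAX_HOPS ? new CrawlUrl(link, hops) : null;
    }
}
```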
4.5 Other Issues
We follow the policies that a traditional web crawler should follow, including a politeness policy and a re-visit policy, amongst others. We also follow the Robots Exclusion standard[12], which disallows the crawling of some sub-domains for different types of user agents.

5. EXPERIMENTS AND RESULTS
In this paper we concentrate on the performance of one peer in PeerCrawl. Since we have many varying parameters, we can set up some interesting experiments to study the crawler. We run all our experiments on the gatech.edu domain. First, let us look at some statistics about gatech.edu from well-known crawlers.

Search Engine     Size of gatech.edu
Google            1,24,
Yahoo! Search     1,445,287
MSN Search        576,358
Ask.com           1,485,

Table 1. Number of URLs for gatech.edu according to various search engines

If we look at the results in Figure 2, the P2P crawler performs well, considering that this is just the number of URLs crawled by a single node in the P2P network, starting from a single seed URL. A depth is defined as a sub-domain of a domain. We also observe that the number of URLs increases exponentially with depth in the initial stages and then settles into a linear curve. This is expected, as the number of URLs in the first few sub-domains is much larger than in the deeper ones.

[Figure 2. URLs crawled vs. depth of domain for gatech.edu]

Figure 3 gives an example of the impact of the number of threads on the performance of a node.

[Figure 3. Time to crawl vs. number of threads]

As expected, the time taken decreases exponentially until it reaches a stable point beyond which it does not get any faster. The deviations in the graph indicate the tradeoff made in the handling of threads. The point of stabilization shifts to the right (i.e. more threads are needed to reach the minimum time to crawl) as we increase the crawl limit. Also, beyond a very high number of threads, the time to crawl increases again due to the overhead of handling the excess threads.

Finally, let us examine the results for the sizes of the different data buffers that the threads synchronize on. As expected, the size of CrawlJobs decreases as more Fetch_Threads are introduced. However, the size of FetchBuffer increases, as the Fetch_Threads are faster than the Process_Threads.

[Figure 4. Size of FetchBuffer and CrawlJobs vs. number of threads]

6. CONCLUSION
We have demonstrated the working of a node in PeerCrawl, along with the factors affecting its performance. We can expect better performance and more extensibility as a number of peers are used in coordination. Although we still have to fine-tune the crawler, we can postulate its usage in many domains apart from traditional indexing and searching.

7. REFERENCES
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Computer Science Department, Stanford University, 1998.
[2] R. Miller and K. Bharat. SPHINX: A framework for creating personal, site-specific web crawlers. In Proceedings of the 7th World-Wide Web Conference (WWW7), 1998.
[3] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on the World-Wide Web, 1999.
[4] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In Proc. AAAI Spring Symposium on Intelligent Agents in Cyberspace, 1999.
[5] J. Rennie and A. McCallum. Using reinforcement learning to spider the web efficiently. In Proc. International Conference on Machine Learning (ICML), 1999.
[6] V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), 2002.
[7] J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the 11th International World Wide Web Conference, 2002.
[8] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8), 2004.
[9] J. Caverlee and L. Liu. Resisting web spam with credibility-based link analysis.
[10] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4), 1999.
[11] A. Singh, M. Srivatsa, L. Liu, and T. Miller. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. Lecture Notes in Computer Science, 2924, 2004.
[12] M. Koster. The Robot Exclusion Standard.
[13] Gnutella Network.
[14] V. J. Padliya and L. Liu. PeerCrawl: A decentralized peer-to-peer architecture for crawling the world wide web. Technical report, Georgia Institute of Technology, May 2006.
[15] B. Bamba, L. Liu, et al. DSphere: A source-centric approach to crawling, indexing and searching the world wide web.
[16] J. Qin, Y. Zhou, and M. Chau. Building domain-specific web collections for scientific digital libraries: A meta-search enhanced focused crawling method. In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, June 7-11, 2004, Tucson, AZ, USA.
[17] D. Bergmark, C. Lagoze, and A. Sbityakov. Focused crawls, tunneling, and digital libraries. In Proc. of the 6th European Conference on Digital Libraries, Rome, Italy, 2002.