Frontera: open source, large-scale web crawling framework. Alexander Sibiryakov, October 1, 2015. sibiryakov@scrapinghub.com


Hello, participants! Born in Yekaterinburg, RU. 5 years at Yandex, search quality department: social and QA search, snippets. 2 years at Avast! antivirus, research team: automatic false-positive resolution, large-scale prediction of malicious download attempts.

Task: crawl the Spanish web to gather statistics about hosts and their sizes. Limit the crawl to the .es zone. Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on. Finishing condition: no hosts remain with fewer than 100 crawled documents. Keep costs low.

The Spanish internet (.es) in 2012: domain names registered - 1.56M (39% growth per year); web servers in the zone - 283.4K (33.1%); hosts - 4.2M (21%); Spanish web sites in the DMOZ catalog - 22,043. (Source: OECD Communications Outlook 2013 report.)

Solution: Scrapy* - network operations. Apache Kafka - data bus (offsets, partitioning). Apache HBase - storage (random access, linear scanning, scalability). Twisted.Internet - library of async primitives used in the workers. Snappy - efficient compression algorithm for IO-bound applications. * - network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.

Architecture [diagram]: Kafka topics connect the spiders with the crawling strategy workers (SW) and the storage workers (DB).

1. Big and small hosts (problem). When the crawler receives a huge number of links from a single host and uses a simple prioritization model, the queue ends up flooded with URLs from that host, which leaves spider resources underused. We adopted an additional per-host (optionally per-IP) queue and a metering algorithm; URLs from big hosts are cached in memory.
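
A minimal sketch of the per-host queue idea (illustrative only, not Frontera's actual implementation): keep a FIFO queue per host and assemble batches round-robin, so no single host can monopolize a batch.

    from collections import defaultdict, deque

    class PerHostQueue(object):
        """Round-robin over per-host FIFO queues, so one huge host
        cannot flood a batch (sketch, not Frontera's code)."""
        def __init__(self):
            self.queues = defaultdict(deque)  # hostname -> pending URLs

        def push(self, hostname, url):
            self.queues[hostname].append(url)

        def next_batch(self, max_per_host=10):
            batch = []
            for hostname, q in self.queues.items():
                for _ in range(min(max_per_host, len(q))):
                    batch.append(q.popleft())
            return batch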

3. DDoSing the DNS service. The breadth-first strategy implies visiting previously unknown hosts first, which generates a huge number of DNS requests. Solution: a recursive DNS server on each downloading node on Amazon AWS, with upstreams set to Verizon and OpenDNS. We used dnsmasq.
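
For reference, a minimal dnsmasq configuration for such a node might look like the following; the upstream IPs shown are the public OpenDNS resolvers, and the cache size is an illustrative value.

    # /etc/dnsmasq.conf (sketch)
    no-resolv                 # ignore /etc/resolv.conf, use the servers below
    server=208.67.222.222     # OpenDNS upstream
    server=208.67.220.220     # OpenDNS upstream
    cache-size=10000          # breadth-first crawling hits many new names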

4. Tuning the Scrapy thread pool for efficient DNS resolution. Scrapy uses a thread pool to resolve DNS names to IPs. When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks. Scrapy reported numerous errors related to DNS name resolution and timeouts. We contributed an option to Scrapy for adjusting the thread pool size and timeout.
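
Today those knobs are ordinary Scrapy settings; a hedged example (the values are illustrative, tune them for your crawl):

    # settings.py
    REACTOR_THREADPOOL_MAXSIZE = 20  # threads available for blocking DNS lookups
    DNS_TIMEOUT = 10                 # seconds before a resolution attempt fails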

5. Overloaded HBase region servers during the state check. The crawler extracts hundreds of links per document on average. Before these links are added to the queue, they need to be checked against the already-crawled set (to avoid revisiting). At small volumes SSDs were fine; after the table grew we had to move to HDDs, and response times grew dramatically. Fixes: a host-local fingerprint function for HBase keys, and tuning the HBase block cache so that an average host's states fit into one block.
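
The host-local fingerprint idea is to prefix the HBase row key with a hash of the hostname, so all URLs of one host land in a contiguous key range and tend to share blocks. A sketch under that assumption (not the project's exact function):

    import hashlib
    from urlparse import urlparse  # Python 2.7; urllib.parse on Python 3

    def host_local_fingerprint(url):
        """Row key = hash(hostname) prefix + hash(full URL), so URLs
        from the same host become adjacent in HBase."""
        host = urlparse(url).hostname or ''
        host_part = hashlib.sha1(host.encode('utf-8')).hexdigest()[:8]
        url_part = hashlib.sha1(url.encode('utf-8')).hexdigest()
        return host_part + url_part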

6. Intensive network traffic from workers to services. We observed up to 1 Gbit/s of throughput between the workers, Kafka, and HBase. Fixes: switched to the Thrift compact protocol for HBase communication, and enabled Snappy message compression in Kafka.
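
With the kafka-python client, for example, producer-side Snappy compression is a single setting (a sketch: the broker address and topic name are placeholders, and the python-snappy package must be installed):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             compression_type='snappy')
    producer.send('frontier-spider-log', b'serialized crawl event')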

7. Further query and traffic optimizations for HBase. The state check accounted for the lion's share of requests and network throughput, and consistency was another requirement. We created a local state cache in the strategy worker; for consistency, the spider log was partitioned by host, so caches never overlap between workers.
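
Partitioning the spider log by host reduces to a stable hostname-to-partition mapping, so every event about a given host reaches the same strategy worker. An illustrative sketch:

    from zlib import crc32

    def partition_for(hostname, num_partitions):
        """Stable host -> partition id; one host never spans two workers."""
        # mask to get an unsigned 32-bit value on Python 2
        return (crc32(hostname.encode('utf-8')) & 0xffffffff) % num_partitions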

State cache. All operations are batched: if a key is absent from the cache, it is requested from HBase; every ~4K documents the cache is flushed to HBase; on reaching 3M elements (~1 GB), a flush and cleanup happen. A Least-Recently-Used (LRU) eviction policy seems a good fit here.
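
A minimal sketch of such a cache (thresholds mirror the talk; the backend object and its get/put methods are assumptions):

    from collections import OrderedDict

    class StateCache(object):
        """LRU cache of per-URL crawl states with batched flushes:
        flush dirty entries every ~4K writes, evict above 3M entries."""
        def __init__(self, backend, flush_every=4096, max_size=3 * 10**6):
            self.backend = backend         # assumed API: get(key), put(dict)
            self.cache = OrderedDict()
            self.dirty = {}
            self.flush_every = flush_every
            self.max_size = max_size

        def get(self, key):
            if key in self.cache:
                self.cache[key] = self.cache.pop(key)    # refresh LRU position
            else:
                self.cache[key] = self.backend.get(key)  # miss: one HBase read
            return self.cache[key]

        def set(self, key, state):
            self.cache[key] = state
            self.dirty[key] = state
            if len(self.dirty) >= self.flush_every:
                self.flush()
            while len(self.cache) > self.max_size:       # evict LRU entries
                self.cache.popitem(last=False)

        def flush(self):
            self.backend.put(self.dirty)                 # one batched write
            self.dirty = {}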

Spider priority queue (slot). Each cell holds an array of: fingerprint, Crc32(hostname), URL, score. Dequeueing takes the top N. Such a design is vulnerable to huge hosts; the problem can be partially solved with a scoring model that takes the known per-host document count into account.
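
In Python terms, a queue cell and a top-N dequeue could look like this (field names follow the slide; everything else is an illustrative assumption):

    import heapq
    from collections import namedtuple

    Cell = namedtuple('Cell', ['fingerprint', 'host_crc32', 'url', 'score'])

    def dequeue_top(cells, n):
        """Take the n highest-scored cells; with a naive scoring model
        a single huge host can dominate the result."""
        return heapq.nlargest(n, cells, key=lambda c: c.score)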

8. The problem of big and small hosts (strikes back!). During crawling we found a few very huge hosts (>20M docs), and all queue partitions were flooded with pages from those few hosts, because of the queue design and the scoring model used. We wrote two MapReduce jobs: one shuffling the queue, one limiting every host to no more than 100 documents.
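
The capping job boils down to a group-by-host with a counter. A single-machine sketch of the reduce step (the real version ran as a MapReduce job over the queue table):

    from itertools import groupby

    def cap_per_host(records, limit=100):
        """records: (hostname, url) pairs sorted by hostname, as the
        MapReduce shuffle phase would deliver them; keep at most
        `limit` documents per host."""
        for hostname, group in groupby(records, key=lambda r: r[0]):
            for i, record in enumerate(group):
                if i >= limit:
                    break
                yield record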

Hardware requirements. A single-threaded Scrapy spider yields 1,200 pages/min from about 100 websites crawled in parallel. The spider-to-worker ratio is 4:1 (without content); allow 1 GB of RAM for every SW (state cache, tunable). Example: 12 spiders ~ 14.4K pages/min, 3 SW and 3 DB workers, 18 cores total.

Software requirements: Apache HBase, Apache Kafka, Python 2.7+, CDH (100% open source Hadoop package), Scrapy 0.24+, a DNS service.

Maintaining Cloudera Hadoop on Amazon EC2. CDH is very sensitive to free space on the root partition (parcels and Cloudera Manager storage), so we moved those to a separate EBS partition using symbolic links. The EBS volume should be at least 30 GB; baseline IOPS proved sufficient. Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB, 2x40 GB SSD). After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB, 3x2 TB HDD).

Spanish (.es) internet crawl results: fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, and docentesconeducacion.es are the biggest websites; 68.7K domains found (~600K expected); 46.5M pages crawled overall in 1.5 months; 22 websites with more than 50M pages.

Where are the rest of the web servers?!

Bow-tie model: A. Broder et al., Computer Networks 33 (2000), 309-320.

Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005

Graph Structure in the Web Revisited, Meusel, Vigna, WWW 2014

Main features. Online operation: scheduling of new batches, updating of DB state. Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included). Canonical URL resolution abstraction: each document is reachable via many URLs, so which one should be used? Scrapy ecosystem: good documentation, big community, easy customization.
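
To illustrate the storage abstraction, a toy in-memory backend might look like this; the method names follow the general shape of Frontera's backend callbacks, but check the documentation for the exact contract:

    class InMemoryBackend(object):
        """Toy frontier backend: a list as the queue, a set as the
        seen-URL filter. Illustrative only."""
        def __init__(self):
            self.queue, self.seen = [], set()

        def add_seeds(self, seeds):
            self.queue.extend(s for s in seeds if s.url not in self.seen)

        def page_crawled(self, response, links):
            self.seen.add(response.url)
            self.queue.extend(l for l in links if l.url not in self.seen)

        def get_next_requests(self, max_n):
            batch, self.queue = self.queue[:max_n], self.queue[max_n:]
            return batch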

Main features. The communication layer is Apache Kafka: topic partitioning, offsets mechanism. Crawling strategy abstraction: the crawling goal, URL ordering, and scoring model are coded in a separate module. Polite by design: each website is downloaded by at most one spider. Python throughout: workers and spiders.

References. Distributed Frontera: https://github.com/scrapinghub/distributed-frontera Frontera: https://github.com/scrapinghub/frontera Documentation: http://distributed-frontera.readthedocs.org/ http://frontera.readthedocs.org/

Future plans: a lighter version, without HBase and Kafka, communicating over sockets; an out-of-the-box revisiting strategy; a watchdog solution tracking website content changes; PageRank or HITS strategies; our own HTML and URL parsers; integration into Scrapinghub services; testing on larger volumes.

Contribute! Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python. It is a truly resource-intensive task: CPU, network, disks. Made at Scrapinghub, the company where Scrapy was created. It plans to become an Apache Software Foundation project.

We're hiring! http://scrapinghub.com/jobs/

Köszönöm! Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com