Frontera: open source, large-scale web crawling framework. Alexander Sibiryakov, October 1, 2015. sibiryakov@scrapinghub.com


Hello, participants! Born in Yekaterinburg, RU. 5 years at Yandex, search quality department: social and QA search, snippets. 2 years at Avast! antivirus, research team: automatic false-positive resolution, large-scale prediction of malicious download attempts.

Task: crawl the Spanish web to gather statistics about hosts and their sizes. Limit the crawl to the .es zone. Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on. Finishing condition: no hosts remain with fewer than 100 crawled documents. Keep costs low.

The Spanish internet (.es) in 2012: domain names registered - 1.56M (39% growth per year); web servers in the zone - 283.4K (33.1%); hosts - 4.2M (21%); Spanish web sites in the DMOZ catalog - 22,043. (Source: OECD Communications Outlook 2013 report.)

Solution: Scrapy* - network operations. Apache Kafka - data bus (offsets, partitioning). Apache HBase - storage (random access, linear scanning, scalability). Twisted.Internet - library of async primitives used in the workers. Snappy - efficient compression algorithm for IO-bound applications. * - network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.

Architecture [diagram]: Kafka topics connect the spiders with the crawling strategy workers (SW) and the storage workers (DB).

1. Big and small hosts (problem). When the crawler receives a huge number of links from a single host and uses a simple prioritization model, the queue ends up flooded with URLs from that host, which leaves spider resources underused. We adopted an additional per-host (optionally per-IP) queue and a metering algorithm; URLs from big hosts are cached in memory.
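
A minimal sketch of the per-host queue idea (illustrative only, not Frontera's actual implementation): keep a FIFO queue per host and assemble batches round-robin, so no single host can monopolize a batch.

    from collections import defaultdict, deque

    class PerHostQueue(object):
        """Round-robin over per-host FIFO queues, so one huge host
        cannot flood a batch (sketch, not Frontera's code)."""
        def __init__(self):
            self.queues = defaultdict(deque)  # hostname -> pending URLs

        def push(self, hostname, url):
            self.queues[hostname].append(url)

        def next_batch(self, max_per_host=10):
            batch = []
            for hostname, q in self.queues.items():
                for _ in range(min(max_per_host, len(q))):
                    batch.append(q.popleft())
            return batch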

3. DDoSing the DNS service. The breadth-first strategy implies visiting previously unknown hosts first, which generates a huge number of DNS requests. Solution: a recursive DNS server on each downloading node on Amazon AWS, with upstreams set to Verizon and OpenDNS. We used dnsmasq.
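
For reference, a minimal dnsmasq configuration for such a node might look like the following; the upstream IPs shown are the public OpenDNS resolvers, and the cache size is an illustrative value.

    # /etc/dnsmasq.conf (sketch)
    no-resolv                 # ignore /etc/resolv.conf, use the servers below
    server=208.67.222.222     # OpenDNS upstream
    server=208.67.220.220     # OpenDNS upstream
    cache-size=10000          # breadth-first crawling hits many new names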

4. Tuning the Scrapy thread pool for efficient DNS resolution. Scrapy uses a thread pool to resolve DNS names to IPs. When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks. Scrapy reported numerous errors related to DNS name resolution and timeouts. We contributed an option to Scrapy for adjusting the thread pool size and timeout.
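
Today those knobs are ordinary Scrapy settings; a hedged example (the values are illustrative, tune them for your crawl):

    # settings.py
    REACTOR_THREADPOOL_MAXSIZE = 20  # threads available for blocking DNS lookups
    DNS_TIMEOUT = 10                 # seconds before a resolution attempt fails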

5. Overloaded HBase region servers during the state check. The crawler extracts hundreds of links per document on average. Before these links are added to the queue, they need to be checked against the already-crawled set (to avoid revisiting). At small volumes SSDs were fine; after the table grew we had to move to HDDs, and response times grew dramatically. Fixes: a host-local fingerprint function for HBase keys, and tuning the HBase block cache so that an average host's states fit into one block.
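
The host-local fingerprint idea is to prefix the HBase row key with a hash of the hostname, so all URLs of one host land in a contiguous key range and tend to share blocks. A sketch under that assumption (not the project's exact function):

    import hashlib
    from urlparse import urlparse  # Python 2.7; urllib.parse on Python 3

    def host_local_fingerprint(url):
        """Row key = hash(hostname) prefix + hash(full URL), so URLs
        from the same host become adjacent in HBase."""
        host = urlparse(url).hostname or ''
        host_part = hashlib.sha1(host.encode('utf-8')).hexdigest()[:8]
        url_part = hashlib.sha1(url.encode('utf-8')).hexdigest()
        return host_part + url_part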

6. Intensive network traffic from workers to services. We observed up to 1 Gbit/s of throughput between the workers, Kafka, and HBase. Fixes: switched to the Thrift compact protocol for HBase communication, and enabled Snappy message compression in Kafka.
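
With the kafka-python client, for example, producer-side Snappy compression is a single setting (a sketch: the broker address and topic name are placeholders, and the python-snappy package must be installed):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             compression_type='snappy')
    producer.send('frontier-spider-log', b'serialized crawl event')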

7. Further query and traffic optimizations for HBase. The state check accounted for the lion's share of requests and network throughput, and consistency was another requirement. We created a local state cache in the strategy worker; for consistency, the spider log was partitioned by host, so caches never overlap between workers.
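
Partitioning the spider log by host reduces to a stable hostname-to-partition mapping, so every event about a given host reaches the same strategy worker. An illustrative sketch:

    from zlib import crc32

    def partition_for(hostname, num_partitions):
        """Stable host -> partition id; one host never spans two workers."""
        # mask to get an unsigned 32-bit value on Python 2
        return (crc32(hostname.encode('utf-8')) & 0xffffffff) % num_partitions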

State cache. All operations are batched: if a key is absent from the cache, it is requested from HBase; every ~4K documents the cache is flushed to HBase; on reaching 3M elements (~1 GB), a flush and cleanup happen. A Least-Recently-Used (LRU) eviction policy seems a good fit here.
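
A minimal sketch of such a cache (thresholds mirror the talk; the backend object and its get/put methods are assumptions):

    from collections import OrderedDict

    class StateCache(object):
        """LRU cache of per-URL crawl states with batched flushes:
        flush dirty entries every ~4K writes, evict above 3M entries."""
        def __init__(self, backend, flush_every=4096, max_size=3 * 10**6):
            self.backend = backend         # assumed API: get(key), put(dict)
            self.cache = OrderedDict()
            self.dirty = {}
            self.flush_every = flush_every
            self.max_size = max_size

        def get(self, key):
            if key in self.cache:
                self.cache[key] = self.cache.pop(key)    # refresh LRU position
            else:
                self.cache[key] = self.backend.get(key)  # miss: one HBase read
            return self.cache[key]

        def set(self, key, state):
            self.cache[key] = state
            self.dirty[key] = state
            if len(self.dirty) >= self.flush_every:
                self.flush()
            while len(self.cache) > self.max_size:       # evict LRU entries
                self.cache.popitem(last=False)

        def flush(self):
            self.backend.put(self.dirty)                 # one batched write
            self.dirty = {}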

Spider priority queue (slot). Each cell holds an array of: fingerprint, Crc32(hostname), URL, score. Dequeueing takes the top N. Such a design is vulnerable to huge hosts; the problem can be partially solved with a scoring model that takes the known per-host document count into account.
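
In Python terms, a queue cell and a top-N dequeue could look like this (field names follow the slide; everything else is an illustrative assumption):

    import heapq
    from collections import namedtuple

    Cell = namedtuple('Cell', ['fingerprint', 'host_crc32', 'url', 'score'])

    def dequeue_top(cells, n):
        """Take the n highest-scored cells; with a naive scoring model
        a single huge host can dominate the result."""
        return heapq.nlargest(n, cells, key=lambda c: c.score)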

8. The problem of big and small hosts (strikes back!). During crawling we found a few very huge hosts (>20M docs), and all queue partitions were flooded with pages from those few hosts, because of the queue design and the scoring model used. We wrote two MapReduce jobs: one shuffling the queue, one limiting every host to no more than 100 documents.
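
The capping job boils down to a group-by-host with a counter. A single-machine sketch of the reduce step (the real version ran as a MapReduce job over the queue table):

    from itertools import groupby

    def cap_per_host(records, limit=100):
        """records: (hostname, url) pairs sorted by hostname, as the
        MapReduce shuffle phase would deliver them; keep at most
        `limit` documents per host."""
        for hostname, group in groupby(records, key=lambda r: r[0]):
            for i, record in enumerate(group):
                if i >= limit:
                    break
                yield record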

Hardware requirements. A single-threaded Scrapy spider yields 1,200 pages/min from about 100 websites crawled in parallel. The spider-to-worker ratio is 4:1 (without content); allow 1 GB of RAM for every SW (state cache, tunable). Example: 12 spiders ~ 14.4K pages/min, 3 SW and 3 DB workers, 18 cores total.

Software requirements: Apache HBase, Apache Kafka, Python 2.7+, CDH (100% open source Hadoop package), Scrapy 0.24+, a DNS service.

Maintaining Cloudera Hadoop on Amazon EC2. CDH is very sensitive to free space on the root partition (parcels and Cloudera Manager storage), so we moved those to a separate EBS partition using symbolic links. The EBS volume should be at least 30 GB; baseline IOPS proved sufficient. Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB, 2x40 GB SSD). After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB, 3x2 TB HDD).

Spanish (.es) internet crawl results: fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, and docentesconeducacion.es are the biggest websites; 68.7K domains found (~600K expected); 46.5M pages crawled overall in 1.5 months; 22 websites with more than 50M pages.

Where are the rest of the web servers?!

Bow-tie model: A. Broder et al., Computer Networks 33 (2000), 309-320.

Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005

Graph Structure in the Web Revisited, Meusel, Vigna, WWW 2014

Main features. Online operation: scheduling of new batches, updating of DB state. Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included). Canonical URL resolution abstraction: each document is reachable via many URLs, so which one should be used? Scrapy ecosystem: good documentation, big community, easy customization.
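
To illustrate the storage abstraction, a toy in-memory backend might look like this; the method names follow the general shape of Frontera's backend callbacks, but check the documentation for the exact contract:

    class InMemoryBackend(object):
        """Toy frontier backend: a list as the queue, a set as the
        seen-URL filter. Illustrative only."""
        def __init__(self):
            self.queue, self.seen = [], set()

        def add_seeds(self, seeds):
            self.queue.extend(s for s in seeds if s.url not in self.seen)

        def page_crawled(self, response, links):
            self.seen.add(response.url)
            self.queue.extend(l for l in links if l.url not in self.seen)

        def get_next_requests(self, max_n):
            batch, self.queue = self.queue[:max_n], self.queue[max_n:]
            return batch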

Main features. The communication layer is Apache Kafka: topic partitioning, offsets mechanism. Crawling strategy abstraction: the crawling goal, URL ordering, and scoring model are coded in a separate module. Polite by design: each website is downloaded by at most one spider. Python throughout: workers and spiders.

References. Distributed Frontera: https://github.com/scrapinghub/distributed-frontera Frontera: https://github.com/scrapinghub/frontera Documentation: http://distributed-frontera.readthedocs.org/ http://frontera.readthedocs.org/

Future plans: a lighter version, without HBase and Kafka, communicating over sockets; an out-of-the-box revisiting strategy; a watchdog solution tracking website content changes; PageRank or HITS strategies; our own HTML and URL parsers; integration into Scrapinghub services; testing on larger volumes.

Contribute! Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python. It is a truly resource-intensive task: CPU, network, disks. Made at Scrapinghub, the company where Scrapy was created. It plans to become an Apache Software Foundation project.

We're hiring! http://scrapinghub.com/jobs/

Köszönöm! Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com