Storm-Crawler: a real-time distributed web crawling and monitoring framework
Jake Dodd, co-founder, Ontopic
http://ontopic.io | jake@ontopic.io
ApacheCon North America 2015

Agenda
- Overview
- Continuous vs. Batch
- Storm-Crawler Components
- Integration
- Use Cases
- Demonstration
- Q&A

Storm-Crawler overview
- A Software Development Kit (SDK) for building web crawlers on Apache Storm
- https://github.com/digitalpebble/storm-crawler
- Apache License v2
- Project Director: Julien Nioche (DigitalPebble Ltd), plus 3 committers

Facts (overview)
- Powered by the Apache Storm framework
- Real-time, distributed, continuous crawling
- Discovery to indexing with low latency
- Java API
- Available as a Maven dependency (snippet below)
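As a concrete illustration, pulling the SDK into a Maven project might look like the snippet below; the coordinates and version are assumptions and should be verified against the project's README before use:

    <dependency>
      <groupId>com.digitalpebble</groupId>        <!-- coordinates assumed -->
      <artifactId>storm-crawler-core</artifactId> <!-- verify in the README -->
      <version>0.5</version>
    </dependency>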

The Old Way (continuous vs. batch)
- Batch-oriented crawling: generate a batch of URLs, then batch fetch → batch parse → batch index → rinse & repeat
- Benefits: well-suited when data locality is paramount
- Challenges:
  - Inefficient use of resources: parsing when you could be fetching, and it is hard to allocate and scale resources for individual tasks
  - High latency: at least several minutes, often hours, sometimes days between discovery and indexing

Continuous Crawl (continuous vs. batch)
- Treat crawling as a streaming problem: feed the machine a stream of URLs, receive a stream of results as soon as possible
- URL → fetch → parse → (other stuff) → index
- Benefits:
  - Low latency: discovery to indexing in mere moments
  - Efficient use of resources: always be fetching
  - Able to allocate resources to tasks on the fly (e.g. scale fetchers while holding parsers constant)
  - Easily supports stateful features (sessions and more)
- Challenges: URL queuing and scheduling

The Static Web (continuous vs. batch)
- The old model: the web as a collection of linked static documents
- Still a useful model: just ask Google, Yahoo, Bing, and friends
- But the web has evolved: dynamism is the rule, not the exception

The Web Stream (continuous vs. batch)
- Dynamic resources produce a stream of links to new documents
- Applies to web pages, feeds, and social media
(diagram: static and dynamic pages, each linking to newly produced documents)

Can we do both? (continuous vs. batch)
- From a crawler's perspective, there's not much difference between new pages and existing (but newly-discovered) pages
- Creating a web index from scratch can be modeled as a streaming problem: seed URLs → stream of discovered outlinks → rinse & repeat
- Discovering and indexing new content is a streaming problem
- Batch vs. continuous: both methods work, but continuous offers faster data availability, which is often important for new content

Conclusions (continuous vs. batch)
A modern web crawler should:
- Use resources efficiently
- Leverage the elasticity of modern cloud infrastructures
- Be responsive: fetch and index new documents with low latency
- Elegantly handle streams of new content
The dynamic web requires a dynamic crawler.

Storm-Crawler: What is it? (storm-crawler components)
- A Software Development Kit (SDK) for building and configuring continuous web crawlers
- Storm components (spouts & bolts) that handle the primary web crawling operations: fetching, parsing, and indexing
- Some of the code has been borrowed (with much gratitude) from Apache Nutch
- High level of maturity
- Organized into two sub-projects:
  - Core (sc-core): components and utilities needed by all crawler apps
  - External (sc-external): components that depend on external technologies (Elasticsearch and more)

What is it not? (storm-crawler components)
- Storm-Crawler is not a full-featured, ready-to-use web crawler application
  - We're in the process of building that separately; it will use the Storm-Crawler SDK
- No explicit link & content management (such as linkdb and crawldb in Nutch)
  - But we're quickly adding components to support recursive crawls
- No PageRank

Basic Topology (storm-crawler components)
Storm topologies consist of spouts and bolts (wiring sketch below).
(diagram: a Spout feeds a URL Partitioner bolt; URLs are distributed across Fetcher bolts 1..n, fetched pages flow to Parser bolts 1..n, and parsed documents end at an Indexer bolt)
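To make the wiring concrete, here is a minimal sketch using Storm's TopologyBuilder. The component class names and constructor arguments follow the slides but should be treated as assumptions, and note that older Storm releases used backtype.storm packages instead of org.apache.storm:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class CrawlTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout emitting seed URLs (class name and constructor assumed)
            builder.setSpout("spout", new FileSpout("seeds.txt"));

            // Partition URLs by host/domain/IP so politeness can be enforced
            builder.setBolt("partitioner", new URLPartitionerBolt())
                   .shuffleGrouping("spout");

            // All URLs sharing a partition key go to the same fetcher instance
            // ("key" is the partition field name assumed from the slides)
            builder.setBolt("fetch", new FetcherBolt(), 3)
                   .fieldsGrouping("partitioner", new Fields("key"));

            builder.setBolt("parse", new ParserBolt(), 2)
                   .localOrShuffleGrouping("fetch");

            builder.setBolt("index", new PrinterBolt())
                   .localOrShuffleGrouping("parse");

            new LocalCluster().submitTopology("crawl", new Config(),
                    builder.createTopology());
        }
    }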

Spouts (storm-crawler components)
- File spout (in sc-core): reads URLs from a file
- Elasticsearch spout (in sc-external): reads URLs from an Elasticsearch index; functioning, but we're working on improvements
- Other options (Redis, Kafka, etc.): discussed later in the presentation

Bolts (storm-crawler components)
The SDK includes several bolts that handle:
- URL partitioning
- Fetching
- Parsing
- Filtering
- Indexing
We'll briefly discuss each of these.

Bolts: URL Partitioner (storm-crawler components)
- Partitions incoming URLs by host, domain, or IP address; the strategy is configurable in the topology configuration file
- Creates a partition field in the tuple
- Storm's grouping feature can then be used to distribute tuples according to requirements: localOrShuffleGrouping() to randomly distribute URLs to fetchers, or fieldsGrouping() to ensure all URLs with the same {host, domain, IP} go to the same fetcher (both are sketched below)
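Continuing the topology sketch above, the two distribution strategies differ only in the grouping call; the "key" field name is an assumption:

    // Throughput-oriented: spread URLs randomly across fetcher instances
    builder.setBolt("fetch", new SimpleFetcherBolt(), 4)
           .localOrShuffleGrouping("partitioner");

    // Politeness-oriented: pin each {host, domain, IP} value to one fetcher
    builder.setBolt("fetch", new FetcherBolt(), 4)
           .fieldsGrouping("partitioner", new Fields("key"));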

Bolts: Fetchers (storm-crawler components)
Two fetcher bolts are provided in sc-core; both respect robots.txt.
- FetcherBolt
  - Multithreaded (configurable number of threads)
  - Use with fieldsGrouping() on the partition key and a configurable crawl delay to ensure your crawler is polite (see the configuration sketch below)
- SimpleFetcherBolt
  - No internal queues; concurrency is configured using the parallelism hint and number of tasks
  - Politeness must be handled outside of the topology
  - Easier to reason about, but requires additional work to enforce politeness
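A sketch of setting those knobs through the Storm configuration; the key names below are assumptions and should be verified against the project wiki:

    import org.apache.storm.Config;

    Config conf = new Config();
    // Number of fetching threads inside FetcherBolt (key name assumed)
    conf.put("fetcher.threads.number", 10);
    // Crawl delay between requests to the same host/domain/IP (key name assumed)
    conf.put("fetcher.server.delay", 1.0f);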

Bolts: Parsers (storm-crawler components)
- Parser Bolt
  - Uses Apache Tika for parsing
  - Collects, filters, normalizes, and emits outlinks
  - Collects page metadata (HTTP headers, etc.)
  - Parses the page's content to a text representation
- Sitemap Parser Bolt
  - Uses the Crawler-Commons sitemap parser
  - Collects, filters, normalizes, and emits outlinks
  - Requires a priori knowledge that a page is a sitemap

Bolts: Indexing (storm-crawler components)
- Printer Bolt (in sc-core): prints output to stdout; useful for debugging
- Elasticsearch Indexer Bolt (in sc-external): indexes parsed page content and metadata into Elasticsearch
- Elasticsearch Status Bolt (in sc-external): URLs and their status (discovered, fetched, error) are emitted to a special status stream in the Storm topology; this bolt indexes the URL, its metadata, and its status into a status Elasticsearch index (wiring sketch below)
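Routing tuples to the status bolt is ordinary Storm stream wiring; a sketch, continuing the topology example and assuming the status stream is literally named "status":

    // Subscribe the status-indexing bolt to the "status" stream emitted by
    // the fetch and parse bolts (stream and class names are assumptions)
    builder.setBolt("status_index", new StatusIndexerBolt())
           .localOrShuffleGrouping("fetch", "status")
           .localOrShuffleGrouping("parse", "status");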

Other components (storm-crawler components)
- URL Filters & Normalizers
  - Configurable with a JSON file (illustrative sketch below)
  - Regex filter & normalizer borrowed from Nutch
  - HostURLFilter enables you to ignore outlinks from outside domains or hosts
- Parse Filters
  - Useful for scraping and extracting info from pages
  - XPath-based parse filter, with more to come
- Filters & Normalizers are easily pluggable
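For a rough sense of that JSON file's shape, a hypothetical configuration might look like the following. The top-level key, class names, and parameters are assumptions reconstructed from the project's conventions and must be checked against the sample urlfilters.json in the repository (JSON has no comments, so treat every field name here as unverified):

    {
      "com.digitalpebble.storm.crawler.filtering.URLFilters": [
        {
          "class": "com.digitalpebble.storm.crawler.filtering.host.HostURLFilter",
          "name": "HostURLFilter",
          "params": {
            "ignoreOutsideHost": false,
            "ignoreOutsideDomain": true
          }
        }
      ]
    }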

Integrating Storm-Crawler (integration)
Because Storm-Crawler is an SDK, it needs to be integrated with other technologies to build a full-featured web crawler:
- At the very least, a database for URLs, metadata, and maybe content
  - Some search engines can double as your core data store (beware: research the Jepsen tests for caveats)
- Probably a search engine: Solr, Elasticsearch, etc.
  - sc-external provides basic integration with Elasticsearch
- Maybe some distributed-systems technologies for crawl control: Redis, Kafka, ZooKeeper, etc.

Storm-Crawler at Ontopic (integration)
- The storm-crawler SDK is our workhorse for web monitoring
- Integrated with Apache Kafka, Redis, and several other technologies
- Running on an EC2 cluster managed by Hortonworks HDP 2.2

Architecture (integration)
(diagram of Ontopic's pipeline; recoverable details below)
- Redis holds the seed list, the outlink list, and domain locks
- A URL Manager (Ruby app) manages Redis and publishes seeds and outlinks to Kafka
- Kafka: one topic with two partitions, carrying the seed stream and the outlink stream
- storm-crawler: one topology, with a Kafka Spout running two executors (one for each topic partition)
- Indexing into Elasticsearch; logstash ships Logstash events

R&D Cluster on AWS (integration)
(the same architecture diagram, annotated with instance types)
- Redis and the Ruby app: 1 x m1.small instance
- logstash and Elasticsearch: 1 x r3.large
- Storm: Nimbus on 1 x r3.large; Supervisors on 3 x c3.large (in a placement group)
- Kafka: 1 x c3.large

Integration Examples (integration)
- Formal crawl metadata specification & serialization with Avro
- Kafka publishing bolt: a component to publish crawl data to Kafka (complex URL status handling, for example, could then be performed by another topology); see the sketch below
- Externally-stored transient crawl data: components for storing shared crawl data (such as a robots.txt cache) in a key-value store (Redis, Memcached, etc.)
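A minimal sketch of what such a Kafka publishing bolt could look like. This is not the speaker's actual component; the tuple field names ("url", "content"), topic name, and broker address are assumptions:

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // Hypothetical bolt that forwards crawl results to a Kafka topic so a
    // downstream topology can consume them; not the speaker's actual code.
    public class KafkaPublisherBolt extends BaseRichBolt {
        private transient KafkaProducer<String, String> producer;
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producer = new KafkaProducer<>(props);
        }

        @Override
        public void execute(Tuple tuple) {
            // Field names "url" and "content" are assumptions for illustration
            String url = tuple.getStringByField("url");
            String content = tuple.getStringByField("content");
            producer.send(new ProducerRecord<>("crawl-data", url, content));
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: no output streams declared
        }
    }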

Use Cases & Users (use cases)
- Processing streams of URLs: http://www.weborama.com
- Continuous URL monitoring: http://www.shopstyle.com, http://www.ontopic.io
- One-off non-recursive crawling: http://www.stolencamerafinder.com
- Recursive crawling: http://www.shopstyle.com
- More in development & stealth mode

Demonstration
(live demo of Ontopic's topology)

Q&A
Any questions?

Resources
- Project page: https://github.com/digitalpebble/storm-crawler
- Project documentation: https://github.com/digitalpebble/storm-crawler/wiki
- Previous presentations:
  - http://www.slideshare.net/digitalpebble/j-niochebristoljavameetup20150310
  - http://www.slideshare.net/digitalpebble/storm-crawlerontopic20141113?related=1
- Other resources: http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
Thank you!