Indexing big data with Tika, Solr, and map-reduce

Similar documents

Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012

CSE-E5430 Scalable Cloud Computing Lecture 2

PASIG San Diego March 13, 2015

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Archive-IT Services Andrea Mills Booksgroup Collections Specialist

Full Text Search of Web Archive Collections

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Creating a billion-scale searchable web archive. Daniel Gomes, Miguel Costa, David Cruz, João Miranda and Simão Fontes

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Open source Google-style large scale data analysis with Hadoop

WEB ARCHIVING AT SCALE

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

Apache Tika for Enabling Metadata Interoperability

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Building Multilingual Search Index using open source framework

Big Data on Microsoft Platform

How To Understand Web Archiving Metadata

A Service for Data-Intensive Computations on Virtual Clusters

MapReduce with Apache Hadoop Analysing Big Data

Open source large scale distributed data management with Google s MapReduce and Bigtable

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr

NoSQL Roadshow Berlin Kai Spichale

Applying Apache Hadoop to NASA s Big Climate Data!

IIPC Metadata Workshop. Brad Tofel Vinay Goel Aaron Binns Internet Archive

Crisis, Tragedy, and Recovery Network Digital Library (CTRnet) + Web Archiving in Qatar and VT

Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015

Search and Real-Time Analytics on Big Data

Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony

Vaidya Guide. Table of contents

Building a master s degree on digital archiving and web archiving. Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF)

Case Study : 3 different hadoop cluster deployments

SPM: Large Scale Performance Monitoring for ElasticSearch HBase Solr & Friends. Otis Gospodnetić sematext.

Unified Batch & Stream Processing Platform

Investigating Hadoop for Large Spatiotemporal Processing Tasks

Hadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)

Log Mining Based on Hadoop s Map and Reduce Technique

Hadoop IST 734 SS CHUNG

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Web and Twitter Archiving at the Library of Congress

FISHER CAM-H300C-3F CAM-H650C-3F CAM-H1300C-3F

THE HADOOP DISTRIBUTED FILE SYSTEM

Introduction to Hadoop

KEYSTONE - Short scientific report for STSM visit in TU Delft

Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Collaborative Open Source Software Production & APIs. Tom Cramer Stanford

Scalable Cloud Computing

The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project

Homework: Visual Search and Interaction with NSF and NASA Polar Datasets Due: May 2nd, 2015, 12pm PT

Hadoop Ecosystem B Y R A H I M A.

Big Data for Investment Research Management

File S1: Supplementary Information of CloudDOE

Real Time Analytics for Big Data. NtiSh Nati

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Practical Options for Archiving Social Media

Optimization of Distributed Crawler under Hadoop

Large scale processing using Hadoop. Ján Vaňo

How To Run Apa Tika On A Microsoft Macbook Or Ipa.Net (For Linux) Or Ipad (For Windows) (For Macbook) (Or Ipa) (On Linux) (Minor) (Large

Open Source Server Product Description

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014

Big Data Infrastructure at Spotify

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

From Wikipedia, the free encyclopedia

BIG DATA What it is and how to use?

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Scaling Out With Apache Spark. DTL Meeting Slides based on

Click Stream Data Analysis Using Hadoop

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

Oracle Big Data Fundamentals Ed 1 NEW

Apache Hadoop. Alexandru Costan

Big Data for Investment Research Management

Hadoop Architecture. Part 1

Web Crawling with Apache Nutch

Cloudera Certified Developer for Apache Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Journal of Environmental Science, Computer Science and Engineering & Technology

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

the missing log collector Treasure Data, Inc. Muga Nishizawa

Extreme Computing. Hadoop. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

Improving MapReduce Performance in Heterogeneous Environments

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Bigg-Data LLC, Data Scientists Hadoop Developers/Administrators

Scalable Computing with Hadoop

A Brief Outline on Bigdata Hadoop

Hadoop Architecture and its Usage at Facebook

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

Hadoop Big Data for Processing Data and Performing Workload

Moving From Hadoop to Spark

Developing a MapReduce Application

Map Reduce & Hadoop Recommended Text:

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Shared service components infrastructure for enriching electronic publications with online reading and full-text search

Transcription:

Indexing big data with Tika, Solr, and map-reduce Scott Fisher, Erik Hetzner California Digital Library 8 February 2012 Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 1 / 19

Outline Introduction Introduction Tika Pig Solr Done! Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 2 / 19

Introduction Web Archiving Service Service provided by the California Digital Library Fee-based Archiving web sites, as selected by curators Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 3 / 19

Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 4 / 19

Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 4 / 19

Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 4 / 19

Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 4 / 19

Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 4 / 19

Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 4 / 19

Tools Introduction Open source and rails UI for crawl management and display of many focused web crawls. Heritrix - NutchWAX - Wayback Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 5 / 19

Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 6 / 19

Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 6 / 19

Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 6 / 19

Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 6 / 19

Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 6 / 19

Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 7 / 19

Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 7 / 19

Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 7 / 19

Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 8 / 19

Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 8 / 19

Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 8 / 19

Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 8 / 19

Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 8 / 19

Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 9 / 19

Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 9 / 19

Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 9 / 19

Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 10 / 19

Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 10 / 19

What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 11 / 19

What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 11 / 19

What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 11 / 19

What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 11 / 19

What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 11 / 19

What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 11 / 19

Pig example Pig Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 12 / 19

Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 13 / 19

Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 13 / 19

Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 13 / 19

Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 14 / 19

Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 14 / 19

Faceting Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 15 / 19

Faceting 2 Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 16 / 19

Mime types Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 17 / 19

Solr XML Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 18 / 19

Finale Introduction Be careful when you try to parse at a bunch of files you downloaded from the web. Parse and store. Distribute up front. Build a test index first. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 19 / 19