Indexing big data with Tika, Solr, and map-reduce

Size: px

Start display at page:

Download "Indexing big data with Tika, Solr, and map-reduce"

Clyde Lucas Leonard
10 years ago
Views:

1 Indexing big data with Tika, Solr, and map-reduce Scott Fisher, Erik Hetzner California Digital Library 8 February 2012 Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

2 Outline Introduction Introduction Tika Pig Solr Done! Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

3 Introduction Web Archiving Service Service provided by the California Digital Library Fee-based Archiving web sites, as selected by curators Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

web sites, as selected by curators Scott Fisher, Erik

4 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

5 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

6 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

7 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

8 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

9 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

10 Tools Introduction Open source and rails UI for crawl management and display of many focused web crawls. Heritrix - NutchWAX - Wayback Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

11 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

no longer default with Nutch itself Deduplicating content requires a more

12 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

13 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

14 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

15 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

16 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

17 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

18 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

19 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

20 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

21 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

22 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

23 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

24 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

25 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

26 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

27 Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

28 Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

29 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

30 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

31 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

32 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

33 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

34 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

35 Pig example Pig Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

36 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

37 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

38 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

39 Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

40 Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

41 Faceting Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

42 Faceting 2 Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

43 Mime types Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

44 Solr XML Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

45 Finale Introduction Be careful when you try to parse at a bunch of files you downloaded from the web. Parse and store. Distribute up front. Build a test index first. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19

Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012

Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012 Overview The tools: Heritrix crawler Wayback browse access Lucene/Hadoop utilities: