Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380





Agenda
> What's a search engine?
> Lucene Java
    Features
    Code example
> Solr
    Features
    Integration
> Nutch
    Features
    Usage example
> Conclusion and alternative solutions

About the Speaker
> Studied computational linguistics
> Java developer
> Worked 3.5 years for an enterprise search company (using Lucene Java)
> Now at Mindquarry, creators of an open source collaboration software (Mindquarry uses Solr)

Question: What is a Search Engine?
> Answer: software that
    builds an index on text
    answers queries using that index
> "But we have a database already"
    A search engine offers:
      Scalability
      Relevance ranking
      Integration of different data sources (email, web pages, files, databases, ...)

What is a search engine? (cont.)
> Works on words, not on substrings
    auto != automatic, automobile
> Indexing process:
    Convert document
    Extract text and metadata
    Normalize text
    Write (inverted) index
> Example:
    Document 1: Apache Lucene at Jazoon
    Document 2: Jazoon conference
    Index:
      apache     -> 1
      conference -> 2
      jazoon     -> 1, 2
      lucene     -> 1
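The inverted index from the example can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not how Lucene stores its index: the class name `InvertedIndexSketch`, the `build` method, and the stop-word set passed to it are assumptions made for this example. The stop word "at" is filtered out so that the result matches the index shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy inverted index (hypothetical helper, not part of Lucene). */
final class InvertedIndexSketch {

    /** Maps each normalized term to the sorted IDs of the documents containing it (IDs start at 1). */
    static Map<String, List<Integer>> build(List<String> documents, Set<String> stopWords) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            int docId = i + 1;
            // Normalize: lowercase, then split on anything that is not a letter or digit.
            String[] terms = documents.get(i).toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String term : terms) {
                if (term.isEmpty() || stopWords.contains(term)) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // Record each document only once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build(
                List.of("Apache Lucene at Jazoon", "Jazoon conference"), Set.of("at"));
        index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Answering a one-word query then amounts to a single map lookup, which is why a search engine does not need to scan the documents themselves.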

Apache Lucene Overview
> Lucene Java 2.2: Java library
> Solr 1.2: HTTP-based index and search server
> Nutch 0.9: Internet search engine software
> http://lucene.apache.org

Lucene Java
> Java library for indexing and searching
> No dependencies (not even a logging framework)
> Works with Java 1.4 or later
> Input for indexing: Document objects
    Each document: a set of Fields, each mapping a field name to field content (plain text)
> Input for searching: query strings or Query objects
> Stores its index as files on disk
> No document converters
> No web crawler

Lucene Java Users
> IBM OmniFind Yahoo! Edition
> technorati.com
> Eclipse
> Furl
> Nuxeo ECM
> Monster.com
> ...

Lucene Java Features
> Powerful query syntax
> Create queries from user input or programmatically
> Fast indexing
> Fast searching
> Sorting by relevance or other fields
> Large and active community
> Apache License 2.0

Lucene Query Syntax
> Query examples:
    jazoon
    jazoon AND java          <=>  +jazoon +java
    jazoon OR java
    jazoon NOT php           <=>  jazoon -php
    conference AND (java OR j2ee)
    Java conference
    title:jazoon
    j?zoon
    jaz*
    schmidt~                 (matches schmidt, schmit, schmitt)
    price:[000 TO 050]
    + more

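Lucene Code Example: Indexing

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    Analyzer analyzer = new StandardAnalyzer();
    // "true" creates a new index, replacing any existing one at that path
    IndexWriter iw = new IndexWriter("/tmp/testindex", analyzer, true);

    // Repeat this part in a loop, once per document
    Document doc = new Document();
    doc.add(new Field("body", "This is my TEST document",
                      Field.Store.YES, Field.Index.TOKENIZED));
    iw.addDocument(doc);

    iw.optimize();
    iw.close();

StandardAnalyzer turns the body text into the tokens: my, test, document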
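To make the effect of the analyzer on "This is my TEST document" concrete, here is a rough stand-in written with the JDK only. It is not StandardAnalyzer: the class `SimpleAnalyzerSketch`, its `analyze` method, and the five-word stop list are assumptions for illustration, and the real analyzer uses a more elaborate tokenizer and a longer stop-word list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified tokenizer + lowercasing + stop-word filter (hypothetical, not Lucene code). */
final class SimpleAnalyzerSketch {

    // Assumed stop words for this example only; not Lucene's actual list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "is", "the", "this");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is my TEST document")); // prints [my, test, document]
    }
}
```

Because the same normalization has to be applied to the query, a search for "TEST" or "Test" would find the document, while a search for "this" would not.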

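Lucene Code Example: Searching

    import java.util.Iterator;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hit;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Use the same analyzer as at indexing time
    Analyzer analyzer = new StandardAnalyzer();
    IndexSearcher is = new IndexSearcher("/tmp/testindex");

    // "body" is the default field for terms without a field prefix
    QueryParser qp = new QueryParser("body", analyzer);
    String userInput = "document AND test";
    Query q = qp.parse(userInput);
    Hits hits = is.search(q);
    for (Iterator iter = hits.iterator(); iter.hasNext();) {
        Hit hit = (Hit) iter.next();
        System.out.println(hit.getScore() + " " + hit.get("body"));
    }

    is.close();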

Lucene Hints
> Tools:
    Luke: Lucene index browser, http://www.getopt.org/luke/
    Lucli
> Common pitfalls and misconceptions
    Only the first 10,000 tokens of a field are indexed by default, see IndexWriter.setMaxFieldLength()
    There's no error if a field doesn't exist
    You cannot update single fields
    You cannot join tables (Lucene is based on documents, not tables)
    Lucene works on strings only -> as a string, 42 is between 1 and 9; use zero-padded values such as 0042
    Do not misuse Lucene as a database
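The "42 is between 1 and 9" pitfall follows directly from string ordering and can be reproduced without Lucene. The sketch below uses only `String.compareTo`; the class `PaddedNumberSketch` and its methods are hypothetical names for this example, and negative numbers are deliberately left out.

```java
import java.util.Locale;

/** Shows why numbers need zero padding when they are compared as strings (hypothetical helper). */
final class PaddedNumberSketch {

    /** Left-pads a non-negative number with zeros to a fixed width, e.g. 42 -> "0042". */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    /** True if low <= value <= high in lexicographic (string) order. */
    static boolean inStringRange(String value, String low, String high) {
        return low.compareTo(value) <= 0 && value.compareTo(high) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: "42" sorts between "1" and "9" because only the first character decides.
        System.out.println(inStringRange("42", "1", "9"));                         // prints true
        // Padded to a common width, string order agrees with numeric order.
        System.out.println(inStringRange(pad(42, 4), pad(1, 4), pad(9, 4)));       // prints false
    }
}
```

The padding width has to be chosen once, large enough for the biggest expected value, and applied both when indexing and when building range queries such as price:[000 TO 050].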

Advanced Lucene Java
> Text normalization (Analyzer)
    Tokenize: "foo-bar: text" -> foo, bar, text
    Lowercase
    Linguistic normalization (children -> child)
    Stopword removal (the, a, ...)
    You can create your own Analyzer (used for both search and indexing)
> Ranking algorithm
    TF-IDF (term frequency, inverse document frequency)
    You can add your own algorithm
    Difficult to evaluate
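The idea behind TF-IDF can be shown with the textbook formula tf * ln(N / df), where tf is the number of occurrences of the term in a document, N the number of documents, and df the number of documents containing the term. This is a sketch of the general idea only; Lucene's own scoring uses a modified formula with additional factors, and the class `TfIdfSketch` and its methods are names assumed for this example.

```java
import java.util.List;

/** Textbook TF-IDF on already tokenized documents (hypothetical helper, not Lucene's scoring). */
final class TfIdfSketch {

    /** Number of occurrences of the term in one document. */
    static int termFrequency(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Number of documents in the corpus that contain the term at least once. */
    static int documentFrequency(String term, List<List<String>> corpus) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** tf * ln(N / df); returns 0 for terms that occur nowhere in the corpus. */
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        int df = documentFrequency(term, corpus);
        if (df == 0) {
            return 0.0;
        }
        return termFrequency(term, doc) * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> doc1 = List.of("apache", "lucene", "jazoon");
        List<String> doc2 = List.of("jazoon", "conference");
        List<List<String>> corpus = List.of(doc1, doc2);
        System.out.println("lucene in doc1: " + tfIdf("lucene", doc1, corpus));
        System.out.println("jazoon in doc1: " + tfIdf("jazoon", doc1, corpus));
    }
}
```

In this tiny corpus "jazoon" occurs in every document, so it scores 0 and does not help to distinguish documents, while the rarer "lucene" gets a positive weight.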

Lucene Java: How to get Started
> API docs
    http://lucene.zones.apache.org:8080/hudson/job/lucene-Nightly/javadoc/overview-summary.html#overview_description
> FAQ
    http://wiki.apache.org/lucene-java/lucenefaq

Lucene Java Summary
> Java library for indexing and searching
> Lightweight / no dependencies
> Powerful and fast
> No document conversion
> No end-user front-end

Solr
> An index and search server (Jetty)
> A web application
> Requires Java 5.0 or later
> Builds on Lucene Java
> Programming is needed only to build and parse XML
    No programming at all when using Cocoon
> Communicates via HTTP
    Index: use HTTP POST to send XML
    Search: use a GET request, Solr returns XML
    Parameters, e.g.
      q = query
      start
      rows
    Future versions will make use without HTTP easier (Java API)

Solr Indexing Example
> HTTP POST to http://localhost:8983/solr/update

    <add>
      <doc>
        <field name="url">http://www.myhost.org/solr-rocks.html</field>
        <field name="title">Solr is great</field>
        <field name="creationdate">2007-06-25T12:04:00.000Z</field>
        <field name="content">Solr is a great open source search server. It scales, it's easy to configure...</field>
      </doc>
    </add>

> Delete a document: POST this XML:

    <delete><query>myid:12345</query></delete>
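Since a client only has to build XML, the add message can be assembled with plain string handling. The following is a minimal sketch under that assumption: `SolrAddXmlSketch`, `escape`, and `addDocument` are hypothetical helpers, not a Solr client API, and the sketch only builds the message without sending it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds the XML body for a Solr "add" request (hypothetical helper, no HTTP involved). */
final class SolrAddXmlSketch {

    /** Escapes the characters that are not allowed verbatim in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps the given field name/value pairs into <add><doc>...</doc></add>. */
    static String addDocument(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("url", "http://www.myhost.org/solr-rocks.html");
        fields.put("title", "Solr is great");
        System.out.println(addDocument(fields));
    }
}
```

The resulting string would be sent as the body of an HTTP POST to the update URL shown above. Escaping matters as soon as field content contains characters such as & or <, which is common for URLs and extracted text.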

Solr Search Example
> GET this URL: http://localhost:8983/solr/select/?indent=on&q=solr
> Response (simplified!):

    <response>
      <result name="response" numFound="1" start="0" maxScore="1.0">
        <doc>
          <float name="score">1.0</float>
          <str name="title">Solr is great</str>
          <str name="url">http://www.myhost.org/solr-rocks.html</str>
        </doc>
      </result>
    </response>
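User input has to be URL-encoded before it goes into the q parameter. The sketch below builds such a select URL with the JDK's `URLEncoder` (the `Charset` overload needs Java 10 or later). The class `SolrSelectUrlSketch` and the method `selectUrl` are names assumed for this example; only the parameters q, start, and rows come from the slides.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a Solr select URL from a query string and paging values (hypothetical helper). */
final class SolrSelectUrlSketch {

    static String selectUrl(String baseUrl, String query, int start, int rows) {
        // URLEncoder produces form encoding: spaces become '+', ':' becomes %3A.
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return baseUrl + "/select/?q=" + encodedQuery + "&start=" + start + "&rows=" + rows;
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "title:solr AND great", 0, 10));
    }
}
```

Paging through results then only means requesting the same query again with a larger start value.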

Solr Faceted Browsing
> Makes it easy to browse large search results

Solr Faceted Browsing (cont.)
> schema.xml:
    <field name="topic" type="string" indexed="true" stored="true"/>
> Query URL:
    http://.../select?facet=true&facet.field=topic
> Output from Solr:
    <lst name="topic">
      <int name="genetic algorithms">6</int>
      <int name="artificial intelligence">3</int>
      ...
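Conceptually, the facet output is a count of matching documents per field value. The sketch below computes such counts in memory from a list of topic values, one per matching document. It is meant to illustrate what the numbers mean, not how Solr computes them; the class `FacetCountSketch` and the method `facetCounts` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Counts documents per facet value, most frequent value first (hypothetical helper). */
final class FacetCountSketch {

    static Map<String, Integer> facetCounts(List<String> topicPerDocument) {
        Map<String, Integer> counts = new HashMap<>();
        for (String topic : topicPerDocument) {
            counts.merge(topic, 1, Integer::sum);
        }
        // Sort by count (descending), then by name so that ties have a stable order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> topics = List.of("artificial intelligence", "genetic algorithms",
                "genetic algorithms", "artificial intelligence", "genetic algorithms");
        facetCounts(topics).forEach((topic, n) -> System.out.println(topic + ": " + n));
    }
}
```

A user interface would typically render each value with its count as a link that narrows the current search to that topic.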

Solr: How to get Started
> Download Solr 1.2
> Install the WAR
> Use the post.jar from the exampledocs directory to index some documents
> Browse to the admin panel at http://localhost:8080/solr/admin/ and run some searches
> Configure schema.xml and solrconfig.xml in WEB-INF/classes
> Details at "Search smarter with Apache Solr":
    http://www.ibm.com/developerworks/java/library/j-solr1/
    http://www.ibm.com/developerworks/java/library/j-solr2/
> FAQ: http://wiki.apache.org/solr/faq

Solr Summary
> A search server
> Access via XML sent over HTTP
    Client doesn't need to be Java
> Web-based administration panel
> Like Lucene Java, it does no document conversion
> Security: make sure your Solr server cannot be accessed from outside!

Nutch
> Internet search engine software (software only, not the search service)
> Builds on Lucene Java for indexing and search
> Command line for indexing
> Web application for searching
> Contains a web crawler
> Adds document converters
> Issues:
    Scalability
    Crawler politeness
    Crawler management
    Web spam

Nutch Users
> Internet Archive: www.archive.org
> Krugle: krugle.com
> Several vertical search engines, see http://wiki.apache.org/nutch/publicservers

Getting started with Nutch
> Download Nutch 0.9 (try SVN in case of problems)
> Indexing:
    Add start URLs to a text file
    Configure conf/crawl-urlfilter.txt
    Configure conf/nutch-site.xml
    Command line call:
      bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Searching:
    Install the WAR
    Search at e.g. http://localhost:8080/


Nutch Summary
> Powerful for vertical search engines
> Meant for indexing Intranet/Internet via HTTP; indexing local files is possible with some configuration
> Not as mature as Lucene and Solr yet
> You will need to invest some time

Other Lucene Features
> "Did you mean..."
    Spell checker based on the terms in the index
    See contrib/spellchecker in Lucene Java
> Find similar documents
    Selects documents similar to a given document, based on the document's significant terms
    See contrib/queries MoreLikeThis.java in Lucene Java
> NON-features: security
    Lucene doesn't care about security!
    You need to filter results yourself
    For Solr, you need to secure HTTP access
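A "Did you mean..." suggestion based on index terms can be approximated by picking the indexed term with the smallest edit distance to the misspelled word. The sketch below uses plain Levenshtein distance for this. It is not the implementation in contrib/spellchecker; the class `DidYouMeanSketch`, the methods `editDistance` and `suggest`, and the `maxDistance` cut-off are assumptions made for this example.

```java
import java.util.Collection;
import java.util.List;

/** Suggests the closest index term by Levenshtein distance (hypothetical helper, not Lucene code). */
final class DidYouMeanSketch {

    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(substitution, Math.min(previous[j] + 1, current[j - 1] + 1));
            }
            previous = current;
        }
        return previous[b.length()];
    }

    /** Returns the closest term within maxDistance edits, or null if there is none. */
    static String suggest(String word, Collection<String> indexTerms, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String term : indexTerms) {
            int distance = editDistance(word, term);
            if (distance < bestDistance) {
                best = term;
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apache", "conference", "jazoon", "lucene");
        System.out.println("Did you mean: " + suggest("lucen", terms, 2));
    }
}
```

Comparing the input against every term is fine for a sketch but too slow for a large index, which is one reason a real spell checker needs a smarter candidate selection.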

Other Projects at Apache Lucene
> Hadoop: a distributed computing platform
    Map/Reduce
    Used by Nutch
> Lucene.Net: C# port of Lucene, compatible on any level (API, index, ...)
    Used by Beagle, Wikipedia, ...

Lucene project: The big Picture
> Lucene: Java full-text search library
> Solr = Lucene Java
    + Web administration frontend
    + HTTP frontend
    + Typed fields (schema)
    + Faceted browsing
    + Configurable caching
    + XML configuration, no Java needed
    + Document IDs
    + Replication
> Nutch = Lucene Java
    + Hadoop
    + Web crawler
    + Document converters
    + Web search frontend
    + Link analysis
    + Distributed search

Alternative Solutions for Search
> Commercial vendors (FAST, Autonomy, Google, ...)
    Enterprise search
> Commercial search engines based on Lucene and Lucene support (see Wiki)
    IBM OmniFind Yahoo! Edition
> RDBMS with integrated search features
    Lucene has more powerful syntax and can be easily adapted and integrated
> Egothor
    Lucene has a much bigger community

Conclusion
> - no Enterprise Search (but: Intranet indexing using Nutch)
> + can be embedded or integrated in almost any situation
> + fast
> + powerful
> + large, helpful community
> + the quasi-standard in Open Source search

Daniel Naber
www.danielnaber.de
dnaber@apache.org
Mindquarry GmbH
www.mindquarry.com
Presentation license: http://creativecommons.org/licenses/by/3.0/