How To Write A Search Engine Indexer For A Dog



Similar documents
Search and Information Retrieval

Information Retrieval Elasticsearch

Lucene in Action OTIS GOSPODNETIC ERIK HATCHER MANNING. Greenwich (74 w. long.)

Terrier: A High Performance and Scalable Information Retrieval Platform

Data Discovery on the Information Highway

Using Apache Solr for Ecommerce Search Applications

Building Multilingual Search Index using open source framework

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

Analysis of Web Archives. Vinay Goel Senior Data Engineer

A Comparison of Open Source Search Engines. Christian Middleton, Ricardo Baeza-Yates

How To Install Linux Titan

CatDV Pro Workgroup Serve r

W3Perl A free logfile analyzer

StruxureWare Data Center Expert Release Notes

Binonymizer A Two-Way Web-Browsing Anonymizer

A Performance Analysis of Distributed Indexing using Terrier

Project 2: Term Clouds (HOF) Implementation Report. Members: Nicole Sparks (project leader), Charlie Greenbacker

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

MO 25. Aug. 2008, 17:00 UHR RICH INTERNET APPLICATIONS MEHR BISS FÜR WEBANWENDUNGEN

Evaluation of Load/Stress tools for Web Applications testing

Search Big Data with MySQL and Sphinx. Mindaugas Žukas

Collaborative Open Market to Place Objects at your Service

StruxureWare Data Center Expert Release Notes

Load and Performance Load Testing. RadView Software October

System Services. Engagent System Services 2.06

WHITE PAPER. Domo Advanced Architecture

XpoLog Center Suite Log Management & Analysis platform

Distributed Computing and Big Data: Hadoop and MapReduce

AlienVault Unified Security Management (USM) 4.x-5.x. Deployment Planning Guide

Flattening Enterprise Knowledge

Course Scheduling Support System

Efficiency Considerations of PERL and Python in Distributed Processing

Reusability of WSDL Services in Web Applications

Software Design April 26, 2013

MD Link Integration MDI Solutions Limited

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Efficiency of Web Based SAX XML Distributed Processing

ITG Software Engineering

Analyzing large flow data sets using. visualization tools. modern open-source data search and. FloCon Max Putas

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Inmagic Content Server Workgroup Configuration Technical Guidelines

FileMaker 11. ODBC and JDBC Guide

DocDokuPLM Innovative PLM solution

Cross Platform Applications with IBM Worklight

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

Client Overview. Engagement Situation. Key Requirements for Platform Development :

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

MEGA Web Application Architecture Overview MEGA 2009 SP4

A Tool for Evaluation and Optimization of Web Application Performance

Implementing Mobile Thin client Architecture For Enterprise Application

SEARCHING THE SUNDIAL MAIL ARCHIVE by Carl Sabanski / edited by Mac Oglesby

WEB SERVICES FOR MOBILE COMPUTING

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

Snare System Version Release Notes

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Qlik REST Connector Installation and User Guide

USE OF PYTHON AS A SATELLITE OPERATIONS AND TESTING AUTOMATION LANGUAGE

ifinder ENTERPRISE SEARCH

Assignment # 1 (Cloud Computing Security)

Zend Server 4.0 Beta 2 Release Announcement What s new in Zend Server 4.0 Beta 2 Updates and Improvements Resolved Issues Installation Issues

FileMaker 12. ODBC and JDBC Guide

XpoLog Center Suite Data Sheet

Web. Services. Web Technologies. Today. Web. Technologies. Internet WWW. Protocols TCP/IP HTTP. Apache. Next Time. Lecture # Apache.

Windows PCs & Servers are often the life-blood of your IT investment. Monitoring them is key, especially in today s 24 hour world!

Visualizing a Neo4j Graph Database with KeyLines

WIRIS quizzes web services Getting started with PHP and Java

Information Retrieval Systems in XML Based Database A review

Functions of NOS Overview of NOS Characteristics Differences Between PC and a NOS Multiuser, Multitasking, and Multiprocessor Systems NOS Server

Selenium WebDriver. Gianluca Carbone. Selenium WebDriver 1

The full setup includes the server itself, the server control panel, Firebird Database Server, and three sample applications with source code.

BogDan Vatra and Andy Gryc. Qt on Android: Is it right for you?

A Universal Visualization Platform

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

The New RERO Statistics Services

Benjamin Carlson Node.js SQL Server for Data Analytics and Multiplayer Games Excerpt from upcoming MQP-1404 Project Report Advisor: Brian Moriarty

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

Snare System Version Release Notes

Actuate Business Intelligence and Reporting Tools (BIRT)

Symantec ediscovery Platform, powered by Clearwell

BarTender Web Print Server

ICE Trade Vault. Public User & Technology Guide June 6, 2014

Search Engine Technology and Digital Libraries: Moving from Theory t...

CS 558 Internet Systems and Technologies

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

Wiley. Automated Data Collection with R. Text Mining. A Practical Guide to Web Scraping and

Open EMS Suite. O&M Agent. Functional Overview Version 1.2. Nokia Siemens Networks 1 (18)

Mobility Information Series

Network operating systems typically are used to run computers that act as servers. They provide the capabilities required for network operation.

Unifying Search for the Desktop, the Enterprise and the Web

FileMaker Server 15. Custom Web Publishing Guide

Log Analysis with the ELK Stack (Elasticsearch, Logstash and Kibana) Gary Smith, Pacific Northwest National Laboratory

Institutionen för datavetenskap Department of Computer and Information Science

Help. F-Secure Online Backup

CrossPlatform ASP.NET with Mono. Daniel López Ridruejo

17 March 2013 NIEM Web Services API Version 1.0 URI:

Easy configuration of NETCONF devices

EFFECTIVE STRATEGIES FOR SEARCHING ORACLE UCM. Alan Mackenthun Senior Software Consultant 4/23/2010. F i s h b o w l S o l u t I o n s

Transcription:

Open Source IR Tools and Libraries Giorgos Vasiliadis, gvasil@csd.uoc.gr CS-463 Information Retrieval Models Computer Science Department University of Crete 1

Outline Google Search API Lucene Terrier Lemur 2

Google Search API 3

Google Search API: Overview The API exposes the Google engine to developers. You can write scripts that access the Google search in real-time. Google no longer issuing new API keys for the SOAP Search API. Instead, Google provides an AJAX Search API. You can put Google Search in your web pages with JavaScript. 4

Google Search API: SOAP Based on the Web Services Technology SOAP (the XML-based Simple Object Access Protocol). Developers write software programs that connect remotely to the Google SOAP Search API service. Developers can issue search requests to Google s index of billions of web pages and receive results as structured data, access information in the Google cache and check the spelling of words. Limitations Default limit of 1,000 queries per day. Can only query for 10 results a time Can only access Google Web Search (not Google Images, Google Groups and so on). 5

Google Search API: AJAX Lets you put Google Search in your web pages with JavaScript. Does not have a limit on the number of queries per day. Supports additional features like Video, News, Maps, and Blog search results. 6

Google Search API: AJAX Web Search Incorporate results from Web Search, News Search, and Blog Search Local Search Provides access to local search results from Google Maps. Video Search Incorporate a simple search box incorporate dynamic, search powered strips of video and book thumbnails. 7

Google Search API: Demo 8

Google Search API: References Google SOAP Search API http://code.google.com/apis/soapsearch/ Google AJAX Search API http://code.google.com/apis/ajaxsearch/ Google AJAX Search API Developer Guide http://code.google.com/apis/ajaxsearch/documentation/ Google AJAX Search API Samples http://code.google.com/apis/ajaxsearch/samples.html 9

Lucene 10

Lucene Doug Cutting s grandmother s middle name Cross-Platform API Implemented in Java Ported in C++, C#, Perl, Python Offers scalable, high-performance indexing Incremental indexing as fast as bath indexing Index size roughly 20-30% the size of indexed text Supports many powerful query types 11

Lucene: Modules Analysis Tokenization, Stop words, Stemming, etc. Document Unique ID for each document Title of document, date modified, content, etc. Index Provides access and maintains indexes. Query Parser Search / Search Spans 12

Lucene: Indexing A Document is a collection of Fields Document: Field 1 Field 2 Field N A Field is free text, keywords, dates, etc. A Field can have several characteristics indexed, tokenized, stored, term vectors Apply Analyzer to alter Tokens during indexing Stemming Stop-word removal Phrase identification 13

Lucene: Searching Uses a modified Vector Space Model We convert a user s query into an internal representation that can be searched against the Index Queries are usually analyzed in the same manner as the index Get back Hits and use in an application 14

Lucene: Query Parser Syntax Terms Fields Single terms and phrases E.g. title:"do it right" AND right Wildcard Searches? for single character * for multiple characters Proximity Searches jakarta apache"~10 Fuzzy Searches Levenshtein Distance or Edit Distance algorithm Range Searches mod_date:[20020101 TO 20030101] title:{aida TO Carmen} Boosting a Term E.g. jakarta^4 apache Boolean Operators 15

Lucene: More Advanced Options Relevance Feedback Manual User selects which documents are relevant/non-relevant Get the terms from the term vector for each of the documents and construct a new query. Automatic Application assumes the top X documents are relevant and the bottom Y are non-relevant and constructs a new query based on the terms in those documents. Span Queries Phrase Matching 16

Lucene: Basic Demo The latest version can be obtained from http://www.apache.org/dyn/closer.cgi/lucene/java/ To build an index just type java org.apache.lucene.demo.indexfiles <dir> To search from an index type: java org.apache.lucene.demo.searchfiles <index> 17

Terrier 18

Terrier: Overview Stands for TERabyte RetrIEveR. Open Source API (Mozilla Public Licence). Written in cross-platform Java. Highly compressed disk data structures. Handling large-scale document collections. Standard evaluation of TREC ad-hoc and known-item search retrieval results. 19

Terrier: Indexing Create your own Collection decoder and Document implementation. Centralized or distributed Setting. Indexer iterates through the collection and creates the following data structures Direct Index Document Index Lexicon 20

Terrier: Indexing Each document in the collection is tokenized and parsed. 21

Terrier: Indexing In this way, we build the direct and document indices. 22

Terrier: Indexing We also build temporary lexicons in order to reduce the required memory during indexing 23

Terrier: Indexing The inverted index is built from the existing direct index, document index and lexicon 24

Terrier: Retrieval Parsing Pre-processing Matching Post Processing Post Filtering Query Language term1 term2 term1^2.3 +term1 -term2 "term1 term2"~n 25

Terrier: Retrieval Remove stop words and apply stemming to the query. 26

Terrier: Retrieval Terrier automatically select the optimal document weighting model 27

Terrier: Retrieval If Query Expansion is applied an appropriate term weighting model is selected and the most informative terms from the top ranked documents are added to the query 28

Terrier: Sample Applications Trec Terrier An application that allows Terrier to index and retrieve from standard TREC test collections. Instructions are available at http://ir.dcs.gla.ac.uk/terrier/doc/trec_terrier.html 29

Terrier: Sample Applications Desktop Search A Swing (graphical) application that can be used to index files from the local machine, and then perform queries on them. The scripts for running the desktop search application are: desktop_search.sh (Linux, Mac OSX) desktop_search.bat (Windows) 30

Terrier: Sample Applications Interactive Querying A console application for performing simple queries on an existing index and seeing which documents are returned. The scripts for running the console application are: interactive_terrier.sh (Linux, Mac OS X) interactive_terrier.bat (Windows) 31

Terrier: Demo 32

Lemur 33

Lemur: Overview Support for XML and structured document retrieval Interactive interfaces for Windows, Linux, and Web Cross-Platform, fast and modular code writteninc++ C++, Java and C# APIs Free and open-source software 34

Lemur: API Provides interfaces to Lemur classes that are grouped at three different levels: Utility level Common utilities, such as memory management, document parsing, etc. Indexer level Converts a raw text collection to data structures for efficient retrieval. Retrieval level Abstract classes for a general retrieval architecture and concrete classes for several specific information retrieval 35

Lemur: Indexing Multiple indexing methods for small, medium and large-scale (terabyte) collections. Built-in support for English, Chinese and Arabic text. Porter and Krovetz word stemming. Incremental indexing. 36

Lemur: Retrieval Supports major language modelling approaches such as Indri and KL-divergence, as well as vector space, tf-idf, Ocapi and InQuery Relevance- and pseudo-relevance feedback Wildcard term expansion (using Indri) Supports arbitrary document priors (e.g., Page Rank, URL depth) 37

Questions? 38