EPL 660: Lab 1 General Info, Exercise 1, B-Trees, Apache Lucene

Andreas Kamilaris, Department of Computer Science. Created by Andreas Kamilaris for EPL660.


General info
- Every Friday, 18:00-19:30. Check the course Web site for the schedule.
- Lab content: exercises, general questions, tutorials, tool demonstrations.
- Exercise deadlines: 23:59 on the delivery day.
- Email submission: kami@cs.ucy.ac.cy

Tutorials info
- Review of tools for Information Retrieval.
- Every lab session introduces a tool.
- A variety of libraries and tools:
  - Apache Lucene
  - Apache Solr
  - Apache Tika
  - Hadoop
  - Nutch

Program info
Date    Topic                    Description
28/01   Apache Lucene            Introduction to Apache Lucene; background information for B-Trees
4/02    Apache Lucene            Getting started with Apache Lucene; demonstration of a simple scenario
11/02   Apache Solr              Introduction to Apache Solr; demonstration of a simple scenario
18/02   Apache Tika              Introduction to Apache Tika; demonstration of a simple scenario
25/02   No Tutorial              Absence of assistant
4/03    Hadoop                   Background information about MapReduce; introduction to Hadoop
11/03   Hadoop                   Getting started with Hadoop; demonstration of a simple scenario
18/03   Nutch                    Background information about crawling; introduction to Nutch
25/03   No Tutorial              Public holiday
01/04   No Tutorial              Public holiday
8/04    Nutch                    Getting started with Nutch
15/04   Projects Presentations   Presentation of the students' final projects

1st Programming Exercise
- Create a document-based inverted index.
- Records have the format: term, frequency, positional posting list.
- Include stemming using the Porter Stemmer algorithm.
- Include detection of stop words.
- Search terms using B-Trees. The B-Tree must be of order 4.
- Add skip pointers to the inverted index for performance reasons.
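The record format above can be illustrated with a small sketch. It is only one possible reading of the exercise, not a model solution: the stop-word list, the tokenizer, the helper names (`build_index`, `record`) and the sample documents are assumptions, "frequency" is interpreted here as document frequency, and stemming is left out because the Porter stemmer would need an external library.

```python
import re
from collections import defaultdict

# Assumed stop-word list; the exercise does not prescribe one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map term -> {doc_id: [positions]}.

    Positions count every token, so stop words still occupy a position
    even though they are not indexed.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            if token in STOP_WORDS:
                continue
            index[token].setdefault(doc_id, []).append(pos)
    return index


def record(index, term):
    """Return (term, document frequency, positional posting list)."""
    postings = sorted(index.get(term, {}).items())
    return term, len(postings), postings


# Hypothetical sample documents.
docs = {1: "The star of the wars", 2: "A star is born", 3: "Wars and peace"}
index = build_index(docs)
print(record(index, "star"))
print(record(index, "wars"))
```

Keeping positions for every token, including skipped stop words, is a design choice that preserves the original word distances for later phrase queries.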

1st Programming Exercise
- Deadline is 8th February 2011.
- You need to include:
  - Source code with comments.
  - Executable files.
  - A brief documentation.
- E-mail submission including a zip attachment.

Introduction to B-Trees
A B-Tree of order m is an m-way tree (a tree where each node may have up to m children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree.
2. all leaves are on the same level.
3. all non-leaf nodes except the root have at least ceil(m / 2) children.
4. the root is either a leaf node, or it has from two to m children.
5. a leaf node contains no more than m - 1 keys.
B-trees are always balanced!

Why use B-Trees?
- Accessing large amounts of data in secondary memory (disk) is slow, because every node visited can cost a disk access.
- Many algorithms have been introduced to speed up search and to optimize access to data in secondary memory.
- B-Trees are effective here because each node holds many keys, which keeps the tree shallow and the number of disk accesses per search low.
- B-Trees are used in many database management systems.

An example B-Tree
A B-tree of order 5 containing 26 items (the leaf with four keys requires m - 1 >= 4):
Root:    [26]
Level 2: [6 12] [42 51 62]
Leaves:  [1 2 4] [7 8] [13 15 18 25] [27 29] [45 46 48] [53 55 60] [64 70 90]
Note that all the leaves are at the same level.

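Searching a B-Tree
Search for the item 48:
Root:    [26]
Level 2: [6 12] [42 51 62]
Leaves:  [1 2 4] [7 8] [13 15 18 25] [27 29] [45 46 48] [53 55 60] [64 70 90]
The search starts at the root: 48 > 26, so follow the right pointer to [42 51 62]. Since 42 < 48 < 51, follow the pointer between 42 and 51 to the leaf [45 46 48], where 48 is found.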
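The search for 48 can be sketched in a few lines of Python. This is an illustrative sketch only: the tuple-based node representation and the names `leaf`, `example` and `search` are assumptions, and the grouping of the 26 keys into nodes is inferred from the key ranges on the slide.

```python
from bisect import bisect_left


def leaf(*keys):
    """A node is a (keys, children) pair; a leaf has no children."""
    return (list(keys), [])


# The example tree, with node boundaries inferred from the key ranges.
example = ([26], [
    ([6, 12], [leaf(1, 2, 4), leaf(7, 8), leaf(13, 15, 18, 25)]),
    ([42, 51, 62], [leaf(27, 29), leaf(45, 46, 48),
                    leaf(53, 55, 60), leaf(64, 70, 90)]),
])


def search(node, key):
    """Return the key lists of the nodes visited if key is found, else None."""
    path = []
    while True:
        keys, children = node
        path.append(keys)
        i = bisect_left(keys, key)       # first position with keys[i] >= key
        if i < len(keys) and keys[i] == key:
            return path
        if not children:                 # reached a leaf without a match
            return None
        node = children[i]               # descend between keys[i-1] and keys[i]


print(search(example, 48))  # [[26], [42, 51, 62], [45, 46, 48]]
print(search(example, 50))  # None
```

Only one node per level is inspected, which is why the cost of a search depends on the height of the tree rather than on the total number of keys.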

Constructing a B-Tree
Suppose we start with an empty B-tree and keys arrive in the following order:
1 12 8 2 25 6 14 28 17 7 52 16 48 68 3 26 29 53 55 45
We want to construct a B-tree of order 5.
The first four items go into the root: [1 2 8 12]
Putting the fifth item in the root would violate condition 5 (at most m - 1 = 4 keys). Therefore, when 25 arrives, split the node and pick the middle key, 8, to make a new root.

Constructing a B-Tree
After the split:
Root:   [8]
Leaves: [1 2] [12 25]
6, 14, 28 get added to the leaf nodes:
Root:   [8]
Leaves: [1 2 6] [12 14 25 28]

Constructing a B-Tree
Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root) and split the leaf:
Root:   [8 17]
Leaves: [1 2 6] [12 14] [25 28]
7, 52, 16, 48 get added to the leaf nodes:
Root:   [8 17]
Leaves: [1 2 6 7] [12 14 16] [25 28 48 52]

Constructing a B-Tree
Adding 68 causes us to split the rightmost leaf, promoting 48 to the root, and adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves:
Root:   [3 8 17 48]
Leaves: [1 2] [6 7] [12 14 16] [25 26 28 29] [52 53 55 68]
Adding 45 causes a split of the leaf [25 26 28 29], and promoting 28 to the root then causes the root to split.

Constructing a B-Tree
The final tree:
Root:    [17]
Level 2: [3 8] [28 48]
Leaves:  [1 2] [6 7] [12 14 16] [25 26] [29 45] [52 53 55 68]

Guidelines for constructing a B-Tree
1. Attempt to insert the new key into a leaf by searching for the proper position.
2. If the leaf is not full, insert the key and you are done.
3. If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent.
4. If this would result in the parent becoming too big, split the parent into two, promoting the middle key.
5. This strategy might have to be repeated all the way to the top.
6. If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher.
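The guidelines above can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not a reference implementation: the names `Node`, `BTree` and `levels` are hypothetical, keys are assumed to be distinct, and the replayed key sequence uses 6 as its sixth key, as the construction slides do when they show the leaves.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for a leaf


class BTree:
    def __init__(self, order):
        self.order = order               # maximum number of children per node
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:            # guideline 6: the root itself was split
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        i = bisect_left(node.keys, key)
        if not node.children:            # guidelines 1-2: place the key in a leaf
            node.keys.insert(i, key)
        else:
            split = self._insert(node.children[i], key)
            if split is not None:        # guideline 4: absorb the promoted key
                middle, right = split
                node.keys.insert(i, middle)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.order - 1:
            return self._split(node)     # guidelines 3-5: too big, so split
        return None

    @staticmethod
    def _split(node):
        """Keep the left half in node; return (middle key, new right node)."""
        mid = len(node.keys) // 2
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right


def levels(tree):
    """Return the key lists of all nodes, grouped level by level."""
    out, level = [], [tree.root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out


tree = BTree(order=5)
for k in [1, 12, 8, 2, 25, 6, 14, 28, 17, 7, 52, 16, 48, 68, 3, 26, 29, 53, 55, 45]:
    tree.insert(k)
for row in levels(tree):
    print(row)
```

Replaying the sequence this way yields a root [17] with children [3 8] and [28 48], matching the final tree in the construction walkthrough. The sketch inserts first and splits afterwards, which mirrors the wording of the guidelines; implementations that split full nodes on the way down are also common.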

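Time complexity of a B-Tree
- Search, insert and delete each follow a single path from the root to a leaf (insert and delete may also walk back up the same path to split or merge nodes).
- The number of nodes visited is therefore proportional to the height of the tree.
- The height is O(log n), where n is the number of items in the B-Tree. The base of the logarithm is about m / 2 for a tree of order m, so the tree stays shallow even for large n.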
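One way to see the logarithmic bound is the short derivation below. It is a sketch under assumptions that are not on the slides: h denotes the number of levels (with h at least 2), t stands for ceil(m / 2), and every leaf is assumed to hold at least one key, since the definition given earlier only bounds leaf size from above.

```latex
% Assumptions: h >= 2 levels, t = \lceil m/2 \rceil, every leaf holds at least one key.
\begin{align*}
\text{nodes on level } 2 &\ge 2
  && \text{(the root has at least two children)}\\
\text{nodes on level } k &\ge 2\,t^{\,k-2}
  && \text{(every other non-leaf node has at least } t \text{ children)}\\
n &\ge 2\,t^{\,h-2}
  && \text{(at least one key per leaf on level } h\text{)}\\
h &\le 2 + \log_t \frac{n}{2}
\end{align*}
```

For example, with m = 5 (so t = 3) and n = 1,000,000 items, the bound gives h <= 2 + log_3(500,000), which is about 13.9, so such a tree has at most 13 levels.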

Tutorial 1: Apache Lucene Overview

What is Apache Lucene?
"Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java." (from http://lucene.apache.org/)

What is Apache Lucene?
- Lucene is specifically an API, not an application.
- The hard parts have been done; the easy programming has been left to you.
- You can build a search application that is specifically suited to your needs.
- You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).

Availability
- Freely available (no cost).
- Open source: Apache License, version 2.0 (http://www.apache.org/licenses/license-2.0)
- Download from: http://www.apache.org/dyn/closer.cgi/lucene/java/

Features
- Ranked searching
- Flexible queries: phrases, wildcards, etc.
- Field-specific queries, e.g. title, artist, album
- Sorting

Ranked Searching
Results are ranked by:
1. Phrase matching
2. Keyword matching
Rarer terms are preferred: Lucene takes the uniqueness of each term into account when determining a document's relevance score.
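The idea that rarer terms count for more can be illustrated with a textbook tf-idf score. This is only a sketch of the general principle and should not be read as Lucene's actual scoring formula, which is more elaborate; the sample documents and the function names `idf` and `score` are assumptions.

```python
import math
from collections import Counter

# Hypothetical sample collection.
docs = {
    "d1": "star wars a new hope",
    "d2": "star trek the motion picture",
    "d3": "a star is born",
}


def idf(term, docs):
    """Inverse document frequency: rarer terms receive a larger weight."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / df) if df else 0.0


def score(query, text, docs):
    """Sum of term frequency times idf over the query terms."""
    tf = Counter(text.split())
    return sum(tf[t] * idf(t, docs) for t in query.split())


ranked = sorted(docs, key=lambda d: score("star wars", docs[d], docs), reverse=True)
print(ranked)
```

In this collection "star" occurs in every document, so it contributes nothing to the score, while "wars" occurs in only one document and decides the ranking.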

Flexible Queries
- Phrases: "star wars"
- Wildcards: star*, Bra?il
- Ranges: {star TO stun}, [2006 TO 2007]
- Boolean operators: star AND wars
This is just a small subset of the query types that Lucene supports. Some query types, such as wildcard and range queries, can put a heavy load on the server running Lucene, so Lucene makes it easy to disable certain types of queries while allowing all others to proceed through the system. This gives programmers better control and makes system performance more predictable.

Field-specific Queries
For example: title:"star wars" AND director:"George Lucas"

Sorting
- By default, Lucene sorts results by their relevance score.
- Sorting by any other field in a Document is also supported, for example by price, release date, Amazon sales rank, etc.

Documents
A document can represent anything textual:
- Word document
- DVD (the textual metadata only)
- Website member (name, ID, etc.)
A Lucene Document need not refer to an actual file on a disk; it could also resemble a row in a relational database. Each developer is responsible for turning their own data sets into Lucene Documents. Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.

Indexes
- Lucene employs inverted indexing (like most full-text-based search engines).
- Indexes track term frequencies.
- Every term maps back to the Documents that contain it.
- This index is what allows Lucene to quickly locate every document currently associated with a given set of input search terms.
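Locating the documents that match a set of search terms comes down to intersecting the posting lists of those terms. The sketch below shows the textbook merge with skip pointers, the optimization asked for in the first exercise. It does not describe how Lucene implements this internally; the function names, the choice of one skip pointer every sqrt(length) entries, and the sample lists are assumptions.

```python
import math


def advance(postings, i, skip, target):
    """Move i forward by one skip if that cannot jump past target, else by one."""
    has_skip = i % skip == 0 and i + skip < len(postings)
    if has_skip and postings[i + skip] <= target:
        return i + skip
    return i + 1


def intersect_with_skips(p1, p2):
    """Intersect two sorted lists of document ids."""
    # Skip pointers sit at every multiple of the skip length.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer


# Hypothetical posting lists for two terms.
first = [2, 4, 8, 16, 19, 23, 28, 43]
second = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(first, second))  # [2, 8]
```

A skip is only taken when the entry it lands on is not larger than the current entry of the other list, so no common document id can be jumped over.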

Basic Indexing
An index consists of one or more Lucene Documents.
1. Create a Document:
   - A Document consists of one or more fields; each field is a name-value pair.
   - Example: a field commonly found in applications is title. In the case of a title field, the field name is "title" and the value is the title of that item.
   - Add one or more fields to the Document.
2. Add the Document to an index:
   - Indexing involves adding Documents to an IndexWriter.
3. The indexer will analyze the Document:
   - We can provide specialized analyzers such as StandardAnalyzer.
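The flow of these steps (build a document from named fields, add it to an index, let the text be analyzed into terms) can be mimicked with a toy in-memory index. The class below is a hypothetical stand-in written for illustration; it is not the Lucene API, and its names (`ToyIndex`, `add_document`, `search`) and sample data are assumptions.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for an index; not the Lucene API."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)   # (field, term) -> set of doc ids

    def add_document(self, fields):
        """Store a document given as {field name: value} and index its terms."""
        doc_id = len(self.docs)
        self.docs.append(fields)
        for name, value in fields.items():
            for term in value.lower().split():   # stands in for the analysis step
                self.postings[(name, term)].add(doc_id)
        return doc_id

    def search(self, field, text):
        """Return ids of documents whose field contains every term in text."""
        terms = text.lower().split()
        if not terms:
            return []
        sets = [self.postings.get((field, t), set()) for t in terms]
        return sorted(set.intersection(*sets))


index = ToyIndex()
index.add_document({"title": "Star Wars", "director": "George Lucas"})
index.add_document({"title": "Star Trek", "director": "Robert Wise"})
index.add_document({"title": "American Graffiti", "director": "George Lucas"})

by_title = index.search("title", "star wars")
by_director = index.search("director", "George Lucas")
both = sorted(set(by_title) & set(by_director))
print(by_title, by_director, both)
```

Keying the postings by (field, term) is what makes a field-specific query possible: the same word indexed under title and under director ends up in two separate posting sets.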

Analyzing
- Analyzers control how the text is broken into terms, which are then used to index the document.
- Analyzers can be used to remove stop words, and they can also perform stemming.
- Lucene comes with a default analyzer which works well for unstructured English text; however, it often performs incorrect normalizations on non-English texts.
- Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own.
- Lucene even includes a number of stemming algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.
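The building-block idea can be pictured as a tokenizer followed by a chain of filters. The sketch below is a conceptual illustration only and does not use Lucene classes: the function names, the stop-word list and the filter order are assumptions, and `naive_stem` is a crude placeholder that merely strips a plural "s", not the Porter stemmer.

```python
import re

# Assumed stop-word list for illustration.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(tokens):
    """Strip one trailing 's' from longer tokens (placeholder, not Porter)."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def analyze(text, filters=(remove_stop_words, naive_stem)):
    """Tokenize the text, then apply each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Wars of the Stars"))   # ['war', 'star']
print(analyze("The glass is empty"))      # ['glass', 'empty']
```

Passing a different tuple of filters changes the behavior without touching the tokenizer, which is the kind of composition a custom analyzer relies on. The same analysis has to be applied to queries as to documents, otherwise the terms will not match.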

Basic Searching
Searching requires an index to have already been built.
1. Create a Query:
   - Usually via QueryParser, which parses user input; Query classes such as MultiPhraseQuery can also be constructed directly.
2. Open an index.
3. Search the index:
   - E.g. via IndexSearcher.
   - Use the same Analyzer as at indexing time.
4. Iterate through the returned Documents:
   - Extract the needed results.
   - Extract the result scores (if needed).

Lucene as a Web Service
1. Design an HTTP query syntax:
   - GET queries
   - XML for results
2. Wrap Tomcat around the core code:
   - Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies.
3. Write a client library.

Scalability Limits
3 main scalability factors:
- Query rate
- Index size
- Update rate

Query Rate Scalability
- Lucene is already fast:
  - Built-in simple cache mechanism.
- Easy solution for heavy workloads:
  - Add more query servers behind a load balancer.
  - Can grow as your traffic grows.

Index Size Scalability
- Can easily handle millions of documents:
  - Lucene is very commonly deployed in systems with tens of millions of documents.
  - Although query performance can degrade as more documents are added to the index, the growth factor is very low.
  - The main limits related to index size that you are likely to run into are disk capacity and disk I/O.
- If you need a bigger index:
  - Built-in methods allow queries to span multiple remote Lucene indexes.
  - Multiple remote indexes can be merged at query time.

Lucene Installation
1. Download the latest version of Lucene (v3.0.3) from: http://www.apache.org/dyn/closer.cgi/lucene/java/
2. Add the files lucene-core-{version}.jar and lucene-demos-{version}.jar to your Java CLASSPATH.
3. Start programming!
4. (Optional step) Go to the Lucene-{version}/src/demo/org/apache/lucene/demo directory and start editing the files IndexFiles.java and SearchFiles.java.

Useful Info
- Official Apache Lucene site: http://lucene.apache.org/java/docs/
- Lucene-java Wiki: http://wiki.apache.org/lucenejava/frontpage?action=show&redirect=frontpageen
- Lucene Intro (java.net): http://today.java.net/pub/a/today/2003/07/30/luceneintro.html
- Lucene Tutorial.com: http://www.lucenetutorial.com/