EPL 660: Lab 1 General Info, Exercise 1, B-Trees, Apache Lucene


Andreas Kamilaris
Department of Computer Science
Created by Andreas Kamilaris for EPL660

Research on the Web of Things

General info
- Labs take place every Friday, 18:00-19:30. Check the course Web site for the schedule.
- Lab content: exercises, general questions, tutorials, tool demonstrations.
- Exercise deadlines: 23:59 on the delivery day.
- Email submission: kami@cs.ucy.ac.cy

Tutorials info
- Review of tools for Information Retrieval.
- Every lab session introduces a tool.
- A variety of libraries and tools:
  - Apache Lucene
  - Apache Solr
  - Apache Tika
  - Hadoop
  - Nutch

Program info

Date    Topic                    Description
28/01   Apache Lucene            Introduction to Apache Lucene; Background Information for B-Trees
4/02    Apache Lucene            Getting Started with Apache Lucene; Demonstration of a simple scenario
11/02   Apache Solr              Introduction to Apache Solr; Demonstration of a simple scenario
18/02   Apache Tika              Introduction to Apache Tika; Demonstration of a simple scenario
25/02   No Tutorial              Absence of Assistant
4/03    Hadoop                   Background information about MapReduce; Introduction to Hadoop
11/03   Hadoop                   Getting Started with Hadoop; Demonstration of a simple scenario
18/03   Nutch                    Background Information about Crawling; Introduction to Nutch
25/03   No Tutorial              Public Holiday
01/04   No Tutorial              Public Holiday
8/04    Nutch                    Getting Started with Nutch
15/04   Projects Presentations   Presentation of the students' final project

1st Programming Exercise
- Create a doc-based inverted index.
- Records have the format: term, frequency, positional posting list.
- Include stemming using the Porter Stemmer algorithm.
- Include detection of stop-words.
- Search terms using B-Trees. The B-Tree must be of order 4.
- Add skip pointers to the inverted index for performance reasons.
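A minimal Python sketch of the record format listed above. It assumes that "frequency" means the number of documents containing the term (the slide does not say which frequency is meant) and uses a tiny illustrative stop-word list. Stemming is left out; a Porter stemmer would be applied to each token before it is indexed. The names `build_index` and `record` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Positions count every token, so stop words still occupy a slot.
        for pos, token in enumerate(re.findall(r"\w+", text.lower())):
            if token in STOP_WORDS:
                continue
            index[token].setdefault(doc_id, []).append(pos)
    return index


def record(index, term):
    """Return (term, document frequency, positional posting list)."""
    postings = index.get(term, {})
    return term, len(postings), sorted(postings.items())


docs = ["to be or not to be", "the be all and end all"]
index = build_index(docs)
print(record(index, "be"))
```

Each posting keeps the document ID together with the token positions inside that document, which is what phrase queries need later on.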

1st Programming Exercise
- Deadline is 8th February 2011.
- You need to include:
  - Source code with comments.
  - Executable files.
  - A brief documentation.
- E-mail submission including a zip attachment.

Introduction to B-Trees
A B-Tree of order m is an m-way tree (a tree where each node may have up to m children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree;
2. all leaves are on the same level;
3. all non-leaf nodes except the root have at least ⌈m / 2⌉ children;
4. the root is either a leaf node, or it has from two to m children;
5. a leaf node contains no more than m − 1 keys.
B-trees are always balanced!

Why use B-Trees
- Accessing a large amount of data on secondary memory is slow.
- Many algorithms were introduced to make search faster by reducing the number of accesses to secondary memory.
- B-Trees are effective here: each node holds many keys, so a search reads only a few nodes.
- B-Trees are used in many database management systems.

An example B-Tree
A B-tree of order 5 containing 26 items:

                                [26]
              [6 12]                            [42 51 62]
[1 2 4] [7 8] [13 15 18 25]     [27 29] [45 46 48] [53 55 60] [64 70 90]

Note that all the leaves are at the same level.

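Searching a B-Tree
Search for the item #48:

                                [26]
              [6 12]                            [42 51 62]
[1 2 4] [7 8] [13 15 18 25]     [27 29] [45 46 48] [53 55 60] [64 70 90]

48 is greater than 26, so the search moves to the right child [42 51 62]; 48 lies between 42 and 51, so it continues in the leaf [45 46 48], where 48 is found.
Note that all the leaves are at the same level.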
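The search on this slide can be sketched in a few lines of Python. The tree is hard-coded as nested `(keys, children)` tuples, which is only an illustrative representation; the names `leaf`, `TREE` and `search` are hypothetical.

```python
from bisect import bisect_left


def leaf(*keys):
    """A leaf is a node with keys and no children."""
    return (list(keys), [])


# The example tree from the slide: root [26] with two internal children.
TREE = ([26], [
    ([6, 12], [leaf(1, 2, 4), leaf(7, 8), leaf(13, 15, 18, 25)]),
    ([42, 51, 62], [leaf(27, 29), leaf(45, 46, 48),
                    leaf(53, 55, 60), leaf(64, 70, 90)]),
])


def search(node, key):
    """Return (found, path), where path lists the keys of each visited node."""
    path = []
    while True:
        keys, children = node
        path.append(keys)
        i = bisect_left(keys, key)        # first position with keys[i] >= key
        if i < len(keys) and keys[i] == key:
            return True, path
        if not children:                  # reached a leaf without a match
            return False, path
        node = children[i]                # descend into the i-th subtree


print(search(TREE, 48))
```

Only one node per level is inspected, so the search for 48 touches three nodes out of ten.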

Constructing a B-Tree
Suppose we start with an empty B-tree and keys arrive in the following order:
1 12 8 2 25 6 14 28 17 7 52 16 48 68 3 26 29 53 55 45
We want to construct a B-tree of order 5.
The first four items go into the root:

[1 2 8 12]

Putting the fifth item in the root would violate condition 5.
Therefore, when 25 arrives, pick the middle key to make a new root.

Constructing a B-Tree

       [8]
[1 2]     [12 25]

6, 14, 28 get added to the leaf nodes:

          [8]
[1 2 6]     [12 14 25 28]

Constructing a B-Tree
Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root) and split the leaf:

              [8 17]
[1 2 6]     [12 14]     [25 28]

7, 52, 16, 48 get added to the leaf nodes:

                   [8 17]
[1 2 6 7]     [12 14 16]     [25 28 48 52]

Constructing a B-Tree
Adding 68 causes us to split the rightmost leaf, promoting 48 to the root, and adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves:

                      [3 8 17 48]
[1 2]   [6 7]   [12 14 16]   [25 26 28 29]   [52 53 55 68]

Adding 45 causes a split of the leaf [25 26 28 29]; promoting 28 to the root then causes the root to split.

Constructing a B-Tree

                          [17]
          [3 8]                        [28 48]
[1 2]   [6 7]   [12 14 16]     [25 26]   [29 45]   [52 53 55 68]

Guidelines for constructing a B-Tree
1. Attempt to insert the new key into a leaf by searching for the proper position.
2. If the leaf is not full, then insert the key and you are done.
3. If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent.
4. If this would result in the parent becoming too big, split the parent into two, promoting the middle key.
5. This strategy might have to be repeated all the way to the top.
6. If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher.
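The guidelines above can be sketched as a small Python class. This is one possible way to code them, assuming distinct keys and an "insert first, split on overflow" strategy; `Node`, `BTree` and `levels` are hypothetical names. The demo inserts the keys that appear in the construction diagrams (with 6 as the sixth key) into a tree of order 5.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # an empty list marks a leaf


class BTree:
    def __init__(self, order):
        self.order = order               # a node holds at most order - 1 keys
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:            # guideline 6: the root itself was split
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        if not node.children:            # guidelines 1-2: place the key in a leaf
            insort(node.keys, key)
        else:
            i = bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split is not None:        # guideline 4: absorb the promoted key
                middle, right = split
                node.keys.insert(i, middle)
                node.children.insert(i + 1, right)
        if len(node.keys) < self.order:
            return None                  # node is not over-full, nothing to do
        # Guideline 3: split the over-full node and promote its middle key.
        mid = len(node.keys) // 2
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right


def levels(tree):
    """Return the keys of every node, grouped level by level."""
    out, frontier = [], [tree.root]
    while frontier:
        out.append([n.keys for n in frontier])
        frontier = [c for n in frontier for c in n.children]
    return out


tree = BTree(order=5)
for k in [1, 12, 8, 2, 25, 6, 14, 28, 17, 7, 52, 16, 48, 68, 3, 26, 29, 53, 55, 45]:
    tree.insert(k)
for level in levels(tree):
    print(level)
```

Splits propagate upwards through the return value of `_insert`, so the tree only grows in height when the root is split.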

Time complexity of a B-Tree
- Search, insert and delete each visit at most the nodes on one path from the root to a leaf.
- The number of nodes visited is therefore no more than the height of the tree.
- The height of the tree is O(log n), where n is the number of items in the B-Tree.
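A rough numeric illustration of why the height grows only logarithmically. The sketch counts levels (a single root is one level). The upper bound assumes the common minimum-fill rule that every node except the root holds at least ceil(m/2) - 1 keys, which goes slightly beyond what the definition slide states; the function names are hypothetical.

```python
def max_keys(order, levels):
    """Most keys a B-tree of this order can hold in the given number of levels."""
    return order ** levels - 1


def min_keys(order, levels):
    """Fewest keys with that many levels, assuming every non-root node holds
    at least ceil(order / 2) - 1 keys (an assumption, see the text above)."""
    t = -(-order // 2)                   # ceil(order / 2)
    return 2 * t ** (levels - 1) - 1


def min_levels(order, n_keys):
    """Smallest number of levels that can hold n_keys."""
    levels = 0
    while max_keys(order, levels) < n_keys:
        levels += 1
    return levels


def max_levels(order, n_keys):
    """Largest number of levels a tree with n_keys (at least one) can have."""
    levels = 1
    while min_keys(order, levels + 1) <= n_keys:
        levels += 1
    return levels


for n in (26, 1_000_000):
    print(n, min_levels(5, n), max_levels(5, n))
```

Under these assumptions a B-tree of order 4 with one million keys has between 10 and 19 levels, so a search reads at most 19 nodes.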

Tutorial 1: Apache Lucene Overview
Department of Computer Science

What is Apache Lucene?
"Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java."
(from http://lucene.apache.org/)

What is Apache Lucene?
- Lucene is specifically an API, not an application.
- Hard parts have been done, easy programming has been left to you.
- You can build a search application that is specifically suited to your needs.
- You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).

Availability
- Freely available (no cost).
- Open Source.
- Apache License, version 2.0: http://www.apache.org/licenses/license-2.0
- Download from: http://www.apache.org/dyn/closer.cgi/lucene/java/

Features
- Ranked Searching
- Flexible Queries: phrases, wildcards, etc.
- Field-specific Queries, e.g. title, artist, album
- Sorting

Ranked Searching
1. Phrase Matching
2. Keyword Matching
   - Prefers more unique terms first.
   - Takes into account the uniqueness of each term when determining a document's relevance score.
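To illustrate what "uniqueness of each term" can mean, here is a generic tf-idf sketch in Python. It is not Lucene's actual scoring formula; `idf_table` and `rank` are hypothetical names and the three tiny documents are made up for the example.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency: terms found in few documents score higher."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def rank(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    idf = idf_table(docs)
    scores = []
    for doc_id, doc in enumerate(docs):
        tf = Counter(doc.split())
        score = sum(tf[t] * idf.get(t, 0.0) for t in query.split())
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = ["star wars", "star trek", "star gate"]
print(rank("star wars", docs))
```

Because "star" occurs in every document it contributes nothing to the score, while the rarer "wars" decides the ranking.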

Flexible Queries
- Phrases: "star wars"
- Wildcards: star*, Bra?il
- Ranges: {star-stun} (exclusive bounds), [2006-2007] (inclusive bounds)
- Boolean Operators: star AND wars

This is just a small subset of the types of queries that Lucene can support. Some query types, such as wildcard and range queries, have the potential to cause heavy load on the Lucene server, so Lucene makes it easy to disable certain types of queries while allowing all others to proceed through the system. This gives programmers better control and allows the system performance to be more predictable.
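The meaning of the wildcard and range examples can be mimicked with plain Python string operations. This only illustrates the matching semantics (`*` for any run of characters, `?` for exactly one character, inclusive versus exclusive range bounds compared as strings); it does not use or reproduce Lucene's query parser, and `in_range` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def in_range(term, low, high, inclusive):
    """Compare terms as strings, with or without the end points."""
    if inclusive:
        return low <= term <= high
    return low < term < high


# Wildcards: '*' matches any run of characters, '?' exactly one character.
print(fnmatchcase("starship", "star*"))
print(fnmatchcase("brazil", "bra?il"), fnmatchcase("brail", "bra?il"))

# Ranges: the inclusive form contains its end points, the exclusive form does not.
print(in_range("2006", "2006", "2007", inclusive=True))
print(in_range("star", "star", "stun", inclusive=False))
```

A wildcard or range query may expand to a very large number of index terms, which is one way to see why the slide warns about heavy load.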

Field-specific Queries
For example:
title:"star wars" AND director:"George Lucas"

Sorting
- Can sort by any field in a Document, for example by Price, Release Date, Amazon Sales Rank, etc.
- By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.

Documents
- A document can represent anything textual:
  - Word Document
  - DVD (the textual metadata only)
  - Website Member (name, ID, etc.)
- A Lucene Document need not refer to an actual file on a disk; it could also represent a row in a relational database.
- Each developer is responsible for turning their own data sets into Lucene Documents.
- Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.

Indexes
- Lucene employs inverted indexing (like most full-text-based search engines).
- Indexes track term frequencies.
- Every term maps back to the Documents that contain it.
- This index is what allows Lucene to quickly locate every document currently associated with a given set of input search terms.
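Locating the documents associated with a set of search terms comes down to intersecting the sorted posting lists of those terms. The Python sketch below shows the classic intersection with skip pointers, as asked for in the first programming exercise. It is a textbook technique, not a description of Lucene's internals. Placing a skip pointer every sqrt(length) entries is a common heuristic and is assumed here; `intersect_with_skips` is a hypothetical name.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted lists of distinct doc IDs.

    Skip pointers are simulated: every position that is a multiple of the
    skip length can jump ahead by that length.
    """
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result


star = [2, 4, 8, 16, 19, 23, 28, 43]
wars = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(star, wars))
```

A skip is taken only when the entry it lands on is still no larger than the current entry of the other list, so no common document can be jumped over.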

Basic Indexing
An index consists of one or more Lucene documents.
1. Create a document:
   - A document consists of one or more fields; a field is a name-value pair.
   - Example: a field commonly found in applications is title. In the case of a title field, the field name is title and the value is the title of that item.
   - Add one or more fields to the document.
2. Add the document to an index:
   - Indexing involves adding documents to an IndexWriter.
3. The indexer will analyze the Document:
   - We can provide specialized analyzers such as StandardAnalyzer.

Analyzing
- Analyzers control how the text is broken into terms, which are then used to index the document.
- Analyzers can be used to remove stop words and to perform stemming.
- Lucene comes with a default analyzer which works well for unstructured English text; however, it often performs incorrect normalizations on non-English texts.
- Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own.
- Lucene even includes a number of stemming algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.
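What an analyzer does can be shown with a toy pipeline in Python: tokenize, lower-case, drop stop words, then stem. This is not Lucene's StandardAnalyzer and the suffix stripping is a deliberately crude stand-in for a real stemmer such as Porter's; the stop-word list and the names `crude_stem` and `analyze` are assumptions made for the example.

```python
import re

# Illustrative stop-word list (an assumption, not Lucene's default list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def crude_stem(token):
    """Strip one common English suffix; far simpler than the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Break text into lower-cased terms, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The indexing of documents"))
print(analyze("Searching, searched, searches"))
```

The same analysis has to be applied to documents at indexing time and to queries at search time, otherwise the terms will not match.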

Basic Searching
Searching requires an index to have already been built.
1. Create a Query:
   - Usually via QueryParser, which parses user input, or by building query objects such as MultiPhraseQuery directly.
2. Open an Index.
3. Search the Index:
   - E.g. via IndexSearcher.
   - Use an Analyzer (as before).
4. Iterate through the returned Documents:
   - Extract the needed results.
   - Extract the result scores (if needed).
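The overall flow (build documents from fields, add them to an index, query a field, iterate over scored hits) can be mimicked with a toy in-memory index in Python. `ToyIndex` and its methods are hypothetical and are not Lucene's API; in Lucene the corresponding roles are played by the IndexWriter, QueryParser and IndexSearcher classes named on the slides. The score here is a plain count of matching terms.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for an index; not Lucene's API."""

    def __init__(self):
        self.docs = []
        # (field name, term) -> {doc_id: term frequency}
        self.postings = defaultdict(lambda: defaultdict(int))

    def add_document(self, fields):
        """Index a document given as a dict of field name -> text value."""
        doc_id = len(self.docs)
        self.docs.append(fields)
        for name, value in fields.items():
            for term in value.lower().split():
                self.postings[(name, term)][doc_id] += 1
        return doc_id

    def search(self, field, text):
        """Return (document, score) pairs, best score first."""
        scores = defaultdict(int)
        for term in text.lower().split():
            for doc_id, tf in self.postings.get((field, term), {}).items():
                scores[doc_id] += tf
        hits = sorted(scores.items(), key=lambda hit: (-hit[1], hit[0]))
        return [(self.docs[doc_id], score) for doc_id, score in hits]


index = ToyIndex()
index.add_document({"title": "Star Wars", "director": "George Lucas"})
index.add_document({"title": "Star Trek", "director": "Robert Wise"})
for doc, score in index.search("title", "star wars"):
    print(score, doc["title"])
```

Keeping the field name as part of the posting key is what makes field-specific queries possible: a search in title never matches text stored under director.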

Lucene as a Web Service
1. Design an HTTP query syntax:
   - GET queries
   - XML for results
2. Wrap Tomcat around the core code:
   - Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies.
3. Write a Client Library.

Scalability Limits
3 main scalability factors:
- Query Rate
- Index Size
- Update Rate

Query Rate Scalability
- Lucene is already fast:
  - Built-in simple cache mechanism.
- Easy solution for heavy workloads:
  - Add more query servers behind a load balancer.
  - Can grow as your traffic grows.

Index Size Scalability
- Can easily handle millions of documents.
- Lucene is very commonly deployed into systems with tens of millions of documents.
- Although query performance can degrade as more documents are added to the index, the growth factor is very low.
- The main limits related to index size that you are likely to run into are disk capacity and disk I/O limits.
- If you need a bigger index:
  - Built-in methods allow queries to span multiple remote Lucene indexes.
  - Can merge multiple remote indexes at query time.
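Merging at query time can be pictured as combining the ranked hit lists returned by several indexes into one global top-k list. The Python sketch below shows only that merge step; it does not use Lucene's own multi-index classes, it assumes that scores from different indexes are directly comparable, and `merge_top_k` and the sample hits are made up for the example.

```python
import heapq
from itertools import islice


def merge_top_k(shard_results, k):
    """Merge per-index hit lists, each sorted by descending score, into a top-k list."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


# Hypothetical (score, document id) hits returned by two separate indexes.
index_a = [(0.9, "a1"), (0.4, "a2")]
index_b = [(0.8, "b1"), (0.7, "b2"), (0.1, "b3")]
print(merge_top_k([index_a, index_b], 3))
```

Since every input list is already sorted, the merge reads only as many hits as it needs to fill the top k.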

Lucene Installation
1. Download the latest version of Lucene (v3.0.3) from: http://www.apache.org/dyn/closer.cgi/lucene/java/
2. Add the files lucene-core-{version}.jar and lucene-demos-{version}.jar to your Java CLASSPATH.
3. Start programming!
4. (Optional step) Go to the Lucene-{version}/src/demo/org/apache/lucene/demo directory and start editing the files IndexFiles.java and SearchFiles.java.

Useful Info
- Official Apache Lucene site: http://lucene.apache.org/java/docs/
- Lucene-java Wiki: http://wiki.apache.org/lucenejava/frontpage?action=show&redirect=frontpageen
- Lucene Intro (java.net): http://today.java.net/pub/a/today/2003/07/30/luceneintro.html
- Lucene Tutorial.com: http://www.lucenetutorial.com/