Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications




Berlin Buzzwords 2011, Dr. Christoph Goller, IntraFind AG

Outline
- IntraFind AG
- Indexing
- Morphological Analysis
  - What is it?
  - How can it improve search?
  - How is it implemented?
- Named Entity Recognition
  - What is it?
  - How can it improve search?
  - How is it implemented?

A few words about IntraFind
- Founding of the company: October 2000
- More than 700 customers, mainly in Germany, Austria, and Switzerland
- Employees: 20
- Lucene Committers: B. Messer, C. Goller
- Our Open Source Search Business:
  - Main Product: ifinder Enterprise Search
  - Flexibility
  - Support & Services
  - Relevancy, Text Analytics, Information Extraction
- Linguistic Analyzers for most European Languages
- Named Entity Recognition
- Text Classification
- Clustering
- www.intrafind.de/jobs

Inverted Index
- Stores an alphabetically sorted list of all terms, with postings / positions for each term
- Documents:
    T0 = "it is what it is"
    T1 = "what is it"
    T2 = "it is a banana"
- Inverted Index (each entry is a (document, position) pair):
    "a": {(2, 2)}
    "banana": {(2, 3)}
    "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
    "what": {(0, 2), (1, 0)}
- Efficient term queries by iterating through posting lists
- Efficient near queries (e.g. phrase queries, span queries) using token positions
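The index on this slide can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, assuming whitespace tokenization; the function names build_positional_index and phrase_match are made up for illustration and are not Lucene APIs.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to a list of (doc_id, position) postings, terms sorted alphabetically."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: tokens are separated by whitespace, no further analysis.
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(sorted(index.items()))


def phrase_match(index, phrase):
    """Return ids of documents in which the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    posting_sets = [set(index[term]) for term in terms]
    hits = set()
    for doc_id, pos in posting_sets[0]:
        # Every following term must sit exactly `offset` positions after the first one.
        if all((doc_id, pos + offset) in posting_sets[offset]
               for offset in range(1, len(terms))):
            hits.add(doc_id)
    return sorted(hits)


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_positional_index(docs)
print(index["is"])
print(phrase_match(index, "it is"))
```

A term query only has to walk one posting list; the phrase query additionally compares positions, which is why positions are stored next to the document ids.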

Analysis / Tokenization
- Break stream of characters into tokens / terms
- Normalization (e.g. case)
- Stop Words
- Stemming
- Lemmatizer / Decomposer
- Part of Speech Tagger
- Information Extraction

Morphological Analyzer vs. Stemming
- Morphological Analyzer:
  - Lemmatizer: maps words to their base forms
      English                          German
      going    -> go (Verb)            lief    -> laufen (Verb)
      bought   -> buy (Verb)           rannte  -> rennen (Verb)
      bags     -> bag (Noun)           Bücher  -> Buch (Noun)
      bacteria -> bacterium (Noun)     Taschen -> Tasche (Noun)
  - Decomposer: decomposes compound words into their parts
      Kinderbuch (children's book) -> Kind (Noun) + Buch (Noun)
      Versicherungsvertrag (insurance contract) -> Versicherung (Noun) + Vertrag (Noun)
      Holztisch (wooden table), Glastisch (glass table)
- Stemmer: usually a simple algorithm
      going  -> go
      king   -> k ?
      Messer -> mess ?
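The contrast can be sketched in Python. Both parts are toy assumptions: the suffix list is a made-up rule set, not a real stemming algorithm, and the lexicon holds only the slide's examples plus two illustrative entries for "king" and "Messer".

```python
# Hypothetical suffix rules, chosen only to reproduce the failure cases on the slide.
SUFFIXES = ("ing", "er", "s")

# Toy lexicon: full form -> (base form, part of speech).
LEXICON = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "bags": ("bag", "Noun"),
    "bacteria": ("bacterium", "Noun"),
    "lief": ("laufen", "Verb"),
    "Bücher": ("Buch", "Noun"),
    "king": ("king", "Noun"),      # illustrative entry, not from the slide
    "Messer": ("Messer", "Noun"),  # illustrative entry, not from the slide
}


def naive_stem(word):
    """Strip the first matching suffix without consulting any lexicon."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix):
            return lowered[: -len(suffix)]
    return lowered


def lemmatize(word):
    """Look the full form up in the lexicon; unknown words are returned unchanged."""
    return LEXICON.get(word, (word, None))


for word in ("going", "king", "Messer", "bacteria"):
    print(word, naive_stem(word), lemmatize(word))
```

The stemmer gets "going" right but also cuts "king" down to "k" and "Messer" to "mess", while the lexicon lookup leaves both nouns intact and still maps irregular forms such as "bacteria".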

Advantages of Morphological Analysis
- Combines high Recall with high Precision for Search Applications (Demo)
- Improves subsequent statistical methods
- Better suited as descriptions for semantic search / faceting / clustering than artificial stems (Demo)
- Reliable lookup in lexicon resources
  - Thesaurus / Ontologies
  - Cross-lingual search

Measuring Recall and Precision
- Nouns: Recall / Precision macro average for 40 German nouns, no compound analysis
- Verbs: Recall / Precision macro average for 30 German verbs, no compound analysis
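A macro average computes recall and precision per query word and then averages those values, so every word counts equally. The sketch below shows that calculation under common definitions; the document ids are invented for illustration and are not the measurements from the talk.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for one query, given retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def macro_average(per_query):
    """Average per-query recall and precision; per_query holds (retrieved, relevant) pairs."""
    if not per_query:
        raise ValueError("need at least one query")
    scores = [recall_precision(retrieved, relevant) for retrieved, relevant in per_query]
    recalls, precisions = zip(*scores)
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)


# Invented example: two query words with their retrieved and relevant documents.
queries = [
    ({1, 2, 3, 4}, {1, 2}),  # finds everything relevant, but half the hits are wrong
    ({5}, {5, 6}),           # every hit is right, but one relevant document is missed
]
print(macro_average(queries))
```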

Implementing a Lemmatizer / Decomposer
- Mapping inflected forms to base forms for all lemmas of a language
- Finite State Techniques
- German lexicon: about 100,000 base forms, 700,000 inflected forms
- Decomposition done algorithmically:
    Word                   Correct analysis       Spurious alternative
    Gipfelsturm            Gipfel+Sturm           Gipfel+Turm
    Staffelei              no decomposition       Staffel+Ei
    Leistungen             no decomposition       Leis(e)+tun+Gen
    Messerattentat         Messer+Attentat        Messe+Ratten+Tat
    Bundessteuerbehörde    Bund+Steuer+Behörde    Bund+ess+teuer+Behörde
- Available Languages: German, English, Spanish, French, Italian, Dutch, Russian, Polish, Serbo-Croatian, Greek
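The ambiguity in these examples shows up as soon as candidate splits are enumerated against a lexicon. The following Python sketch is not the finite-state implementation mentioned on the slide; it is a plain recursive search over a toy lexicon, assuming that only an optional linking "s" may appear between parts.

```python
# Toy lexicon of lower-cased base forms; a real lexicon has about 100,000 entries.
LEXICON = {
    "gipfel", "sturm", "turm",
    "staffelei", "staffel", "ei",
    "messer", "attentat", "messe", "ratten", "tat",
}


def decompose(word, lexicon, linkers=("", "s")):
    """Enumerate all ways to split `word` into lexicon entries.

    Between two parts an optional linking element (assumed here: nothing or "s")
    may be skipped. The result is unranked and includes the unsplit word
    whenever it is itself a lexicon entry.
    """
    word = word.lower()
    results = []

    def walk(start, parts):
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part not in lexicon:
                continue
            if end == len(word):
                results.append(parts + [part])
                continue
            for linker in linkers:
                # Skip the linker only if something is left to analyze after it.
                if word.startswith(linker, end) and end + len(linker) < len(word):
                    walk(end + len(linker), parts + [part])

    walk(0, [])
    return results


for compound in ("Gipfelsturm", "Staffelei", "Messerattentat"):
    print(compound, decompose(compound, LEXICON))
```

The enumeration returns both the intended and the spurious analyses, so a usable decomposer still needs a selection step, for example preferring a whole-word lexicon entry such as "Staffelei" over any split.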

Lemmatizer / Decomposer and the Inverted Index
- Index stores full forms, base forms and compound parts
- Wildcard and Fuzzy Queries work on full forms
- Different Search Modes:
  - Exact
  - Base form search
  - Compound search
  - Boosting
- Different term types distinguished by prefixes
- All terms from one original token are on the same position:
    Versicherungsverträge ->
      T_Versicherungsverträge
      B_NOUN_versicherungsvertrag
      C_NOUN_versicherung
      C_NOUN_vertrag
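The prefix scheme can be sketched as follows. The analyses are hard-coded stand-ins for the output of a morphological analyzer, and the mapping of search modes to prefixes (T_ for exact, B_ for base form, C_ for compound part) is an assumption read off the example terms, not a documented ifinder interface.

```python
from collections import defaultdict

# Hypothetical analyzer output: full form -> (part of speech, base form, compound parts).
ANALYSES = {
    "Versicherungsverträge": ("NOUN", "versicherungsvertrag", ["versicherung", "vertrag"]),
    "Verträge": ("NOUN", "vertrag", []),
}


def index_terms(token):
    """All index terms for one token: full form, base form and compound parts."""
    terms = ["T_" + token]
    analysis = ANALYSES.get(token)
    if analysis:
        pos, base, parts = analysis
        terms.append(f"B_{pos}_{base}")
        terms.extend(f"C_{pos}_{part}" for part in parts)
    return terms


def build_index(docs):
    """docs is a list of token lists; all terms of a token share that token's position."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for position, token in enumerate(tokens):
            for term in index_terms(token):
                index[term].append((doc_id, position))
    return index


docs = [["Neue", "Versicherungsverträge"], ["Alte", "Verträge"]]
index = build_index(docs)
print(index_terms("Versicherungsverträge"))
print(index["B_NOUN_vertrag"], index["C_NOUN_vertrag"])
```

Because all variants share one position, phrase and span queries keep working whichever term type a query uses, and a query can weight the term types differently.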

Information Extraction: Named Entity Recognition
- Goal: Extract Named Entities from text
- Person-Name: Names of Persons (first and second name)
- Person-Role: Profession, Position Title
- Person: Person-Role + Person-Name
- Organization: Org_Finance, Org_Political, Org_Company, Org_Association, Org_Official, Org_Education
- Location: City, Region, Country, Continent, Nature

Information Extraction: Named Entity Recognition
- Date, Time (absolute, relative range)
- Currency / Prices
- Metric Value (diverse numbers): Weight; Length, Height, Width; Area; Volume; Frequency; Speed; Temperature
- Address and Communication:
  - complete Address including components (ZIP-Code, Street, Number, City)
  - Phone, Fax, E-Mail, URL
- Customer-Specific, e.g.: Raw Materials, Products, GIS-Coordinates, Skills

Named Entities: Applications in Search
- Applications:
  - Facets (Demo)
  - Search for Experts
  - Additional query types
- Index structure: additional tokens on the same position:
    N_PersonName
    N_Peter Müller
    N_Peter
    N_Müller
- Search for a person named "Brown"
- Search for a company near "founded" and "Bill Gates"
- Question Answering / Natural Language Queries (Demo)
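A query such as "a person named Brown" can be answered by intersecting the postings of the type token and the name token. The Python sketch below assumes entities arrive as (type, surface form, position) triples from an upstream recognizer; the helper names and the sample documents are invented for illustration.

```python
from collections import defaultdict


def entity_terms(entity_type, surface):
    """Index tokens for one entity: its type, its full surface form, and each name part."""
    terms = [f"N_{entity_type}", f"N_{surface}"]
    parts = surface.split()
    if len(parts) > 1:
        terms.extend(f"N_{part}" for part in parts)
    return terms


def build_entity_index(docs):
    """docs is a list of entity lists; each entity is (entity_type, surface, position)."""
    index = defaultdict(set)
    for doc_id, entities in enumerate(docs):
        for entity_type, surface, position in entities:
            for term in entity_terms(entity_type, surface):
                index[term].add((doc_id, position))
    return index


def search_entity(index, entity_type, name):
    """Documents where a token of the given type and the given name share a position."""
    postings = index.get(f"N_{entity_type}", set()) & index.get(f"N_{name}", set())
    return sorted({doc_id for doc_id, _ in postings})


# Invented sample data: one person and one location that both contain "Brown".
docs = [
    [("PersonName", "James Brown", 3)],
    [("Location", "Brown County", 0)],
]
index = build_entity_index(docs)
print(entity_terms("PersonName", "Peter Müller"))
print(search_entity(index, "PersonName", "Brown"))
```

Only the first document is returned, because the type token and the name token must sit on the same position; a plain keyword search for "Brown" would match both.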

Implementing Named Entity Recognition
- Technology: gazetteers, local grammars (rule based), regular expressions
- GATE: GUI and JAPE grammars (all other components substituted due to stability issues)
- Quality of German Namer: average F-score 84% on news and 78% on Wikipedia
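How the three techniques combine can be shown with a toy recognizer in Python. This is a sketch only: it does not use GATE or JAPE, and the gazetteer entries, the e-mail pattern and the single person-name rule are all simplified assumptions.

```python
import re

# Toy gazetteer of first names (assumption; real gazetteers are large curated lists).
FIRST_NAMES = {"Peter", "Bill"}

# Simplified e-mail pattern, good enough for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Two adjacent capitalized words; the rule below checks the first against the gazetteer.
NAME_PAIR_RE = re.compile(r"\b([A-ZÄÖÜ][\w-]+) ([A-ZÄÖÜ][\w-]+)\b")


def find_entities(text):
    """Return (entity_type, surface form) pairs found by regex and gazetteer rules."""
    entities = [("Email", match.group()) for match in EMAIL_RE.finditer(text)]
    for match in NAME_PAIR_RE.finditer(text):
        # Local-grammar style rule: known first name followed by a capitalized word.
        if match.group(1) in FIRST_NAMES:
            entities.append(("PersonName", match.group()))
    return entities


text = "Peter Müller met Bill Gates. Contact: info@example.org"
print(find_entities(text))
```

The name rule scans non-overlapping word pairs, so it can miss a name that directly follows another capitalized word; rule-based systems typically handle such cases with more specific grammar rules.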

Namer: GATE GUI

Namer: Orthomatcher

Namer: Normalization

Namer: Aggregation

Questions?
Dr. Christoph Goller, Director Research
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: christoph.goller@intrafind.de
Web: www.intrafind.de
IntraFind Software AG, Fraunhofer Straße 15, 82152 Planegg, Germany
www.intrafind.de/jobs