Solr-based Search & Automatic Tagging at Zeit Online: where Meta Data come from. ApacheCon Europe 2012. Dr. Christoph Goller, IntraFind Software AG

IntraFind Software AG
- Founded: October 2000
- More than 700 customers, mainly in Germany, Austria, and Switzerland
- Partner network (> 30 VAR and embedding partners)
- Employees: 30
- Lucene committers: B. Messer, C. Goller
Our Open Source Search Business
- Product company: ifinder, Topic Finder, Knowledge Map, Tagging Service
- Products combine open source components with in-house development
- Support (up to 7x24), services, training, stable API
Automatic Generation of Semantics
- Linguistic analyzers for most European languages
- Semantic search
- Named entity recognition
- Text classification
- Clustering
www.intrafind.de/jobs

Semantic Search @ Zeit Online
DIE ZEIT: a German national weekly newspaper
- Free access to 60 years of print and online publications, roughly 500,000 articles
Outline:
- Linguistically enhanced search based on Solr, using IntraFind morphological analyzers
- Automatic tagging (semantic annotation) of documents, based on the IntraFind Tagging Service
  - Statistical keyword extraction
  - Named entity recognition
  - Text classification
- Future improvements: automatic linking to open data using Apache Stanbol

Analysis / Tokenization
- Break the stream of characters into tokens / terms
- Normalization (e.g. case)
- Stop words
- Stemming
- Lemmatizer / decomposer
- Part-of-speech tagger
- Information extraction

Morphological Analyzer vs. Stemming
Morphological analyzer:
- Lemmatizer: maps words to their base forms
    English: going -> go (Verb), bought -> buy (Verb), bags -> bag (Noun), bacteria -> bacterium (Noun)
    German: lief -> laufen (Verb), rannte -> rennen (Verb), Bücher -> Buch (Noun), Taschen -> Tasche (Noun)
- Decomposer: splits compound words into their components
    Kinderbuch (children's book) -> Kind (Noun) + Buch (Noun)
    Versicherungsvertrag (insurance contract) -> Versicherung (Noun) + Vertrag (Noun)
    Holztisch (wooden table), Glastisch (glass table)
Stemmer: usually a simple algorithm
    going -> go
    king -> k ???
    Messer -> mess ???
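The contrast on this slide can be reproduced with a toy example. The sketch below is only an illustration: the lexicon holds a few entries copied from the slide, and the suffix list of the naive stemmer is a hypothetical stand-in, not the algorithm of any particular stemmer.

```python
# Toy lexicon: inflected form -> (base form, part of speech). Entries are from the slide.
LEXICON = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "bags": ("bag", "Noun"),
    "bacteria": ("bacterium", "Noun"),
    "lief": ("laufen", "Verb"),
    "bücher": ("Buch", "Noun"),
}

# Hypothetical suffix list for the naive stemmer (assumption, for illustration only).
SUFFIXES = ("ing", "er", "s")


def lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    entry = LEXICON.get(word.lower())
    return entry[0] if entry else word


def naive_stem(word):
    """Strip the first matching suffix without consulting any lexicon."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for w in ["going", "king", "Messer"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", lemmatize(w))
```

The stemmer gets "going" right but also cuts "king" and "Messer", while the lexicon lookup leaves words it does not know untouched.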

Implementing a Lemmatizer / Decomposer
- Mapping inflected forms to base forms for all lemmas of a language
- Finite state techniques
- German lexicon: about 100,000 base forms, 700,000 inflected forms
- Decomposition is done algorithmically (correct analysis first, spurious alternative second):
    Gipfelsturm: Gipfel+Sturm, not Gipfel+Turm
    Staffelei: no decomposition, not Staffel+Ei
    Leistungen: no decomposition, not Leis(e)+tun+Gen
    Messerattentat: Messer+Attentat, not Messe+Ratten+Tat
    Bundessteuerbehörde: Bund+Steuer+Behörde, not Bund+ess+teuer+Behörde
- Available languages: German, English, Spanish, French, Italian, Dutch, Russian, Polish, Serbo-Croatian, Greek (Chinese, Japanese, Arabic, Pashto)
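Why decomposition yields competing analyses can be seen in a small lexicon-driven splitter. This is a minimal sketch under stated assumptions: the lexicon is a hand-picked toy set, the only linking element considered is an optional "s", and no attempt is made to rank the alternatives. The slides do not describe how the actual decomposer chooses between them.

```python
# Toy lexicon of base forms (assumption: a real lexicon has about 100,000 entries).
LEXICON = {"gipfel", "sturm", "turm", "messer", "attentat", "messe", "ratten", "tat"}

# Linking elements allowed between two parts; "" means no linking element (assumption).
LINKING = ("", "s")


def decompose(word, lexicon=LEXICON):
    """Return every split of `word` into two or more lexicon entries."""
    results = []

    def split(rest, parts):
        if not rest:
            if len(parts) > 1:
                results.append(parts)
            return
        for end in range(1, len(rest) + 1):
            head = rest[:end]
            if head not in lexicon:
                continue
            tail = rest[end:]
            if not tail:
                split("", parts + [head])
                continue
            for link in LINKING:
                # Skip the linking element only if something follows it.
                if tail.startswith(link) and len(tail) > len(link):
                    split(tail[len(link):], parts + [head])

    split(word.lower(), [])
    return results


print(decompose("Gipfelsturm"))
print(decompose("Messerattentat"))
```

Both the intended reading and the spurious one come out of the same search, which is why a decomposer needs additional knowledge to pick the right analysis.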

Advantages of Morphological Analysis
- Combines high recall with high precision for search applications
- Improves subsequent statistical methods
- Base forms are better suited as descriptions for faceting / clustering / autocomplete / spelling correction than artificial stems
- Reliable lookup in lexicon resources
  - Thesaurus / ontologies
  - Cross-lingual search

Measuring Recall and Precision
- Nouns: recall / precision macro average for 40 German nouns, no compound analysis
- Verbs: recall / precision macro average for 30 German verbs, no compound analysis
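A macro average computes precision and recall per query word and then averages those values, so each word counts equally regardless of how many documents it matches. The sketch below shows the arithmetic; the document ID sets are made-up illustration data, not the evaluation data behind the slide.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def macro_average(results):
    """Average per-query precision and recall over (retrieved, relevant) pairs."""
    pairs = [precision_recall(retrieved, relevant) for retrieved, relevant in results]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


# Hypothetical results for two query words.
results = [
    ({1, 2}, {1, 2, 3, 4}),   # precise but incomplete
    ({5, 6, 7, 8}, {5}),      # complete but noisy
]
print(macro_average(results))
```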

Bad Precision with Algorithmic Stemmer

High Recall and High Precision with Morphological Analyzer

Solr Configuration: schema.xml
  <fieldType name="text-if" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index" class="org.apache.solr.analysis.intrafindlisaanalyzerdeindex"/>
    <analyzer type="query" class="org.apache.solr.analysis.intrafindlisaanalyzerdesearch"/>
  </fieldType>

Solr Configuration: solrconfig.xml
  <queryParser name="intrafindqueryparser" class="org.apache.solr.analysis.IntrafindQParserPlugin">
    <lst name="generalconfig">
      <float name="linguisticboost">5.0f</float>
      <bool name="disambiguationoncase">false</bool>
      <bool name="disambiguationonbaseequality">true</bool>
    </lst>
    <lst name="compositatreatment">
      <bool name="incompositasearch">true</bool>
      <int name="compositasloppyness">3</int>
      <float name="boostexact">1.5f</float>
    </lst>
  </queryParser>

Statistical Keyword Extraction
- Extract the most important keywords of a document using the TF*IDF measure
- Identify phrases
- Use POS (part-of-speech) tag patterns to identify good noun phrases
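The TF*IDF step can be sketched in a few lines. The example below is an assumption-laden illustration: it uses whitespace tokenization, a tiny hypothetical stop word list, and one common smoothed IDF variant. Phrase identification via POS tag patterns is not covered.

```python
import math
from collections import Counter

# Hypothetical stop word list (assumption, for illustration only).
STOPWORDS = {"the", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words and non-alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]


def top_keywords(doc, corpus, k=3):
    """Rank the terms of `doc` by term frequency times smoothed inverse document frequency."""
    tf = Counter(tokenize(doc))
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n_docs = len(corpus)
    scores = {
        term: freq * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, freq in tf.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(top_keywords(corpus[0], corpus, k=3))
```

Terms that occur in few documents of the corpus rank highest; here "mat" beats "cat" and "sat" because it appears in only one document.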

Named Entity Recognition (NER)
Automated extraction of information from unstructured data:
- People names
- Company names
- Brands from product lists
- Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eclass categories)
- Names of streets and locations
- Currency and accounting values
- Dates
- Phone numbers, email addresses, hyperlinks

Named Entities: Applications
- Facets
- Search for experts
- Additional query types
  - Index structure: additional tokens at the same position: N_PersonName, N_Peter Müller
  - Search for a person named Brown (semantic search)
  - Question answering / natural language queries
  - Search for a company near "founded" and "Bill Gates"
- Part of our Tagging Services
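The index structure mentioned here, extra tokens stored at the same position as the entity, can be modelled without Lucene. The following sketch is a simplified stand-in: the `Token` class, `index_with_entities` and `near` are hypothetical helpers, and only the token names `N_PersonName` and `N_Peter Müller` come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    position: int


def index_with_entities(words, entities):
    """Emit one token per word plus entity tokens at the entity's start position.

    `entities` maps a start word index to (entity type, full entity text).
    """
    tokens = []
    for pos, word in enumerate(words):
        tokens.append(Token(word, pos))
        if pos in entities:
            etype, full = entities[pos]
            tokens.append(Token("N_" + etype, pos))   # type token, same position
            tokens.append(Token("N_" + full, pos))    # normalized name, same position
    return tokens


def near(tokens, a, b, slop):
    """True if tokens `a` and `b` occur within `slop` positions of each other."""
    pos_a = [t.position for t in tokens if t.text == a]
    pos_b = [t.position for t in tokens if t.text == b]
    return any(abs(x - y) <= slop for x in pos_a for y in pos_b)


words = "Peter Müller founded the company".split()
tokens = index_with_entities(words, {0: ("PersonName", "Peter Müller")})
print(near(tokens, "N_PersonName", "founded", slop=3))
```

Because the type token shares a position with the words it annotates, a proximity query such as "any person near founded" works with ordinary positional matching.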

Semantic Search: Comparison with standard Search Engines
Question: Where are Audi's plants located? (Wo liegen Werke von Audi?)
NL-Search

Semantic Search: Comparison with standard Search Engines
Question: Who founded Microsoft? (Wer hat Microsoft gegründet?)
NL-Search

Implementing Named Entity Recognition
- Technology: gazetteers, local grammars (rule-based), regular expressions
- GATE: open source platform for NLP (gate.ac.uk)
- GUI and JAPE grammars (all other components substituted due to stability issues)
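Two of the three techniques listed, gazetteers and regular expressions, fit into a short sketch. Everything named here is an assumption for illustration: the gazetteer entries, the two patterns, and the `annotate` function. Local grammars as written in JAPE are not modelled.

```python
import re

# Hypothetical gazetteer: entity type -> known names.
GAZETTEER = {
    "Person": ["Bill Gates", "Peter Müller"],
    "Company": ["Microsoft", "Audi"],
}

# Hypothetical patterns for entity types that follow a surface format.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
}


def annotate(text):
    """Return (start, end, type, surface form) tuples sorted by start offset."""
    found = []
    for etype, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((m.start(), m.end(), etype, m.group()))
    for etype, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), etype, m.group()))
    return sorted(found)


text = "Bill Gates founded Microsoft on 04.04.1975; contact: info@example.com"
for annotation in annotate(text):
    print(annotation)
```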

Namer Orthomatcher

Namer Normalization

Namer Aggregation

Text Classification
- Goal: automatically assign documents to topics based on their content. Topics are defined by example documents.
- Applications:
  - News: newsletter management system
  - Spam filtering; mail / email classification
  - Product classification (online shops), ECLASS / UNSPSC
  - Subject area assignment for libraries & publishing companies
  - Opinion mining / sentiment detection
- Part of our Tagging Services

Text Classification Workflow
Learning phase:
  Documents with topic / class labels -> Tokenizer / Analyzer -> Indexing -> Feature extraction / selection -> Feature vectors of documents with topic labels -> Pattern recognition method -> Classifier parameters for topics 1..N
Classification phase:
  New document -> Feature vector of document -> Topic classifier -> Topic associations -> User

Lessons Learned
- Analysis / tokenization: normalization (e.g. morphological analyzers) and stop words improve classification
- Feature selection: TF*IDF, mutual information, covariance / chi-square, ...
  - Multiword phrases, positive & negative correlation
- Machine learning:
  - Goal: good generalization
  - Avoid overfitting: "entia non sunt multiplicanda praeter necessitatem" (Occam's Razor)
  - SVM: linear is enough
- Don't blindly trust manual classification by experts
- Statistics / machine learning results: test!

Required Features
- Training & test GUI needed
- Automatically identify inconsistencies in training & test data
  - Duplicates detection
  - Similarity search (More Like This)
- Automatic testing: cross-validation (multi-threaded!)
- Classification rules have to be readable
- False positive (and false negative) analysis, iterative training
- Clustering of false positives / false negatives
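Finding inconsistent training data can start with exact duplicates: if two documents have identical text but different topic labels, at least one label is questionable. The sketch below hashes normalized text with MD5 and groups by hash. The normalization step and the function names are assumptions; near duplicates would need a similarity search instead.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not hide duplicates."""
    return " ".join(text.lower().split())


def find_inconsistent_duplicates(docs):
    """Group (doc_id, text, topic) records by MD5 of the text; keep groups with conflicting topics."""
    groups = defaultdict(list)
    for doc_id, text, topic in docs:
        key = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
        groups[key].append((doc_id, topic))
    return [group for group in groups.values() if len({topic for _, topic in group}) > 1]


# Hypothetical training records.
docs = [
    (1, "Solr is fast", "search"),
    (2, "solr  is FAST", "sports"),
    (3, "Other text", "search"),
]
print(find_inconsistent_duplicates(docs))
```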

Product Classification: Example Rules
Server: einbauschächte^24.7 speicherspezifikation^22.1 tastatur^-0.7 monitortyp^21.5 socket^-9.2 -1.15
Workstation: monitortyp^28.8 arbeitsstation^38.8 cpu^0.1 tower^8.9 barebone^35.8 audio^3.7 eingang^5.2 out^6.5 core^9.0 agp^5.2 -2.1
PC: kleinbetrieb^7.9 personal^18.3 db-25^2.2 technology^5.6 cache^10.0 arbeitsstation^-28.1 dynamic^7.4 bereitgestelltes^25.7 dmi^5.5 ata-100^13.7 socket^6.2 wireless^2.5 16x^10.0 1/2h^13.1 nvidia^1.0 din^4.6 tasten^13.4 international^7.2 802.1p^8.1 level^-4.4 -1.5
Notebook: eingabeperipheriegeräte^64.0 1.3
Tablet PC: tc4200^16.4 tablet^6.9 konvertibel^10.6 multibay^4.6 itu^3.3 abb^2.7 digitalstift^8.5 flugzeug^1.8 1.75
Handheld: bildschirmauflösung^39.8 smartphone^8.1 ram^0.29 speicherkarten^0.53 telefon^0.35 -1.4
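Rules of this shape read like a linear classifier: each term carries a weight, and a document's score is a weighted sum. The sketch below evaluates two of the rules under assumptions the slide does not confirm: that the trailing number of each rule is an additive offset, that a term counts once if present, and that a positive score means the topic applies.

```python
# Two rules copied from the slide as (term weights, trailing value).
# Treating the trailing value as an additive offset is an assumption.
RULES = {
    "Server": (
        {"einbauschächte": 24.7, "speicherspezifikation": 22.1, "tastatur": -0.7,
         "monitortyp": 21.5, "socket": -9.2},
        -1.15,
    ),
    "Handheld": (
        {"bildschirmauflösung": 39.8, "smartphone": 8.1, "ram": 0.29,
         "speicherkarten": 0.53, "telefon": 0.35},
        -1.4,
    ),
}


def score(tokens, rule):
    """Sum the weights of rule terms present in the document, then add the offset."""
    weights, offset = rule
    present = set(tokens)
    return sum(w for term, w in weights.items() if term in present) + offset


def classify(tokens):
    """Return every topic whose score is positive (assumed decision threshold: 0)."""
    return [topic for topic, rule in RULES.items() if score(tokens, rule) > 0]


tokens = ["smartphone", "telefon", "ram"]
print(classify(tokens))
```

Readable rules like these make it easy to see why a document was assigned to a topic: the contributing terms and their weights can be listed directly.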

Pharmaceutical Newsletter: Highlighting Example

Implementation Details
- Training and test documents are stored in a Lucene index
- Information about topics is stored in a separate untokenized field
- Feature selection simply consists of comparing posting lists of topics and terms from the text content
- Consistency of manual topic assignment can be checked by using:
  - MD5 keys for duplicate checks
  - Lucene's similarity search for checking for near duplicates
- Feature vectors are generated from Lucene posting lists
- Training is done completely by LibSVM / LibLinear
    www.csie.ntu.edu.tw/~cjlin/libsvm
    www.csie.ntu.edu.tw/~cjlin/liblinear
- Instead of storing support vectors, hyperplanes are stored directly
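Comparing posting lists for feature selection amounts to counting how often a term and a topic occur in the same documents. The sketch below uses plain Python sets as stand-ins for Lucene posting lists and scores terms with the chi-square statistic, one of the measures named in the lessons learned. Function names and data are hypothetical.

```python
def chi_square(term_docs, topic_docs, n_docs):
    """Chi-square statistic for the 2x2 table of term presence vs. topic membership."""
    a = len(term_docs & topic_docs)   # term present, in topic
    b = len(term_docs - topic_docs)   # term present, not in topic
    c = len(topic_docs - term_docs)   # term absent, in topic
    d = n_docs - a - b - c            # term absent, not in topic
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_docs * (a * d - b * c) ** 2 / denom


def select_features(postings, topic_docs, n_docs, k):
    """Return the k terms whose posting lists are most strongly associated with the topic."""
    ranked = sorted(
        postings,
        key=lambda term: (-chi_square(postings[term], topic_docs, n_docs), term),
    )
    return ranked[:k]


# Hypothetical posting lists: term -> set of document IDs.
postings = {"solr": {0, 1, 2, 3, 4}, "the": {0, 1, 5, 6}}
topic_docs = {0, 1, 2, 3, 4}
print(select_features(postings, topic_docs, n_docs=10, k=1))
```

The statistic is high for both positive and negative correlation, so a term that reliably marks documents outside the topic is also selected.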

Tagging Service: Semantic Linking
Tagging Service: generates semantic tags automatically
- Combines:
  - Simple statistical tagging (TF*IDF) with noun phrase identification
  - Named entity recognition
  - Text classification
- Allows: blacklists / whitelists / boosting lists
Example: semantic linking for Zeit Online
  http://www.zeit.de/schlagworte
  http://www.zeit.de/schlagworte/themen
  http://www.zeit.de/schlagworte/personen
  http://www.zeit.de/schlagworte/organisationen
  http://www.zeit.de/schlagworte/orte

Tagging Service at ZEIT Online

Semantic Web, RDF Stores & Linked Data
- Semantic Web: proposed by Tim Berners-Lee, the inventor of the WWW
  - Idea: computers should be able to evaluate information according to its meaning, connect information, reason with it (inference), and generate new information
- RDF stores:
  - Originally designed as a metadata model: machine-readable information
  - Triples: subject-predicate-object
  - General data model for knowledge representation
  - Query and inference languages: SPARQL
- Linked Open Data: a method of publishing structured data so that it can be interlinked
  - Uses RDF
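The subject-predicate-object model is small enough to show with plain tuples. The sketch below is a toy in-memory store, not an RDF library: the identifiers are hypothetical shorthand rather than real URIs, and `match` only mimics a single triple pattern of the kind a SPARQL query is built from.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# Identifiers are hypothetical shorthand, not real URIs.
TRIPLES = {
    ("DIE_ZEIT", "type", "Newspaper"),
    ("DIE_ZEIT", "publishedIn", "Germany"),
    ("Tim_Berners-Lee", "invented", "WWW"),
}


def match(pattern, triples=TRIPLES):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# Everything known about DIE_ZEIT.
print(match(("DIE_ZEIT", None, None)))
# Who invented what?
print(match((None, "invented", None)))
```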

Linked Open Data

Questions?
Dr. Christoph Goller, Director Research
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: christoph.goller@intrafind.de
Web: www.intrafind.de
IntraFind Software AG, Landsberger Straße 368, 80687 München, Germany
www.intrafind.de/jobs