Intrafind Text Analytics. Research Department, Dr. Christoph Goller

Research Department, Dr. Christoph Goller

IntraFind Software AG

IntraFind Software AG Founding of the company: October 2000 More than 700 customers mainly in Germany, Austria, and Switzerland Partner Network (> 30 VAR & embedding partners) Employees: 30 Lucene Committers: B. Messer, C. Goller Independent Software Vendor: Products: ifinder, Topic Finder, Knowledge Map, Tagging Service, Products are a combination of Open Source Components and in-house Development Support (up to 7x24), Services, Training, Stable API Automatic Generation of Semantics Linguistic Analyzers for most European Languages Semantic Search Named Entity Recognition Text Classification Clustering www.intrafind.de/jobs

Morphological Analyzer vs. Stemming Morphological Analyzer: Lemmatizer: maps words to their base forms English going -> go (Verb) bought -> buy (Verb) bags -> bag (Noun) bacteria -> bacterium (Noun) German lief -> laufen (Verb) rannte -> rennen (Verb) Bücher -> Buch (Noun) Taschen -> Tasche (Noun) Decomposer: decomposes words into their compounds Kinderbuch (children s book) -> Kind (Noun) Buch (Noun) Versicherungsvertrag (contract of insurance) -> Versicherung (Noun) Vertrag (Noun) Holztisch (wooden table), Glastisch (table made of glass) Stemmer: usually simple algorithm going -> go king -> k??????????? Messer -> mess??????

Bad Precision with Algorithmic Stemmer

High Recall and High Precision with Morphological Analyzer (ZEIT Online)

Qualitätsmaße 1 F-Measure ist ein gebräuchliches Maß im Bereich Document Retrieval: f-measure = 2 * recall * precision / (recall + precision) recall, precision und f-measure liegen immer im Intervall [0,1]

Measuring Recall and Precision Nouns Recall / Precision Macro Average for 40 German nouns no compound analysis Verbs Recall / Precision Macro Average for 30 German verbs no compound analysis

Implementing a Lemmatizer / Decomposer Mapping inflected forms to base forms for all lemmas of a language Finite State Techniques German lexicon: about 100,000 base forms, 700,000 inflected forms Decomposition done algorithmically: Gipfelsturm Gipfel+Sturm Gipfel+Turm Staffelei keine Zerlegung Staffel+Ei Leistungen keine Zerlegung Leis(e)+tun+Gen Messerattentat Messer+Attentat Messe+Ratten+Tat Bundessteuerbehörde Bund+Steuer+Behörde Bund+ess+teuer+Behörde

Advantages of Morphological Analysis Combines high Recall with high Precision for Search Applications Improves subsequent statistical methods Better suited as descriptions for semantic search / faceting / clustering than artificial stems, categories used for phrase detection Reliable Lookup in Lexicon Resources Thesaurus / Ontologies Cross-lingual search Available Languages: German, English, Spanish, French, Italian, Dutch, Russian, Polish, Serbo- Croatian, Greek, (Chinese, Arabian, Pasthu)

Named Entity Recognition (NER) Automated extraction of information from unstructured data People names Company names Brands from product lists Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eclass categories) Names of streets and locations Currency and accounting values Dates Phone numbers, email addresses, hyperlinks Search query: Which people / companies strongly correlate with each other? 12

Information Extraction: Named Entity Recognition Goal: Extract Named Entities from text Person-Name: Names of Persons (first and second name) Person-Role: Profession, Position Title Person: Person-Role + Person-Name Organization: Location: Org_Finance Org_Political Org_Company Org_Association Org_Official Org_Education City Region Country Continent Nature

Information Extraction: Named Entity Recognition Date, Time (absolute, relative range) Currency / Prices Metric Value (diverse numbers): Weight Length, Height, Width Area Volume Frequency Speed Temperature Address und Communication: complete Address including components (ZIP-Code, Street, Number, City): Phone, Fax, E-Mail, URL Customer-Specific, e.g.: Raw Materials Products GIS-Coordinates Skills

Named Entities: Applications in Search Applications: Facets (Demo) Search for Experts Additional query types Index structure: Additional tokens on the same position: N_PersonName N_Peter Müller N_Peter N_Müller Search for a person named Brown Search for a company near founded and Bill Gates Question Answering / Natural Language Queries (Demo)

Implementing Named Entity Recognition Technology: gazetteers, local grammars (rule based), regular expressions Gate: GUI and Jape Grammars (all other components substituted due to stability issues) Quality of German Namer: average f-score 84% on news and 78% on wikipedia

Namer Orthomatcher 17

Namer Orthomatcher 18

Namer Normalization 19

Namer Aggregation 20

Named Entities in Facets

Semantic Search / Question Answering: Comparison with standard Search Engines Frage: Wo liegen Werke von Audi? NL-Search 22

Semantic Search / Question Answering: Comparison with standard Search Engines Frage: Wer hat Microsoft gegründet? NL-Search 23

Example: Legal Entities court docket date

RDF Output and linking of Entities blue boxes = classes red = properties green = extracted texts and feature of entities

Linked Open Data

Text Clustering

Text Classification Goal: Automatically assign documents to topics based on their content Topics are defined by example documents Applications: News: Newsletter-Management-System Spam Filtering; Mail/Email Classification Product Classification (Online Shops), ECLASS /UNSPSC Subject Area Assignment for Libraries & Publishing Companies Opinion Mining / Sentiment Detection Part of our Tagging Services

Architecture of an Automatic Text Classifier Learning Phase Documents with Topic/ Class Labels Tokenizer / Analyzer Indexing Feature Extraction/ Selection Feature- Vectors of Documents with Topic Labels Classifier Parameters for Topics 1..N Pattern Recognition Method New Document Feature- Vector of Document Topic Classifier Topic Associations Classification Phase User

Lessons Learned Analysis / Tokenization: Normalization (e.g. Morphological Analyzers) and Stopwords improve Classification Feature Selection: TF*IDF, Mutual Information, Covariance / Chi Square,... Multiword Phrases, positive & negative Correlation Machine Learning: Goal: Good Generalization Avoid Overfitting: entia non sunt multiplicanda praeter necessitatem (Occam s Razor) SVM: linear is enough Don t trust blindly in Manual Classification by experts Statistics / Machine Learning Results: Test!

TopicFinder Features Training & Test GUI needed Automatically identify inconsistencies in training & test data Detection of Duplicates Similarity Search (More Like This) Automatic Testing: Cross-Validation (Multi-Threaded!) Not a Blackbox: Classification Rules have to be readable False Positive and (False Negative) Analysis Iterative Training Clustering of False Positive / False Negative

Cross-Validation: Automatically test on Data not used for Training

Example Rules: Legal Taxonomy Age Discrimination: age discrimination ^15.0 ADEA^12.2 age discrimination claim ^12.5 claim of age ^8.3 age discrimination employment act ^6.9 age race ^6.3... Discrimination: religious discrimination ^44.5 muslim^23.4 alleged religious discrimination ^17.6 title viii to religious ^23.3 god^10.2 religion^11.7 sex^-8.6 oklahoma^-13.2 christ^-7.1 jewish^8.6 bona fide religious ^18.6... Employee Status: independent contractor ^14.1 economic reality ^8.6 fair labour standard ^3.6 collective bargain ^- 2.4 Oakwood Healthcare ^12.1... Employee Leave: FMLA^9.13 medical leave 3.0 FMLA leave ^4.4 sick leave ^2.3 unpaid leave ^1.1 serious health condition ^2.1... Overtime: overtime^12.1 FLSA^9.7 pay overtime ^5.3 overtime compensation ^6.6 interference^-11.1 common policy ^8.8 modest^10.7 unpaided overtime ^7.9 hourly^4.1 forty hour ^5.7 minimum wage ^-4.3...

Pharmaceutical Newsletter: Highlighting Example

Performance Experience Training-Performance Highly efficient SVM implementations LibSVM and LibLinear Multithreaded Training with linear speedup tested up to 64 processors (IBM P7 architecture) Classification Performance Batch and Online Mode available Batch Mode for classification of big document collections: 1. Build a Lucene Index (50-100 GB/h plain text on standard hardware) 2. Execute one search for each Document Type (negligible)

TopicFinder: More Features Documents for Training and Testing are stored in a Lucene Index Information about topics is stored in a separate untokenized field Feature Selection simply consists of comparing posting lists of topics and terms from the content of the text Consistency of manual topic assignment can be checked by Using MD5-keys for duplicate checks Lucene s Similarity Search for checking for near duplicates Feature vectors are generated from Lucene posting lists Training is completely done by LibSVM / LibLinear Instead of storing support vectors, hyperplanes are stored directly

Vector-Space Model for Documents and Queries Vektor-Space Model: Document 1: The boy on the bridge Document 2: The boy plays chess Term / Document Matrix: Boy Bridge Chess the on plays Document 1 1 1 0 2 1 0 Document 2 1 0 1 2 0 1 Cosinus Similarity: Queries treated as simply very short documents Fulltext-Search : direct product of query vector with all document vectors Document-Score: Cosinus-Similarity

Hyperplane Query Hyperplane Equation: direct product of two vectors minus bias HyperplaneQuery: generalized BooleanQuery no coord, no idf, no querynorm A complete index may be classified by one simple search Classifying one document: Build an 1-document index Apply Classification Queries Many topics: Store Queries in Index (Term Boosts as Payloads) Apply Documents as Queries

Implementation Details Apache Lucene (http://lucene.apache.org/): Built in late 90 s by Doug Cutting. Apache release 2001 State of the art Java library for indexing and ranking Wide acceptance by 2005 LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) Authors: Chih-Chung Chang and Chih-Jen Lin NIPS 2003 feature selection challenge (third place). Full SVM implementation in C++ and Java License similar to the Apache License LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/): Machine Learning Group at National Taiwan University Optimized for the linear case (hyperplanes) Same License as LibSVM

ifinder: Semantic Search and Navigation Unique Search Box Result from Structured Data Facets (correlated terms) Result from Unstructured Data Facets (named entities) Facets (other meta-data)

Tagging Service: Semantic Linking Tagging Service: Generates semantic tags automatically combines: Simple Statistical Tagging (TF*IDF) with Noun Phrase Identification Named Entity Recognition Text Classification Example: Semantic Linking for Zeit Online http://www.zeit.de/schlagworte http://www.zeit.de/schlagworte/themen http://www.zeit.de/schlagworte/personen http://www.zeit.de/schlagworte/organisationen http://www.zeit.de/schlagworte/orte 41

Semantic Linking, ZEIT Online Keyword: Arbeit

Semantic Linking Keyword: Thomas Gottschalk