Intrafind Text Analytics. Research Department, Dr. Christoph Goller

Transcription

1 Research Department, Dr. Christoph Goller

2 IntraFind Software AG

3 IntraFind Software AG Founding of the company: October 2000 More than 700 customers mainly in Germany, Austria, and Switzerland Partner Network (> 30 VAR & embedding partners) Employees: 30 Lucene Committers: B. Messer, C. Goller Independent Software Vendor: Products: ifinder, Topic Finder, Knowledge Map, Tagging Service, Products are a combination of Open Source Components and in-house Development Support (up to 7x24), Services, Training, Stable API Automatic Generation of Semantics Linguistic Analyzers for most European Languages Semantic Search Named Entity Recognition Text Classification Clustering

4 Morphological Analyzer vs. Stemming Morphological Analyzer: Lemmatizer: maps words to their base forms English going -> go (Verb) bought -> buy (Verb) bags -> bag (Noun) bacteria -> bacterium (Noun) German lief -> laufen (Verb) rannte -> rennen (Verb) Bücher -> Buch (Noun) Taschen -> Tasche (Noun) Decomposer: decomposes words into their compounds Kinderbuch (children s book) -> Kind (Noun) Buch (Noun) Versicherungsvertrag (contract of insurance) -> Versicherung (Noun) Vertrag (Noun) Holztisch (wooden table), Glastisch (table made of glass) Stemmer: usually simple algorithm going -> go king -> k??????????? Messer -> mess??????

5 Bad Precision with Algorithmic Stemmer

6 High Recall and High Precision with Morphological Analyzer (ZEIT Online)

7 High Recall and High Precision with Morphological Analyzer (ZEIT Online)

8 Qualitätsmaße 1 F-Measure ist ein gebräuchliches Maß im Bereich Document Retrieval: f-measure = 2 * recall * precision / (recall + precision) recall, precision und f-measure liegen immer im Intervall [0,1]

9 Measuring Recall and Precision Nouns Recall / Precision Macro Average for 40 German nouns no compound analysis Verbs Recall / Precision Macro Average for 30 German verbs no compound analysis

10 Implementing a Lemmatizer / Decomposer Mapping inflected forms to base forms for all lemmas of a language Finite State Techniques German lexicon: about 100,000 base forms, 700,000 inflected forms Decomposition done algorithmically: Gipfelsturm Gipfel+Sturm Gipfel+Turm Staffelei keine Zerlegung Staffel+Ei Leistungen keine Zerlegung Leis(e)+tun+Gen Messerattentat Messer+Attentat Messe+Ratten+Tat Bundessteuerbehörde Bund+Steuer+Behörde Bund+ess+teuer+Behörde

11 Advantages of Morphological Analysis Combines high Recall with high Precision for Search Applications Improves subsequent statistical methods Better suited as descriptions for semantic search / faceting / clustering than artificial stems, categories used for phrase detection Reliable Lookup in Lexicon Resources Thesaurus / Ontologies Cross-lingual search Available Languages: German, English, Spanish, French, Italian, Dutch, Russian, Polish, Serbo- Croatian, Greek, (Chinese, Arabian, Pasthu)

12 Named Entity Recognition (NER) Automated extraction of information from unstructured data People names Company names Brands from product lists Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eclass categories) Names of streets and locations Currency and accounting values Dates Phone numbers, addresses, hyperlinks Search query: Which people / companies strongly correlate with each other? 12

13 Information Extraction: Named Entity Recognition Goal: Extract Named Entities from text Person-Name: Names of Persons (first and second name) Person-Role: Profession, Position Title Person: Person-Role + Person-Name Organization: Location: Org_Finance Org_Political Org_Company Org_Association Org_Official Org_Education City Region Country Continent Nature

14 Information Extraction: Named Entity Recognition Date, Time (absolute, relative range) Currency / Prices Metric Value (diverse numbers): Weight Length, Height, Width Area Volume Frequency Speed Temperature Address und Communication: complete Address including components (ZIP-Code, Street, Number, City): Phone, Fax, , URL Customer-Specific, e.g.: Raw Materials Products GIS-Coordinates Skills

15 Named Entities: Applications in Search Applications: Facets (Demo) Search for Experts Additional query types Index structure: Additional tokens on the same position: N_PersonName N_Peter Müller N_Peter N_Müller Search for a person named Brown Search for a company near founded and Bill Gates Question Answering / Natural Language Queries (Demo)

16 Implementing Named Entity Recognition Technology: gazetteers, local grammars (rule based), regular expressions Gate: GUI and Jape Grammars (all other components substituted due to stability issues) Quality of German Namer: average f-score 84% on news and 78% on wikipedia

17 Namer Orthomatcher 17

18 Namer Orthomatcher 18

19 Namer Normalization 19

20 Namer Aggregation 20

21 Named Entities in Facets

22 Semantic Search / Question Answering: Comparison with standard Search Engines Frage: Wo liegen Werke von Audi? NL-Search 22

23 Semantic Search / Question Answering: Comparison with standard Search Engines Frage: Wer hat Microsoft gegründet? NL-Search 23

24 Example: Legal Entities court docket date

25 RDF Output and linking of Entities blue boxes = classes red = properties green = extracted texts and feature of entities

26 Linked Open Data

27 Text Clustering

28 Text Classification Goal: Automatically assign documents to topics based on their content Topics are defined by example documents Applications: News: Newsletter-Management-System Spam Filtering; Mail/ Classification Product Classification (Online Shops), ECLASS /UNSPSC Subject Area Assignment for Libraries & Publishing Companies Opinion Mining / Sentiment Detection Part of our Tagging Services

29 Architecture of an Automatic Text Classifier Learning Phase Documents with Topic/ Class Labels Tokenizer / Analyzer Indexing Feature Extraction/ Selection Feature- Vectors of Documents with Topic Labels Classifier Parameters for Topics 1..N Pattern Recognition Method New Document Feature- Vector of Document Topic Classifier Topic Associations Classification Phase User

30 Lessons Learned Analysis / Tokenization: Normalization (e.g. Morphological Analyzers) and Stopwords improve Classification Feature Selection: TF*IDF, Mutual Information, Covariance / Chi Square,... Multiword Phrases, positive & negative Correlation Machine Learning: Goal: Good Generalization Avoid Overfitting: entia non sunt multiplicanda praeter necessitatem (Occam s Razor) SVM: linear is enough Don t trust blindly in Manual Classification by experts Statistics / Machine Learning Results: Test!

31 TopicFinder Features Training & Test GUI needed Automatically identify inconsistencies in training & test data Detection of Duplicates Similarity Search (More Like This) Automatic Testing: Cross-Validation (Multi-Threaded!) Not a Blackbox: Classification Rules have to be readable False Positive and (False Negative) Analysis Iterative Training Clustering of False Positive / False Negative

32 Cross-Validation: Automatically test on Data not used for Training

33 Example Rules: Legal Taxonomy Age Discrimination: age discrimination ^15.0 ADEA^12.2 age discrimination claim ^12.5 claim of age ^8.3 age discrimination employment act ^6.9 age race ^ Discrimination: religious discrimination ^44.5 muslim^23.4 alleged religious discrimination ^17.6 title viii to religious ^23.3 god^10.2 religion^11.7 sex^-8.6 oklahoma^-13.2 christ^-7.1 jewish^8.6 bona fide religious ^ Employee Status: independent contractor ^14.1 economic reality ^8.6 fair labour standard ^3.6 collective bargain ^- 2.4 Oakwood Healthcare ^ Employee Leave: FMLA^9.13 medical leave 3.0 FMLA leave ^4.4 sick leave ^2.3 unpaid leave ^1.1 serious health condition ^ Overtime: overtime^12.1 FLSA^9.7 pay overtime ^5.3 overtime compensation ^6.6 interference^-11.1 common policy ^8.8 modest^10.7 unpaided overtime ^7.9 hourly^4.1 forty hour ^5.7 minimum wage ^

34 Pharmaceutical Newsletter: Highlighting Example

35 Performance Experience Training-Performance Highly efficient SVM implementations LibSVM and LibLinear Multithreaded Training with linear speedup tested up to 64 processors (IBM P7 architecture) Classification Performance Batch and Online Mode available Batch Mode for classification of big document collections: 1. Build a Lucene Index ( GB/h plain text on standard hardware) 2. Execute one search for each Document Type (negligible)

36 TopicFinder: More Features Documents for Training and Testing are stored in a Lucene Index Information about topics is stored in a separate untokenized field Feature Selection simply consists of comparing posting lists of topics and terms from the content of the text Consistency of manual topic assignment can be checked by Using MD5-keys for duplicate checks Lucene s Similarity Search for checking for near duplicates Feature vectors are generated from Lucene posting lists Training is completely done by LibSVM / LibLinear Instead of storing support vectors, hyperplanes are stored directly

37 Vector-Space Model for Documents and Queries Vektor-Space Model: Document 1: The boy on the bridge Document 2: The boy plays chess Term / Document Matrix: Boy Bridge Chess the on plays Document Document Cosinus Similarity: Queries treated as simply very short documents Fulltext-Search : direct product of query vector with all document vectors Document-Score: Cosinus-Similarity

38 Hyperplane Query Hyperplane Equation: direct product of two vectors minus bias HyperplaneQuery: generalized BooleanQuery no coord, no idf, no querynorm A complete index may be classified by one simple search Classifying one document: Build an 1-document index Apply Classification Queries Many topics: Store Queries in Index (Term Boosts as Payloads) Apply Documents as Queries

39 Implementation Details Apache Lucene ( Built in late 90 s by Doug Cutting. Apache release 2001 State of the art Java library for indexing and ranking Wide acceptance by 2005 LibSVM ( Authors: Chih-Chung Chang and Chih-Jen Lin NIPS 2003 feature selection challenge (third place). Full SVM implementation in C++ and Java License similar to the Apache License LibLinear ( Machine Learning Group at National Taiwan University Optimized for the linear case (hyperplanes) Same License as LibSVM

40 ifinder: Semantic Search and Navigation Unique Search Box Result from Structured Data Facets (correlated terms) Result from Unstructured Data Facets (named entities) Facets (other meta-data)

41 Tagging Service: Semantic Linking Tagging Service: Generates semantic tags automatically combines: Simple Statistical Tagging (TF*IDF) with Noun Phrase Identification Named Entity Recognition Text Classification Example: Semantic Linking for Zeit Online

42 Semantic Linking, ZEIT Online Keyword: Arbeit

43 Semantic Linking Keyword: Thomas Gottschalk