IntraFind Text Analytics. Research Department, Dr. Christoph Goller



IntraFind Software AG

IntraFind Software AG
- Founded: October 2000
- More than 700 customers, mainly in Germany, Austria, and Switzerland
- Partner network (> 30 VAR and embedding partners)
- Employees: 30
- Lucene committers: B. Messer, C. Goller
- Independent software vendor
  - Products: ifinder, Topic Finder, Knowledge Map, Tagging Service
  - Products combine open source components with in-house development
  - Support (up to 7x24), services, training, stable API
- Automatic generation of semantics
  - Linguistic analyzers for most European languages
  - Semantic search
  - Named entity recognition
  - Text classification
  - Clustering
- www.intrafind.de/jobs

Morphological Analyzer vs. Stemming
- Morphological analyzer
  - Lemmatizer: maps words to their base forms
    - English: going -> go (Verb), bought -> buy (Verb), bags -> bag (Noun), bacteria -> bacterium (Noun)
    - German: lief -> laufen (Verb), rannte -> rennen (Verb), Bücher -> Buch (Noun), Taschen -> Tasche (Noun)
  - Decomposer: splits compound words into their components
    - Kinderbuch (children's book) -> Kind (Noun) + Buch (Noun)
    - Versicherungsvertrag (contract of insurance) -> Versicherung (Noun) + Vertrag (Noun)
    - Holztisch (wooden table), Glastisch (table made of glass)
- Stemmer: usually a simple algorithm
  - going -> go
  - king -> k (?)
  - Messer -> mess (?)
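The contrast between a stemmer and a lemmatizer can be sketched in a few lines of Python. This is an illustration only, not the IntraFind implementation: the suffix list and the tiny lexicon are assumptions chosen to reproduce the examples on the slide, and the lexicon entries for "king" and "Messer" are added here to show that a lexicon lookup leaves such words intact.

```python
# Hypothetical suffix list for a naive stemmer (assumption, not a real algorithm).
SUFFIXES = ("ing", "er", "s")

# Hypothetical mini-lexicon: inflected form -> (base form, part of speech).
LEMMAS = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "bags": ("bag", "Noun"),
    "bacteria": ("bacterium", "Noun"),
    "king": ("king", "Noun"),
    "messer": ("Messer", "Noun"),
}


def naive_stem(word):
    """Strip the first matching suffix, without consulting any lexicon."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEMMAS.get(word.lower(), (word, None))


for w in ("going", "king", "Messer", "bought"):
    print(w, "| stem:", naive_stem(w), "| lemma:", lemmatize(w))
```

The stemmer handles "going" but cuts "king" and "Messer" down to fragments that collide with unrelated words, which is the precision problem the following slides illustrate.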

Bad Precision with Algorithmic Stemmer

High Recall and High Precision with Morphological Analyzer (ZEIT Online)


Quality Measures 1
- The F-measure is a common measure in document retrieval: f-measure = 2 * recall * precision / (recall + precision)
- Recall, precision, and f-measure always lie in the interval [0,1]
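As a small worked example, the F-measure formula above can be written as a Python function. The function name and the convention of returning 0.0 when both inputs are zero are assumptions made for this sketch.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision: 2 * r * p / (r + p)."""
    if recall + precision == 0:
        # Assumption: define the F-measure as 0 when both values are 0.
        return 0.0
    return 2 * recall * precision / (recall + precision)


print(f_measure(1.0, 0.5))
```

Because it is a harmonic mean, the F-measure stays close to the smaller of the two values, so a system cannot score well by maximizing recall alone.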

Measuring Recall and Precision
- Nouns: recall / precision macro average for 40 German nouns, no compound analysis
- Verbs: recall / precision macro average for 30 German verbs, no compound analysis
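A macro average computes recall and precision separately for each test word and then takes the unweighted mean, so every word counts equally regardless of how many documents it matches. A minimal sketch, assuming the per-word results are available as (recall, precision) pairs; the sample numbers are invented for illustration and are not results from the talk.

```python
def macro_average(per_word):
    """per_word maps a query word to its (recall, precision) pair."""
    if not per_word:
        raise ValueError("need at least one word")
    n = len(per_word)
    avg_recall = sum(r for r, _ in per_word.values()) / n
    avg_precision = sum(p for _, p in per_word.values()) / n
    return avg_recall, avg_precision


# Invented example values, purely illustrative.
sample = {"Buch": (1.0, 0.5), "Tasche": (0.5, 1.0)}
print(macro_average(sample))
```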

Implementing a Lemmatizer / Decomposer
- Mapping inflected forms to base forms for all lemmas of a language
- Finite state techniques
- German lexicon: about 100,000 base forms, 700,000 inflected forms
- Decomposition is done algorithmically:

  Word                  Correct                  Possible but wrong
  Gipfelsturm           Gipfel+Sturm             Gipfel+Turm
  Staffelei             no decomposition         Staffel+Ei
  Leistungen            no decomposition         Leis(e)+tun+Gen
  Messerattentat        Messer+Attentat          Messe+Ratten+Tat
  Bundessteuerbehörde   Bund+Steuer+Behörde      Bund+ess+teuer+Behörde
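The ambiguity shown in these examples appears as soon as a decomposer enumerates every way of covering a word with lexicon entries. The sketch below is a plain recursive search over a toy lexicon with an optional linking "s"; it is an assumption-laden illustration, not the finite state implementation the slide refers to, and it makes no attempt to rank the candidates.

```python
# Toy lexicon and linking elements; both are assumptions for this sketch.
LEXICON = {"gipfel", "sturm", "turm", "messer", "attentat", "messe", "ratten", "tat"}
LINKERS = ("", "s")


def decompose(word, lexicon=LEXICON):
    """Return every split of `word` into lexicon entries (linkers are dropped)."""
    results = []

    def split(rest, parts):
        if not rest:
            results.append(parts)
            return
        for end in range(1, len(rest) + 1):
            head, tail = rest[:end], rest[end:]
            if head not in lexicon:
                continue
            for linker in LINKERS:
                if not tail.startswith(linker):
                    continue
                remainder = tail[len(linker):]
                if linker and not remainder:
                    continue  # a linking element must be followed by another part
                split(remainder, parts + [head])

    split(word.lower(), [])
    return results


print(decompose("Gipfelsturm"))
print(decompose("Messerattentat"))
```

Both the intended and the spurious analysis are returned for each word, which is why a real decomposer needs additional knowledge to choose between them.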

Advantages of Morphological Analysis
- Combines high recall with high precision for search applications
- Improves subsequent statistical methods
- Base forms are better suited as descriptions for semantic search / faceting / clustering than artificial stems; categories are used for phrase detection
- Reliable lookup in lexicon resources
  - Thesaurus / ontologies
  - Cross-lingual search
- Available languages: German, English, Spanish, French, Italian, Dutch, Russian, Polish, Serbo-Croatian, Greek, (Chinese, Arabic, Pashto)

Named Entity Recognition (NER)
- Automated extraction of information from unstructured data
  - People names
  - Company names
  - Brands from product lists
  - Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eclass categories)
  - Names of streets and locations
  - Currency and accounting values
  - Dates
  - Phone numbers, email addresses, hyperlinks
- Search query: Which people / companies strongly correlate with each other?

Information Extraction: Named Entity Recognition
- Goal: extract named entities from text
- Person-Name: names of persons (first and second name)
- Person-Role: profession, position title
- Person: Person-Role + Person-Name
- Organization: Org_Finance, Org_Political, Org_Company, Org_Association, Org_Official, Org_Education
- Location: City, Region, Country, Continent, Nature

Information Extraction: Named Entity Recognition
- Date, Time (absolute, relative, range)
- Currency / Prices
- Metric Value (diverse numbers): Weight, Length, Height, Width, Area, Volume, Frequency, Speed, Temperature
- Address and Communication
  - Complete address including components (ZIP code, street, number, city)
  - Phone, fax, e-mail, URL
- Customer-specific, e.g.: raw materials, products, GIS coordinates, skills

Named Entities: Applications in Search
- Applications
  - Facets (Demo)
  - Search for experts
  - Additional query types
- Index structure: additional tokens on the same position
  - N_PersonName, N_Peter Müller, N_Peter, N_Müller
  - Search for a person named "Brown"
  - Search for a company near "founded" and "Bill Gates"
- Question answering / natural language queries (Demo)
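The idea of indexing entity tokens on the same position as the words they annotate can be sketched with a plain list of (position, term) pairs. How IntraFind assigns positions in detail is not stated on the slide, so the layout below (type token and full-name token at the first word, one N_ token per word) is an assumption, as are the function and parameter names.

```python
def index_tokens(tokens, entities):
    """Build a positional token stream.

    tokens:   list of words in document order
    entities: list of (start, end, entity_type) token spans, end exclusive
    """
    stream = [(pos, tok.lower()) for pos, tok in enumerate(tokens)]
    for start, end, entity_type in entities:
        # Entity-level tokens share the position of the first word (assumption).
        stream.append((start, "N_" + entity_type))
        stream.append((start, "N_" + " ".join(tokens[start:end])))
        # One marked token per word, each on that word's own position.
        for pos in range(start, end):
            stream.append((pos, "N_" + tokens[pos]))
    # sorted() is stable, so terms keep their insertion order within a position.
    return sorted(stream, key=lambda item: item[0])


tokens = ["Yesterday", "Peter", "Müller", "spoke"]
for position, term in index_tokens(tokens, [(1, 3, "PersonName")]):
    print(position, term)
```

Because the marked tokens occupy the same positions as the original words, ordinary phrase and proximity queries can mix plain terms with entity terms, for example a term such as "founded" near an N_PersonName token.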

Implementing Named Entity Recognition
- Technology: gazetteers, local grammars (rule based), regular expressions
- GATE: GUI and JAPE grammars (all other components substituted due to stability issues)
- Quality of the German Namer: average f-score of 84% on news and 78% on Wikipedia
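A combination of a gazetteer with regular expressions can be illustrated without GATE in plain Python. The gazetteer contents, the patterns, and the type labels below are assumptions made for this sketch; real rule sets such as JAPE grammars are far more detailed.

```python
import re

# Hypothetical first-name gazetteer (assumption).
FIRST_NAMES = {"Peter", "Anna"}

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")  # German style, e.g. 04.06.2012
# Local rule: a gazetteer first name followed by one capitalized word.
PERSON = re.compile(
    r"\b(?:%s) [A-ZÄÖÜ][\w-]+" % "|".join(sorted(re.escape(n) for n in FIRST_NAMES))
)


def extract_entities(text):
    """Return a list of (entity_type, matched_text) pairs."""
    entities = []
    for label, pattern in (("Email", EMAIL), ("Date", DATE), ("PersonName", PERSON)):
        for match in pattern.finditer(text):
            entities.append((label, match.group()))
    return entities


print(extract_entities("Peter Müller wrote to anna@example.org on 04.06.2012."))
```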

Namer Orthomatcher

Namer Normalization

Namer Aggregation

Named Entities in Facets

Semantic Search / Question Answering: Comparison with Standard Search Engines. Question: Where are Audi's plants located? (Wo liegen Werke von Audi?) NL-Search

Semantic Search / Question Answering: Comparison with Standard Search Engines. Question: Who founded Microsoft? (Wer hat Microsoft gegründet?) NL-Search

Example: Legal Entities (court, docket, date)

RDF Output and Linking of Entities
- Blue boxes = classes
- Red = properties
- Green = extracted texts and features of entities

Linked Open Data

Text Clustering

Text Classification
- Goal: automatically assign documents to topics based on their content
- Topics are defined by example documents
- Applications:
  - News: newsletter management system
  - Spam filtering; mail / email classification
  - Product classification (online shops), ECLASS / UNSPSC
  - Subject area assignment for libraries and publishing companies
  - Opinion mining / sentiment detection
  - Part of our Tagging Services

Architecture of an Automatic Text Classifier
- Learning phase: documents with topic / class labels -> tokenizer / analyzer -> indexing -> feature extraction / selection -> feature vectors of documents with topic labels -> pattern recognition method -> classifier parameters for topics 1..N
- Classification phase: new document -> feature vector of document -> topic classifier -> topic associations -> user

Lessons Learned
- Analysis / tokenization: normalization (e.g. morphological analyzers) and stopwords improve classification
- Feature selection: TF*IDF, mutual information, covariance / chi-square, ...
  - Multiword phrases, positive and negative correlation
- Machine learning
  - Goal: good generalization
  - Avoid overfitting: "entia non sunt multiplicanda praeter necessitatem" (Occam's Razor)
  - SVM: linear is enough
- Don't blindly trust manual classification by experts
- Statistics / machine learning results: test!
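Of the feature selection measures listed, chi-square is easy to show in isolation. The sketch below uses the standard 2x2 contingency formula for one term and one topic; the argument names are assumptions, and the talk does not say which exact variant TopicFinder uses.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/topic 2x2 contingency table.

    a: documents in the topic that contain the term
    b: documents outside the topic that contain the term
    c: documents in the topic without the term
    d: documents outside the topic without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or topic never (or always) occurs
    return n * (a * d - b * c) ** 2 / denominator


def correlation_sign(a, b, c, d):
    """+1 if the term is positively correlated with the topic, -1 if negatively, else 0."""
    diff = a * d - b * c
    return (diff > 0) - (diff < 0)


print(chi_square(10, 0, 0, 10), correlation_sign(10, 0, 0, 10))
```

The statistic itself is always non-negative, so the sign of a*d - b*c is what separates terms that indicate a topic from terms that speak against it.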

TopicFinder Features
- Training and test GUI needed
- Automatically identify inconsistencies in training and test data
  - Detection of duplicates
  - Similarity search (More Like This)
- Automatic testing: cross-validation (multi-threaded!)
- Not a black box: classification rules have to be readable
- False positive (and false negative) analysis
  - Iterative training
  - Clustering of false positives / false negatives

Cross-Validation: Automatically test on Data not used for Training
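The core of cross-validation is the split itself: every document is held out exactly once and is never part of the training set of the fold that tests it. A minimal k-fold sketch, with function and parameter names chosen for illustration; the multi-threaded execution mentioned earlier is not shown.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    # Fold i takes every k-th item starting at i, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


for train, test in k_fold_splits(10, 3):
    print("train:", train, "test:", test)
```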

Example Rules: Legal Taxonomy
- Age Discrimination: age discrimination^15.0 ADEA^12.2 age discrimination claim^12.5 claim of age^8.3 age discrimination employment act^6.9 age race^6.3 ...
- Discrimination: religious discrimination^44.5 muslim^23.4 alleged religious discrimination^17.6 title viii to religious^23.3 god^10.2 religion^11.7 sex^-8.6 oklahoma^-13.2 christ^-7.1 jewish^8.6 bona fide religious^18.6 ...
- Employee Status: independent contractor^14.1 economic reality^8.6 fair labour standard^3.6 collective bargain^-2.4 Oakwood Healthcare^12.1 ...
- Employee Leave: FMLA^9.13 medical leave^3.0 FMLA leave^4.4 sick leave^2.3 unpaid leave^1.1 serious health condition^2.1 ...
- Overtime: overtime^12.1 FLSA^9.7 pay overtime^5.3 overtime compensation^6.6 interference^-11.1 common policy^8.8 modest^10.7 unpaided overtime^7.9 hourly^4.1 forty hour^5.7 minimum wage^-4.3 ...

Pharmaceutical Newsletter: Highlighting Example

Performance Experience
- Training performance
  - Highly efficient SVM implementations: LibSVM and LibLinear
  - Multithreaded training with linear speedup, tested on up to 64 processors (IBM P7 architecture)
- Classification performance
  - Batch and online modes available
  - Batch mode for classifying big document collections:
    1. Build a Lucene index (50-100 GB/h of plain text on standard hardware)
    2. Execute one search for each document type (negligible)

TopicFinder: More Features
- Documents for training and testing are stored in a Lucene index
- Information about topics is stored in a separate untokenized field
- Feature selection simply consists of comparing the posting lists of topics with those of terms from the text content
- Consistency of manual topic assignment can be checked by
  - using MD5 keys for duplicate checks
  - using Lucene's similarity search to check for near duplicates
- Feature vectors are generated from Lucene posting lists
- Training is done completely by LibSVM / LibLinear
- Instead of storing support vectors, hyperplanes are stored directly
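The MD5-based consistency check can be sketched with the Python standard library: identical texts that carry different manual topic labels are flagged. The whitespace and case normalization and the (doc_id, text, topic) tuple layout are assumptions of this sketch; near duplicates, which the slide handles with Lucene's similarity search, are not covered.

```python
import hashlib
from collections import defaultdict


def md5_key(text):
    """MD5 digest of the text after lowercasing and collapsing whitespace (assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def inconsistent_duplicates(documents):
    """documents: iterable of (doc_id, text, topic).

    Returns groups of exact duplicates whose topic labels disagree.
    """
    groups = defaultdict(list)
    for doc_id, text, topic in documents:
        groups[md5_key(text)].append((doc_id, topic))
    return [g for g in groups.values() if len({topic for _, topic in g}) > 1]


docs = [
    (1, "Pay overtime", "Overtime"),
    (2, "pay   overtime", "Employee Leave"),
    (3, "sick leave", "Employee Leave"),
]
print(inconsistent_duplicates(docs))
```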

Vector-Space Model for Documents and Queries
- Document 1: The boy on the bridge
- Document 2: The boy plays chess
- Term / document matrix:

               boy  bridge  chess  the  on  plays
  Document 1     1       1      0    2   1      0
  Document 2     1       0      1    1   0      1

- Cosine similarity:
  - Queries are treated simply as very short documents
  - Full-text search: dot product of the query vector with all document vectors
  - Document score: cosine similarity
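The two example documents can be turned into term-count vectors and compared with a few lines of Python. This is a bare sketch of cosine similarity on raw term counts; whitespace tokenization and lowercasing are assumptions, and Lucene's actual scoring adds further factors that are not modelled here.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-count vector from lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Dot product of two sparse vectors divided by the product of their lengths."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = term_vector("The boy on the bridge")
doc2 = term_vector("The boy plays chess")
query = term_vector("chess")  # a query is just a very short document

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(query, doc1), cosine_similarity(query, doc2))
```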

Hyperplane Query
- Hyperplane equation: dot product of two vectors minus bias
- HyperplaneQuery: a generalized BooleanQuery
  - No coord, no idf, no queryNorm
- A complete index may be classified by one simple search
- Classifying one document:
  - Build a 1-document index
  - Apply the classification queries
- Many topics:
  - Store the queries in an index (term boosts as payloads)
  - Apply documents as queries
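The hyperplane equation amounts to a weighted sum over the terms of a document minus a bias. The sketch below evaluates it on sparse dictionaries in plain Python; it does not use Lucene, HyperplaneQuery is not reimplemented, and the weights and bias are invented for illustration.

```python
def hyperplane_score(weights, doc_vector, bias):
    """Dot product of the weight vector and the document vector, minus the bias."""
    dot = sum(w * doc_vector.get(term, 0.0) for term, w in weights.items())
    return dot - bias


def belongs_to_topic(weights, doc_vector, bias):
    """A document is assigned to the topic if it lies on the positive side."""
    return hyperplane_score(weights, doc_vector, bias) > 0


# Invented weights for a hypothetical "Overtime" topic.
weights = {"overtime": 2.0, "interference": -1.0}
bias = 1.0

print(hyperplane_score(weights, {"overtime": 1.0}, bias))
print(belongs_to_topic(weights, {"interference": 1.0}, bias))
```

Since each term contributes only its weight times its value in the document, the same sum can be evaluated as a boosted disjunction over an inverted index, which is what lets a single search classify a whole index.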

Implementation Details
- Apache Lucene (http://lucene.apache.org/)
  - Built in the late 1990s by Doug Cutting; Apache release in 2001
  - State-of-the-art Java library for indexing and ranking
  - Wide acceptance by 2005
- LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
  - Authors: Chih-Chung Chang and Chih-Jen Lin
  - NIPS 2003 feature selection challenge (third place)
  - Full SVM implementation in C++ and Java
  - License similar to the Apache License
- LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/)
  - Machine Learning Group at National Taiwan University
  - Optimized for the linear case (hyperplanes)
  - Same license as LibSVM

ifinder: Semantic Search and Navigation
- Unique search box
- Result from structured data
- Facets (correlated terms)
- Result from unstructured data
- Facets (named entities)
- Facets (other meta-data)

Tagging Service: Semantic Linking
- The Tagging Service generates semantic tags automatically. It combines:
  - Simple statistical tagging (TF*IDF) with noun phrase identification
  - Named entity recognition
  - Text classification
- Example: semantic linking for Zeit Online
  - http://www.zeit.de/schlagworte
  - http://www.zeit.de/schlagworte/themen
  - http://www.zeit.de/schlagworte/personen
  - http://www.zeit.de/schlagworte/organisationen
  - http://www.zeit.de/schlagworte/orte

Semantic Linking, ZEIT Online Keyword: Arbeit

Semantic Linking Keyword: Thomas Gottschalk