Text Classification based on Lucene and LibSVM / LibLinear. Berlin Buzzwords, June 4th, 2012, Dr. Christoph Goller, IntraFind Software AG




Outline
- IntraFind Software AG
- Introduction to Text Classification
  - What is it?
  - Applications
  - Lessons Learned
  - Required Features
- Implementation Details
  - Lucene, LibSVM / LibLinear
  - Feature Selection & Training
  - Production Phase: HyperplaneQuery


IntraFind Software AG
- Founding of the company: October 2000
- More than 700 customers, mainly in Germany, Austria, and Switzerland
- Partner network (> 30 VAR & embedding partners)
- Employees: 30
- Lucene committers: B. Messer, C. Goller

Our Open Source Search Business:
- Product company: ifinder, Topic Finder, Knowledge Map, Tagging Service
- Products are a combination of open source components and in-house development
- Support (up to 7x24), services, training, stable API

Automatic Generation of Semantics:
- Linguistic analyzers for most European languages
- Semantic search
- Named entity recognition
- Text classification
- Clustering

www.intrafind.de/jobs

Introduction to Text Classification
Goal: Automatically assign documents to topics based on their content. Topics are defined by example documents.

Applications:
- News: newsletter management system
- Spam filtering; mail / email classification
- Product classification (online shops), ECLASS / UNSPSC
- Subject area assignment for libraries & publishing companies
- Opinion mining / sentiment detection
- Part of our Tagging Services

Text Classification Workflow
[Figure: workflow diagram with the following boxes]

Learning phase:
- Documents with topic / class labels
- Tokenizer / analyzer
- Indexing
- Feature extraction / selection
- Feature vectors of documents with topic labels
- Pattern recognition method
- Classifier parameters for topics 1..N

Classification phase:
- New document
- Feature vector of document
- Topic classifier
- Topic associations
- User

Lessons Learned
Analysis / Tokenization:
- Normalization (e.g. morphological analyzers) and stopwords improve classification

Feature Selection:
- TF*IDF, mutual information, covariance / chi-square, ...
- Multiword phrases, positive & negative correlation

Machine Learning:
- Goal: good generalization
- Avoid overfitting: "entia non sunt multiplicanda praeter necessitatem" (entities must not be multiplied beyond necessity; Occam's Razor)
- SVM: linear is enough

Don't trust blindly in:
- Manual classification by experts
- Statistics / machine learning results

Test!
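Of the feature selection measures listed above, chi-square can be computed from four document counts per (term, topic) pair. The following is a minimal sketch of the standard 2x2 statistic, not the implementation behind these slides; the function names and the example counts are hypothetical. The sign helper illustrates the "positive & negative correlation" point: the statistic itself is always non-negative, so the direction has to be read off separately.

```python
def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square statistic for a 2x2 term/topic contingency table.

    n11: documents in the topic that contain the term
    n10: documents outside the topic that contain the term
    n01: documents in the topic that do not contain the term
    n00: documents outside the topic that do not contain the term
    """
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denominator == 0:
        # The term or the topic covers all documents or none: no information.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def correlation_sign(n11: int, n10: int, n01: int, n00: int) -> int:
    """+1 if the term is over-represented in the topic, -1 if under-represented, 0 if neither."""
    diff = n11 * n00 - n10 * n01
    return (diff > 0) - (diff < 0)


# Hypothetical counts: the term occurs in 40 of 50 topic documents
# and in 10 of 50 documents outside the topic.
score = chi_square(40, 10, 10, 40)
sign = correlation_sign(40, 10, 10, 40)
```

Terms can then be ranked by score per topic, keeping the sign so that strongly negative indicators are not discarded.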

Required Features
- Training & test GUI needed
- Automatically identify inconsistencies in training & test data
  - Duplicate detection
  - Similarity search (More Like This)
- Automatic testing: cross-validation (multi-threaded!)
- Classification rules have to be readable
- False positive and false negative analysis, iterative training
- Clustering of false positives / false negatives
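The cross-validation requirement can be sketched as follows. This is only an illustration under stated assumptions: `train_majority` is a hypothetical stand-in for real training (it just remembers the most frequent label), and the folds are evaluated in a thread pool to mirror the "multi-threaded" remark. In CPython, threads do not speed up CPU-bound training, so a process pool would be the usual substitute there.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def k_fold_indices(n_docs, k):
    """Split document indices 0..n_docs-1 into k disjoint test folds (requires k <= n_docs)."""
    return [list(range(fold, n_docs, k)) for fold in range(k)]


def train_majority(labels):
    """Hypothetical stand-in for real training: return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def evaluate_fold(labels, test_idx):
    """Train on everything outside the fold, test on the fold, return accuracy."""
    held_out = set(test_idx)
    train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
    predicted = train_majority(train_labels)
    hits = sum(1 for i in test_idx if labels[i] == predicted)
    return hits / len(test_idx)


def cross_validate(labels, k=5, workers=4):
    """Evaluate all k folds concurrently and return one accuracy per fold."""
    folds = k_fold_indices(len(labels), k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda fold: evaluate_fold(labels, fold), folds))
```

Per-fold accuracies, rather than a single average, make it easier to spot folds that contain inconsistently labeled documents.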

Product Classification: Example Rules
Server: einbauschächte^24.7 speicherspezifikation^22.1 tastatur^-0.7 monitortyp^21.5 socket^-9.2 -1.15
Workstation: monitortyp^28.8 arbeitsstation^38.8 cpu^0.1 tower^8.9 barebone^35.8 audio^3.7 eingang^5.2 out^6.5 core^9.0 agp^5.2 -2.1
PC: kleinbetrieb^7.9 personal^18.3 db-25^2.2 technology^5.6 cache^10.0 arbeitsstation^-28.1 dynamic^7.4 bereitgestelltes^25.7 dmi^5.5 ata-100^13.7 socket^6.2 wireless^2.5 16x^10.0 1/2h^13.1 nvidia^1.0 din^4.6 tasten^13.4 international^7.2 802.1p^8.1 level^-4.4 -1.5
Notebook: eingabeperipheriegeräte^64.0 1.3
Tablet PC: tc4200^16.4 tablet^6.9 konvertibel^10.6 multibay^4.6 itu^3.3 abb^2.7 digitalstift^8.5 flugzeug^1.8 1.75
Handheld: bildschirmauflösung^39.8 smartphone^8.1 ram^0.29 speicherkarten^0.53 telefon^0.35 -1.4
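Rules in this `term^weight` notation are easy to inspect programmatically. The sketch below parses the weighted terms of a rule and lists the terms with the largest absolute weight. It is an illustration only: the function names are hypothetical, and the unlabeled trailing number at the end of each rule is deliberately ignored because the slide does not say what it denotes.

```python
import re

# A term is any run of non-whitespace characters, followed by '^' and a signed decimal.
RULE_TERM = re.compile(r"(\S+)\^(-?\d+(?:\.\d+)?)")


def parse_rule(rule_text):
    """Parse 'term^weight term^weight ...' into a {term: weight} dict."""
    return {term: float(weight) for term, weight in RULE_TERM.findall(rule_text)}


def strongest_terms(weights, n=3):
    """Return the n terms with the largest absolute weight."""
    return sorted(weights, key=lambda term: abs(weights[term]), reverse=True)[:n]


workstation = parse_rule(
    "monitortyp^28.8 arbeitsstation^38.8 cpu^0.1 tower^8.9 barebone^35.8"
)
top = strongest_terms(workstation)
```

Sorting by absolute weight surfaces both the strongest positive and the strongest negative indicators, which is what a reviewer of a rule usually wants to see first.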

Pharmaceutical Newsletter: Highlighting Example

Lucene, LibSVM & LibLinear
Apache Lucene (http://lucene.apache.org/):
- Built in the late 90s by Doug Cutting; Apache release 2001
- State-of-the-art Java library for indexing and ranking
- Wide acceptance by 2005

LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/):
- Authors: Chih-Chung Chang and Chih-Jen Lin
- NIPS 2003 feature selection challenge (third place)
- Full SVM implementation in C++ and Java
- License similar to the Apache License

LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/):
- Machine Learning Group at National Taiwan University
- Optimized for the linear case (hyperplanes)
- Same license as LibSVM

Feature Selection and Training
- Training and test documents are stored in a Lucene index
- Information about topics is stored in a separate untokenized field
- Feature selection simply consists of comparing posting lists of topics and terms from the text content
- Consistency of manual topic assignment can be checked by using
  - MD5 keys for duplicate checks
  - Lucene's similarity search for checking for near duplicates
- Feature vectors are generated from Lucene posting lists
- Training is completely done by LibSVM / LibLinear
- Instead of storing support vectors, hyperplanes are stored directly
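The two index-based steps above, comparing posting lists and checking duplicates via MD5 keys, can be sketched without Lucene by using plain Python sets as posting lists. Everything here is an assumption made for illustration: the document layout (`text`, `terms`, `topics`), the function names, and the example documents are hypothetical and do not reflect the actual index schema.

```python
import hashlib
from collections import defaultdict


def posting_lists(docs, field):
    """Map each value of `field` (a list of terms or topics) to the set of document ids."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for value in doc[field]:
            postings[value].add(doc_id)
    return postings


def cooccurrence_counts(term_docs, topic_docs, n_docs):
    """2x2 counts (n11, n10, n01, n00) from a term's and a topic's posting list."""
    n11 = len(term_docs & topic_docs)   # term present, in topic
    n10 = len(term_docs - topic_docs)   # term present, not in topic
    n01 = len(topic_docs - term_docs)   # term absent, in topic
    n00 = n_docs - n11 - n10 - n01      # term absent, not in topic
    return n11, n10, n01, n00


def conflicting_duplicates(docs):
    """Group documents by the MD5 key of their text; return groups whose topics differ."""
    by_digest = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        by_digest[digest].append(doc_id)
    return [ids for ids in by_digest.values()
            if len({tuple(sorted(docs[i]["topics"])) for i in ids}) > 1]


# Hypothetical training documents; documents 0 and 2 are exact duplicates
# that were assigned to different topics.
EXAMPLE_DOCS = [
    {"text": "tablet with digital pen", "terms": ["tablet", "pen"], "topics": ["Tablet PC"]},
    {"text": "smartphone with pen", "terms": ["smartphone", "pen"], "topics": ["Handheld"]},
    {"text": "tablet with digital pen", "terms": ["tablet", "pen"], "topics": ["Handheld"]},
]
```

The four counts are exactly the input that measures such as chi-square or mutual information need, so feature selection reduces to set operations on posting lists.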

Vector Space Model for Documents and Queries
Vector space model:
Document 1: The boy on the bridge
Document 2: The boy plays chess

Term / document matrix:
            boy  bridge  chess  the  on  plays
Document 1   1     1       0     2    1    0
Document 2   1     0       1     1    0    1

Cosine similarity: sim(q, d) = (q · d) / (|q| |d|)
- Queries are treated simply as very short documents
- Full-text search: dot product of the query vector with all document vectors
- Document score: cosine similarity
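The two example documents can be turned into term vectors and compared with a query in a few lines. This is a minimal sketch that uses raw term counts (no TF*IDF weighting) and whitespace tokenization; the function names and the query "boy chess" are assumptions chosen for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on whitespace, and count term frequencies."""
    return Counter(text.lower().split())


def dot(u, v):
    """Dot product of two sparse term vectors."""
    return sum(weight * v.get(term, 0) for term, weight in u.items())


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


doc1 = term_vector("The boy on the bridge")
doc2 = term_vector("The boy plays chess")
query = term_vector("boy chess")  # a query is just a very short document

scores = {"Document 1": cosine(query, doc1), "Document 2": cosine(query, doc2)}
```

Document 2 scores higher because it matches both query terms, while Document 1 matches only "boy" and is additionally penalized for its greater vector length.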

Hyperplane Query
- Hyperplane equation: dot product of two vectors minus bias
- HyperplaneQuery: generalized BooleanQuery
  - no coord, no idf, no queryNorm
- A complete index may be classified by one simple search
- Classifying one document:
  - build a 1-document index
  - apply classification queries
- Many topics:
  - store queries in the index (term boosts as payloads)
  - apply documents as queries
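The scoring idea behind such a query can be sketched independently of Lucene: the score is the plain sum of term weight times feature value, minus the bias, with no additional ranking factors. The sketch below is not the HyperplaneQuery implementation; the function names, the weights, the biases, and the decision threshold of zero are all assumptions chosen for illustration.

```python
def hyperplane_score(weights, bias, doc_vector):
    """Dot product of hyperplane weights and document vector, minus the bias."""
    return sum(w * doc_vector.get(term, 0.0) for term, w in weights.items()) - bias


def classify(doc_vector, topics):
    """Return every topic whose hyperplane score is positive (assumed threshold: 0)."""
    return [name for name, (weights, bias) in topics.items()
            if hyperplane_score(weights, bias, doc_vector) > 0]


# Hypothetical hyperplanes: (term weights, bias) per topic.
TOPICS = {
    "Tablet PC": ({"tablet": 6.9, "digitalstift": 8.5}, 5.0),
    "Handheld": ({"smartphone": 8.1, "telefon": 0.35}, 5.0),
}

# Feature vector of one new document (here: binary term presence).
document = {"tablet": 1.0, "digitalstift": 1.0, "telefon": 1.0}
assigned = classify(document, TOPICS)
```

With many topics, looping over the document's terms and looking up the stored weights per topic, as in "apply documents as queries", avoids evaluating every topic's full term list against each document.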

Questions?
Dr. Christoph Goller, Director Research
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: christoph.goller@intrafind.de
Web: www.intrafind.de

IntraFind Software AG
Landsberger Straße 368
80687 München
Germany
www.intrafind.de/jobs