Knowledge Discovery from patents using KMX Text Analytics



Dr. Anton Heijs
anton.heijs@treparel.com
Treparel

Abstract

In this white paper we discuss how Treparel's KMX technology helps searchers of patent and research documents to classify and analyze their data automatically with a high level of classification precision and recall. With the ever-growing number of patents and research papers, it becomes crucial for large organisations to extract knowledge from all this material while keeping up with the speed at which the data grows. KMX enables researchers to automatically classify and cluster large document sets. The technology Treparel developed uses an SVM-based classification algorithm in combination with an active learning algorithm for the semi-automatic optimization of text classifiers towards very high precision and recall. Additionally, we use a proprietary clustering algorithm to enable landscape analysis with automatic annotation.

1 Introduction

Treparel develops and delivers an integrated suite of data analysis technology, called KMX(TM), that provides more reliable and accurate insight into large, complex data sets, allowing companies to work faster, cheaper and smarter. The software can be integrated with the Oracle database and Oracle's Data Mining technology, but is also able to connect to other databases and to use KMX-specific text mining algorithms. This allows for fast proof-of-concept trials. The KMX software is based on a client-server architecture and can use the compute power of servers (including cloud computing servers) for high-performance classification and clustering of very large document sets.

Treparel is an innovative software technology and solution provider in Big Data Text Analytics and Visualization. The KMX platform allows organizations to enhance innovation processes, improve competitive advantage, mitigate litigation risk and cost, and manage customer interactions by gaining insights from numerous sources of unstructured data (such as text, blogs, email, patents, research and news literature). Global companies, government agencies, software vendors and data publishers use the KMX text analysis software to gain faster, more reliable and more precise insights into large, complex, unstructured data sets through:

1. Cost savings on analysis and on the expensive experts needed to work with complex tools, thus working cheaper.
2. Shortening time-to-market by reducing analysis time, often by more than 50%, thus working faster.
3. Reducing uncertainty, and thus risk, when analyzing data and making decisions from it, thus working more reliably.
4. Optimizing business processes by providing accurate and robust steering information, thus working smarter.

The accuracy of the KMX technology allows you to optimize your revenue streams in the short term, while its robustness helps you to manage your business risks in the long term. KMX stands for Knowledge Mapping and eXploration; it is a software system for automated pattern extraction from complex data that can be applied to any knowledge domain. Instead of building one model for one pattern from a dataset (the traditional method), KMX extracts all patterns from a dataset and converts them into mathematical models.

Subsequently, KMX determines all relationships between these models and calculates whether the models are strongly or weakly related. The result is a network of models, presented visually in a way that offers complete insight into large and complex datasets. Treparel builds on many years of experience and knowledge in data discovery and visualization, and on strong scientific cooperation with leading researchers from different universities.

2 Using machine learning for determining patterns in the data

Many data mining techniques suffer from underfitting or overfitting the data. Underfitting occurs when the algorithm does not have the capacity to express the variability in the data. Overfitting is the opposite case: the algorithm has too much capacity and therefore also fits the noise present in the data. The cause of under- and overfitting is the complexity of the model, which determines how much variability in the data the model will express. If too much complexity is allowed, the variability due to noise in the data is modeled as well, producing a model that overfits the patterns in the data. If the complexity is too low, the model fails to account for the true variability in the data and therefore underfits it. One wants models of optimal complexity, controlled so as to obtain good generalisation properties: the model's ability to predict unseen data.

Whereas classical statistics deals with large-sample problems, statistical learning theory is the first theory that also addresses small-sample learning problems. The complexity of models generated with machine learning algorithms from empirical data depends on the sample size. By taking the sample size into account, one can obtain better results than by applying asymptotic results from classical statistics. Vladimir Vapnik developed statistical learning theory in the mid-1980s, and from it a new kind of learning machine, the support vector machine (SVM), was introduced. The basic idea of support vector machines is to determine a classifier or regression machine that minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error). In SVMs, the idea is to fix the empirical risk associated with an architecture and then to use a method to minimize the generalization error.

SVMs can classify both linearly separable and nonlinearly separable data. They can be used as nonlinear classifiers and regression machines by mapping the input space to a high-dimensional feature space in which linear classification can be performed. For classification, SVMs operate by finding a hypersurface in the space of possible inputs that attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hypersurface to the nearest positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical, to the training data. In machine learning this distance is referred to as the margin, and the classification surface we are looking for is therefore the maximum margin hyperplane.
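The maximum-margin idea is easy to make concrete. The following is a minimal sketch using scikit-learn on toy two-dimensional data; the library, the data and the parameter choices are illustrative assumptions on our part and not part of KMX:

    # Minimal sketch of maximum-margin classification with a linear SVM.
    # scikit-learn and the toy data are illustrative assumptions, not KMX code.
    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable point clouds in 2-D.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
    y = np.array([0] * 20 + [1] * 20)

    # A linear SVM with a large C approximates the hard-margin classifier:
    # it selects the hyperplane w.x + b = 0 with the widest margin.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]
    print("margin width:", 2.0 / np.linalg.norm(w))
    print("support vectors:", clf.support_vectors_)  # points on the margin

The support vectors are exactly the training points lying on the margin; all other points could be removed without changing the classifier.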
2.1 Text classification using support vector machines

Text classification techniques can be divided into rule-based approaches, machine learning techniques, and techniques based on natural language processing. The first group works on simple, well-defined cases but in general risks failing to generalize to real-world data. The last approach requires very detailed knowledge of the language, which is still a research area in its own right, and needs detailed input knowledge (such as a thesaurus) for each text classification case. Machine learning techniques have proved able to produce very good results, in many cases at least as good as or much better than other techniques, and are known to generalize very well across many different cases with different requirements. This holds even more for support vector machines (SVMs), a subset of machine learning techniques. Machine learning for text classification is characterized by the following requirements:

Large input space. The input space consists of the documents and the words within those documents, where the text is represented using the vector space model: for each document, a vector contains the frequency distribution of the relevant words within that document. The dimensionality of this data can be high (in number of words), even with strong simplifications. Weighting the words using approaches such as tf-idf is important for a good vector space representation of all the documents (a minimal sketch follows this list).

Little training data. For most learning algorithms, the number of training examples required to produce accurate classifications scales with the dimensionality of the input space. For accurate text classifiers one normally has fewer training examples than dimensions in the feature space, so it is important to be able to determine the optimal text classifier given certain classification requirements. We start with a small set of training documents and use active learning techniques to optimize the classifier by providing more training data (a sketch of this loop also follows the list).

Noise. Most documents contain spelling mistakes and words that do not contribute to the information in the document. This is noise to the machine learning algorithms, so it is important to improve the signal-to-noise ratio of the feature vectors in the preprocessing phase.

Complex learning tasks. Classification of text is generally based on the human semantic understanding of natural language, in which the interpretation of the text and the context of the words are important. Text classifiers must be able to approximate such complex tasks while remaining accurate and robust.

Computational efficiency. Training text classifiers in a high-dimensional input space requires many computations, and for practical approaches it is important to handle large numbers of features efficiently.

Support vector machines are currently a very good approach: they are computationally efficient, and there is a well-defined theory describing their accuracy and robustness. More information can be found in Vapnik's book [11] and in [1, 3, 4, 5, 6, 7, 10], in the books of Scholkopf [9] and Cristianini [2], and, for text mining, in the book of Joachims [8].
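As a concrete illustration of the first requirement, the sketch below builds tf-idf document vectors and trains a linear SVM on them with scikit-learn; the library, the toy corpus and the labels are illustrative assumptions, not the KMX pipeline:

    # Minimal sketch of the vector space model with tf-idf weighting feeding
    # a linear SVM text classifier. scikit-learn and the toy corpus are
    # illustrative assumptions; KMX's own pipeline is proprietary.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = [
        "battery cell electrode lithium charge",       # hypothetical snippets
        "anode cathode electrolyte lithium cell",
        "antenna signal transmitter frequency radio",
        "receiver modulation frequency wireless signal",
    ]
    labels = ["energy", "energy", "telecom", "telecom"]

    # TfidfVectorizer builds high-dimensional sparse document vectors;
    # LinearSVC trains the maximum-margin classifier on top of them.
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(docs, labels)
    print(clf.predict(["lithium electrode charge density"]))  # expect 'energy'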

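The active learning loop mentioned under "Little training data" can likewise be sketched. The uncertainty-sampling variant below is a common technique from the literature, with synthetic features standing in for document vectors; KMX's actual active learning algorithm is proprietary and may differ:

    # Sketch of pool-based active learning with uncertainty sampling: ask an
    # expert to label the documents the current SVM is least certain about.
    # The synthetic features and the oracle are stand-ins for illustration.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X_pool = rng.randn(200, 10)                  # unlabeled document features
    oracle = (X_pool[:, 0] > 0).astype(int)      # normally a human expert

    labeled = list(range(10))                    # small initial training set
    for _ in range(5):                           # a few labeling rounds
        clf = LinearSVC().fit(X_pool[labeled], oracle[labeled])
        dist = np.abs(clf.decision_function(X_pool))  # distance to hyperplane
        dist[labeled] = np.inf                   # never re-ask labeled items
        ask = np.argsort(dist)[:5]               # 5 most uncertain documents
        labeled.extend(ask.tolist())             # the "expert" labels them

Each round spends the expert's labeling effort where it improves the classifier most, which is why a high precision and recall can be reached from a small initial training set.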
3 KMX Text Analytics solution

Treparel's KMX text analytics solution is a client-server software platform. The platform can grow in performance and functionality to deal with large numbers of patents and to address more complex and demanding tasks. The KMX API makes the system open for integration with existing technologies. KMX consists of a server and a client component. The server is based on Linux and supports the Oracle 11g database and data mining technology. The client GUI is a native Windows application, of which a screenshot is shown below (Figure 1).

The solution is a very flexible and scalable system in terms of performance and system management. Its scalability allows it to handle both the growing amount of data and the growing complexity of that data at predictable cost. This configuration provides:

- High-performance text classification, especially strong on patent classification
- Support for multiple users, data sets and classifiers
- Rapid deployment of the system
- Remote administration and installation of software updates, with support for automated back-up

The components of KMX patent analytics are:

- Import of patent data into multiple columns in the database.
- Categorization into one or multiple categories, with statistics on precision and recall for each category to which a patent is assigned.
- Optimization of the categorization using:
  - document feature vector length;
  - weighted processing of the parts of the patent text (title, abstract, description, claims);
  - support for stop words and synonyms;
  - statistics of multiple classification approaches;
  - ROC curves created using n-fold cross-validation (sketched below).
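The cross-validated ROC evaluation in the last item can be sketched as follows; scikit-learn, the synthetic data and the five-fold choice are illustrative assumptions, not KMX internals:

    # Sketch of an ROC estimate from n-fold cross-validation: every document
    # is scored by a classifier that never saw it during training. The
    # synthetic data and scikit-learn calls are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

    # Out-of-fold SVM decision scores from 5-fold cross-validation.
    scores = cross_val_predict(LinearSVC(), X, y, cv=5,
                               method="decision_function")
    fpr, tpr, _ = roc_curve(y, scores)           # the ROC curve itself
    print("cross-validated AUC:", roc_auc_score(y, scores))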
The KMX architecture consists of a KMX server, a KMX kernel and KMX applications. The kernel is responsible for all information processing functionality, e.g. data analysis and model generation. The KMX server embeds the KMX kernel and makes its functionality available to KMX client applications. The KMX client toolkit is used to build KMX clients: the interfaces our clients use to access our information processing solutions. The KMX server supplies base functionality such as an XML-RPC or REST API, user authentication, user-to-client mapping and a plugin infrastructure. It is a multiprocess, transaction-based architecture that can scale across multiple CPUs and machines. The KMX server supplies all the infrastructure for uploading large input datasets, downloading large result datasets, authentication, and all concurrency handling. A KMX client makes requests to the KMX server, which immediately checks each request against the authorisation database using the full details of the client request.
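This white paper does not document the API itself, so the endpoint, payload and authentication scheme in the sketch below are purely hypothetical; it only illustrates what a client request to such a REST server could look like:

    # Hypothetical sketch of a KMX REST client request. The base URL, the
    # /classify endpoint, the payload fields and the bearer token are all
    # invented for illustration; the real KMX API is not documented here.
    import requests

    BASE = "https://kmx.example.com/api"         # hypothetical server URL

    def classify(document_ids, classifier_id, token):
        """Submit documents to a (hypothetical) classification endpoint."""
        resp = requests.post(
            f"{BASE}/classify",                  # hypothetical endpoint
            json={"documents": document_ids, "classifier": classifier_id},
            headers={"Authorization": f"Bearer {token}"},  # server-side auth
            timeout=30,
        )
        resp.raise_for_status()                  # surface authorisation errors
        return resp.json()                       # e.g. per-document scores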

Figure 1: Overview of the KMX Patent Analytics GUI, showing patent titles and their labels, the cluster visualization, a section of the full text of a selected patent (see the cross hair in the visualization), and the brushes (green and red) indicating the training documents of the classifier. The classification score in the patent landscape runs from blue (positive) to yellow (negative); the training documents are indicated by the colors of the brushes (green and red).

About Treparel

Treparel is a leading global software provider in Big Data Text Analytics and Visualization. The KMX platform allows organizations to enhance innovation processes, improve competitive advantage, mitigate litigation risk and cost, and manage customer interactions by gaining insights from numerous sources of unstructured data (text, application notes, images, blogs, email and patents). Global companies, government agencies, software vendors and data publishers use the Treparel KMX text analysis software to gain faster, more reliable and more precise insights into large, complex, unstructured data sets, allowing them to make better-informed decisions. For more information contact info@treparel.com or go to http://www.treparel.com

References

[1] C.J.C. Burges. Simplified support vector decision rules. 1996.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[4] F. Debole and F. Sebastiani. An analysis of the relative difficulty of Reuters-21578 subsets.
[5] C.J. Fall, A. Törcsvári, K. Benzineb, and G. Karetka. Automated categorization in the international patent classification.
[6] M.J. Fisher, J.E. Fieldsend, and R.M. Everson. Precision and recall optimization for information access tasks.
[7] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289-1305, 2003.
[8] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, ISBN 0-7923-7679-X.
[9] B. Schölkopf, C.J.C. Burges, and A.J. Smola. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[10] V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9, pages 281-287. Morgan Kaufmann Publishers, San Mateo, CA, 1997.
[11] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.