Knowledge Discovery from Patents using KMX Text Analytics

Dr. Anton Heijs, anton.heijs@treparel.com, Treparel

Abstract

In this white paper we discuss how Treparel's KMX technology helps searchers of patent and research documents to classify and analyze their data automatically with a high level of classification precision and recall. With the ever growing amount of patents and research papers it becomes crucial for large organisations to extract knowledge from all this data whilst keeping up with the speed at which it grows. The Treparel technology enables researchers to automatically classify and cluster large document sets. The technology Treparel developed uses an SVM-based classification algorithm in combination with an active learning algorithm for the semi-automatic optimization of text classifiers towards very high precision and recall. Additionally we use a proprietary clustering algorithm to enable landscape analysis with automatic annotation.

1 Introduction

Treparel develops and delivers a new suite of integrated data analysis technology, called KMX™, that provides solutions for more reliable and accurate insight into large complex data sets, allowing companies to work faster, cheaper and smarter. The Treparel software can be integrated with the Oracle database and Oracle's Data Mining technology, but is also able to connect to other databases and use KMX-specific text mining algorithms. This allows for fast proof-of-concept trials. The KMX software is based on a client-server architecture and can use the compute power of servers (including cloud computing servers) for high performance classification and clustering of very large document sets.

Treparel is an innovative software technology and solution provider in Big Data Text Analytics and Visualization. The KMX platform allows organizations to enhance innovation processes, improve competitive advantage, mitigate litigation risk and cost, and manage interactions with customers by gaining insights from numerous sources of unstructured data (such as text, blogs, email, patents, research and news literature). Global companies, government agencies, software vendors and data publishers use the KMX text analysis software to gain faster, more reliable and more precise insights into large complex unstructured data sets through:

1. Cost savings on analysis and on the expensive experts needed to work with complex tools, thus working cheaper.
2. Shortening their time-to-market by reducing the analysis time, often by more than 50%, thus working faster.
3. Reducing uncertainty, and thus the risk, when analyzing data and making decisions from the data, thus working more reliably.
4. Optimizing their business processes by providing accurate and robust steering information, thus working smarter.

The accuracy of the results of the innovative KMX technology allows you to optimize your revenue streams in the short term, while its robustness helps you to better manage your business risks in the long term. KMX stands for Knowledge Mapping and eXploration and is a software system for automated pattern extraction from complex data which can be applied to any knowledge domain. Instead of building one model for one pattern from a dataset, the traditional method, KMX extracts all patterns
from a dataset and converts them into mathematical models. Subsequently KMX determines all relationships between these models and calculates whether the models are strongly or weakly related. The result is a network of models which is then presented visually in such a way that it offers a complete insight into complex and large datasets. Treparel builds on many years of experience and knowledge in data discovery and visualization and on a strong scientific cooperation with leading researchers from different universities.

2 Using machine learning for determining patterns in the data

Many data mining techniques suffer from underfitting or overfitting the data. Underfitting occurs when the algorithm used does not have the capacity to express the variability in the data. Overfitting is the opposite case: the algorithm has too much capacity and therefore also fits the noise present in the data. The cause of under- and overfitting is the complexity that the model represents, which determines how much of the variability in the data the model will express. If too much complexity is allowed, the variability due to noise in the data is modeled as well, which yields a model that overfits patterns in the data. If the complexity is too low, the model will fail to account for the true variability in the data and therefore underfit it. One wants to use models of the optimal complexity and to be able to control them to obtain good generalisation properties: the model's ability to predict unseen data.

Where classical statistics deals with large sample size problems, statistical learning theory is the first theory that is also able to address small sample learning problems. The complexity of models generated with machine learning algorithms from empirical data depends on the sample size. By taking the sample size into account one can obtain better results than by applying asymptotic results from classical statistics. Vladimir Vapnik developed statistical learning theory in the mid-1980s, from which a new kind of learning machine, the support vector machine (SVM), was introduced. The basic idea of support vector machines is to determine a classifier or regression machine that minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error). In SVMs, the idea is to fix the empirical risk associated with an architecture and then to use a method to minimize the generalization error.

SVMs can be used to classify both linearly separable and nonlinearly separable data. They can be used as nonlinear classifiers and regression machines by mapping the input space to a high dimensional feature space, in which linear classification can be performed. For classification, SVMs operate by finding a hyper surface in the space of possible inputs. This hyper surface attempts to split the positive examples from the negative examples, and the split is chosen to have the largest distance from the hyper surface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical, to the training data. In machine learning, this distance is referred to as the margin, and the classification surface we are looking for is therefore the maximum margin hyperplane.
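To make the maximum margin idea concrete, the following is a minimal sketch of a linear SVM on a small two-dimensional toy dataset, assuming scikit-learn and NumPy are available; the data points and parameters are illustrative only and are not part of the KMX implementation.

    # Minimal sketch: a linear SVM finding the maximum margin hyperplane
    # on a toy 2-D dataset (illustrative data, not the KMX implementation).
    import numpy as np
    from sklearn.svm import SVC

    # Two small, linearly separable groups of points: positive (+1) and negative (-1).
    X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)

    w = clf.coef_[0]                    # normal vector of the separating hyperplane
    margin = 2.0 / np.linalg.norm(w)    # width of the maximum margin
    print("support vectors:\n", clf.support_vectors_)
    print("margin width: %.3f" % margin)
    print("prediction for [2.0, 2.0]:", clf.predict([[2.0, 2.0]]))

The support vectors are the training points closest to the hyperplane; only these points determine the classifier, which is one reason SVMs remain accurate with relatively few training examples.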
2.1 Text classification using support vector machines

Text classification techniques can be divided into rule-based approaches, machine learning techniques and techniques based on natural language processing. The first group works on simple, well defined cases, but in general there is the risk of not being able to generalize well for real world data. The last approach requires very detailed knowledge of the languages involved, which is still a research area on its own, and it needs detailed input knowledge (such as a thesaurus) for each text classification case. Machine learning techniques have proved able to produce very good results and in many cases are at least as good as, or much better than, any other technique. It is also known that machine learning techniques generalize very well across many different cases with different requirements. These features are even more true for support vector machines (SVMs), which are a subset of machine learning techniques. Machine learning for text classification has a set of requirements that can be characterized by the following properties:

Large input space: The input space consists of the documents and the words within those documents, whereby the text data is represented using the vector space model. For each document a vector representation is used which contains the frequency distribution of the relevant words within the document. The dimensionality of the data can be high (in number of words), even with strong simplifications. Weighting of the words using approaches such as tf-idf is important for a good representation of the vector space model across all the documents.
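As an illustration of the vector space model and tf-idf weighting described above, the following is a minimal sketch assuming scikit-learn; the example documents are invented for illustration.

    # Minimal sketch: representing documents as tf-idf weighted vectors
    # (vector space model). The example documents are illustrative only.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "A method for classifying patent documents using support vector machines.",
        "An apparatus for clustering large collections of research papers.",
        "Support vector machines applied to the classification of patent text.",
    ]

    # Each document becomes a sparse, high-dimensional vector with one dimension
    # per term; common stop words are removed during preprocessing.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    print("vocabulary size:", len(vectorizer.get_feature_names_out()))
    print("document-term matrix shape:", X.shape)  # (number of documents, number of terms)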
Little training data: For most learning algorithms, the number of training examples required to produce accurate classifications scales with the dimensionality of the input space. For accurate text classifiers one normally has fewer training examples than dimensions in the feature space, and therefore it is important to be able to determine the optimal text classifier given certain classification requirements. We start with a small set of training documents and use active learning techniques to optimize a classifier by providing more training data.

Noise: Most documents contain spelling mistakes and words that do not contribute to the information contained in the document. This is noise to the machine learning algorithms. It is therefore important to improve the signal-to-noise ratio of the feature vectors in the preprocessing phase.

Complex learning tasks: Classification of text is generally based on the semantic understanding of natural language by humans, whereby the interpretation of the text and the context of the words are important. Text classifiers must be able to approximate such complex tasks while remaining accurate and robust.

Computational efficiency: Training text classifiers in a high dimensional input space requires a lot of computation, and for practical approaches it is important to be able to handle large numbers of features efficiently.

Support vector machines are currently a very good approach which is computationally efficient and for which there is a well-defined theory describing its behaviour with respect to accuracy and robustness. More information can be found in Vapnik's book [11] and in [1, 3, 4, 5, 6, 7, 10], or in the books of Scholkopf [9], Cristianini [2] and, for text mining, the book of Joachims [8].

3 KMX Text Analytics solution

Treparel's KMX text analytics solution is a client-server based software platform. This platform can, in the future, grow in performance and functionality to deal with a larger number of patents and to address more complex and demanding tasks. The KMX API makes the system open for integration with existing technologies. KMX consists of a server and a client component. The server is based on Linux and supports the Oracle 11g database and data mining technology. The client GUI is a native Windows application, of which a screenshot is shown below. The solution comes as a very flexible and scalable system in terms of performance and system management. The scalability of the solution allows it to handle both the growing amount of data and the growing complexity of the data at hand at predictable cost. This configuration provides:

- High performance text classification, especially strong on patent classification
- Support for multiple users, data sets and classifiers
- Rapid deployment of the system
- Remote administration, installation of software updates and support for automated back-up

Components of KMX patent analytics are:

- Import of patent data into multiple columns in the database.
- Categorization into one or multiple categories, with statistics on precision and recall for each category to which a patent is assigned.
- Optimization of the categorization using:
  - document feature vector length;
  - weighted processing of parts of the patent text (title, abstract, description, claims);
  - support for the usage of stop words and synonyms;
  - statistics of multiple classification approaches;
  - ROC curves created using n-fold cross-validation (see the sketch after this list).
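The kind of per-category statistics mentioned in the list above can be illustrated with a hedged sketch: a linear SVM text classifier whose precision and recall are estimated with n-fold cross-validation using scikit-learn. The toy corpus, labels and parameters are invented for illustration; this is not the KMX implementation itself.

    # Minimal sketch: precision and recall of a linear SVM text classifier,
    # estimated with n-fold cross-validation (illustrative data only).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy corpus: label 1 = relevant to "coating" technology, 0 = not relevant.
    docs = [
        "polymer coating applied to a metal substrate",
        "anti-corrosive coating composition for steel surfaces",
        "method of depositing a thin film coating on glass",
        "protective coating for turbine blades",
        "coating apparatus with a rotating spray nozzle",
        "multilayer optical coating for lenses",
        "wireless communication protocol for sensor networks",
        "battery management system for electric vehicles",
        "image compression using discrete cosine transforms",
        "gene expression analysis in plant cells",
        "database indexing method for large tables",
        "speech recognition using hidden markov models",
    ]
    labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

    classifier = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))

    # 3-fold cross-validation; each fold reports precision and recall for the
    # positive ("coating") category.
    scores = cross_validate(classifier, docs, labels, cv=3,
                            scoring=("precision", "recall"))
    print("precision per fold:", scores["test_precision"])
    print("recall per fold:", scores["test_recall"])

Averaging such per-fold scores gives the per-category precision and recall figures reported alongside a classification run.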
The KMX architecture consists of a KMX server, a KMX kernel and KMX applications. The kernel is responsible for all information processing functionality, e.g. data analysis and model generation. The KMX server embeds the KMX kernel and makes its functionality available to KMX client applications. The KMX client toolkit is used to build KMX clients; these are the interfaces that our clients use to access our information processing solutions. The KMX server supplies base functionality such as an XML-RPC or REST API, user authentication, user-to-client mapping and a plugin infrastructure. It is a multiprocess, transaction-based architecture that is able to scale across multiple CPUs and machines. The KMX server also supplies all infrastructure for uploading large input datasets, downloading large result datasets, authentication and the handling of concurrency.
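As an illustration of how a client could talk to such a server-side API, the following is a hypothetical sketch using Python's standard xmlrpc.client module; the endpoint URL and the method names (login, upload_dataset, classify, job_status) are assumptions made for illustration and do not describe the actual KMX API.

    # Hypothetical sketch of an XML-RPC style client call sequence.
    # The URL and method names below are illustrative assumptions only.
    import xmlrpc.client

    # Assumed endpoint; in practice this would be supplied by the KMX server administrator.
    server = xmlrpc.client.ServerProxy("https://kmx.example.com/rpc")

    # Assumed calls: authenticate, upload a document set, start a classification
    # job and poll for its status.
    session = server.login("analyst", "secret")
    dataset_id = server.upload_dataset(session, "patents.csv")
    job_id = server.classify(session, dataset_id, "my-classifier")
    print(server.job_status(session, job_id))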
Figure 1: Overview of the KMX Patent Analytics GUI showing patent titles and their labels, the cluster visualization, a section of the full text of a selected patent (see the cross hair in the visualization) and the brushes (green and red) indicating the training documents of the classifier. The classification score is shown from blue (positive) to yellow (negative) in the patent landscape. The training documents are indicated by the color of the brushes (green and red).

A KMX client makes requests to the KMX server, which immediately checks each request against the authorisation database using the full details of the client request.

About Treparel

Treparel is a leading global software provider in Big Data Text Analytics and Visualization. The KMX platform allows organizations to enhance innovation processes, improve competitive advantage, mitigate litigation risk and cost, and manage interactions with customers by gaining insights from numerous sources of unstructured data (text, application notes, images, blogs, email and patents). Global companies, government agencies, software vendors and data publishers are using the Treparel KMX text analysis software to gain faster, more reliable and more precise insights into large complex unstructured data sets, allowing them to make better informed decisions. For more information contact info@treparel.com or go to http://www.treparel.com

References

[1] C.J.C. Burges. Simplified support vector decision rules, 1996.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[4] F. Debole and F. Sebastiani. An analysis of the relative difficulty of Reuters-21578 subsets.
[5] C.J. Fall, A. Törcsvári, K. Benzineb and G. Karetka. Automated categorization in the international patent classification.
[6] M.J. Fisher, J.E. Fieldsend and R.M. Everson. Precision and recall optimization for information access tasks.
[7] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289-1305, 2003.
[8] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, ISBN 0-7923-7679-X.
[9] B. Schölkopf, C.J.C. Burges and A.J. Smola. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[10] V. Vapnik, S.E. Golowich and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9, pages 281-287, San Mateo, CA, 1997. Morgan Kaufmann Publishers.
[11] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.