Big Data Text Mining and Visualization. Anton Heijs

Copyright 2007 by Treparel Information Solutions BV. Neither this report nor any part of it may be copied, circulated, or quoted without prior written approval from Treparel.
Treparel Information Solutions BV, Delftechpark 26, Suite 2-26, 2628 XH Delft, Netherlands. info@treparel.com

Overview
- Challenges for Big Data analytics
- Machine learning
- Clustering
- Classification
- Visualization
- Analyzing more data and capturing its context

Big Data Analytics
Drivers for Big Data analytics:
- Data grows fast, and 80% of the data is text
- Fewer people, with less time for in-depth analysis
- Growing need for data-driven decisions
- The meaning and implications of patterns in the data are key
Knowledge discovery from text data provides:
- More information in the long tail of big data (Zipf's law)
- Insight and discovery of relationships
- Combined analysis in context and with more depth, from Research & Patents data to News & Legal data
- Meaningful analysis of the data by combining patterns in the text with semantic concepts extracted from the text

Big Data processing concepts
- Move processing to the data
- Process data sequentially, avoid random access
- Seamless scalability: scale out, not up

Types of Data Sets
- Records
  - Relational records
  - Data matrix, e.g., numerical matrix
  - Document data: text documents as term-frequency vectors
  - Transaction data
- Graphs and networks
  - Web, social or information networks
  - Molecular structures
- Ordered data
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image and multimedia data
  - Spatial data: maps
  - Image data
  - Video data
Data characteristics:
- Dimensionality
- Resolution
- Distribution

What does visualization provide
Purpose of Visualization:
- Gain insight by mapping data onto graphical primitives
- Provide a qualitative overview of large data sets
- Explore patterns, trends, structure, irregularities, and relationships among data
- Help find interesting regions and suitable parameters for further quantitative analysis
- Provide a visual proof of computer representations derived
(Data Mining: Concepts and Techniques, April 17, 2012)

Text Analytics examples
- Research papers (on Ebola)
- Wikileaks cables over time

Clustering of Chinese text using patents from the IFI Claims database

Cluster visualization of classified patents

Automatic annotation and zooming on the documents
Figure panels: Zoom level 1, Zoom level 2, Zoom level 3

Automated Text Classification Explained
Pipeline: Text Data -> Text Preprocessing -> Text Classification -> Presentation & Deployment
- Training data: original data with known output (Yes, No, Yes), used to build the text classifier
- Test data: new data with unknown output (???), for which the text classifier produces a predicted output (Yes, No, Yes)

Building the feature vectors
- Original text: Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans
- Tokenization: sing; o; goddess; the; anger; of; achilles; son; of; peleus; that; brought; countless; ills; upon; the; achaeans
- Stopword removal: sing; goddess; anger; achilles; son; peleus; brought; countless; ills; achaeans
- Stemming: sing; god; anger; achilles; son; peleus; brin; count; ill; achae
- Vectorization: (0,0,1,0,1,0,0,0..)
The resulting vectors are very high dimensional (d > 1000) and very sparse.
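The preprocessing steps on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the KMX implementation: the stopword list and the six-term vocabulary are assumptions chosen so the example stays small, and the stemming step is left out because its output depends on which stemmer is used.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"o", "the", "of", "that", "upon"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def vectorize(terms, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(terms)
    return [counts[term] for term in vocabulary]

text = ("Sing, O goddess, the anger of Achilles son of Peleus, "
        "that brought countless ills upon the Achaeans")
terms = remove_stopwords(tokenize(text))

# Hypothetical six-term vocabulary; a real one holds every term in the collection.
vocabulary = ["achaeans", "achilles", "anger", "hector", "sing", "troy"]
vector = vectorize(terms, vocabulary)
print(terms)
print(vector)
```

With a realistic vocabulary of thousands of terms, almost every entry of such a vector is zero, which is the sparsity the slide points out.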

Building a classifier and doing the classification
- Acquire data
- Label subset (relevant / irrelevant)
- Ranked results (by relevance)

Using classification to generate a ranked list
- Ranked results
- Threshold
- Classification: Class A, Class B
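Turning a ranked list into two classes with a threshold can be sketched as follows. The document names, the scores, and the threshold of 50 are hypothetical values for illustration; the slide does not state which threshold is used.

```python
def rank_and_split(scores, threshold):
    """Sort documents by descending score and split the ranking at the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    class_a = [doc for doc, score in ranked if score >= threshold]
    class_b = [doc for doc, score in ranked if score < threshold]
    return ranked, class_a, class_b

# Hypothetical classifier scores per document.
scores = {"doc1": 12.0, "doc2": 87.5, "doc3": 50.0, "doc4": 64.0}
ranked, class_a, class_b = rank_and_split(scores, threshold=50.0)
print(ranked)
print("Class A:", class_a)
print("Class B:", class_b)
```

Raising the threshold shrinks Class A to the highest-ranked documents, and lowering it widens the set.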

Multi-class classification
Class 1, Class 2, ..., Class n

Classifying the vectors
- Classes are separated by a line (d=2), a plane (d=3), or a hyperplane (d>3).
- The Support Vector Machines (SVM) algorithm is used to determine the optimal separating (hyper-)plane.
- Unknown examples (red dot) are classified according to their position with respect to the hyperplane.
Figure labels: vectors of documents of class A; vectors of documents not of class A; Score = 100, Score = 50, Score = 0.
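Classifying a vector by its position relative to a hyperplane can be sketched as below. The weights w and offset b are hypothetical; in practice they would come from SVM training, which is not shown here. The mapping of the signed decision value to a 0-100 score through a logistic function is an assumption made for illustration, chosen so that a point exactly on the hyperplane scores 50; the slide does not say how its scores are computed.

```python
import math

def decision_value(w, b, x):
    """Signed value w.x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class A to points on the non-negative side of the hyperplane."""
    return "A" if decision_value(w, b, x) >= 0 else "not A"

def score_0_100(w, b, x):
    """Assumed mapping of the decision value to a 0-100 score (50 on the hyperplane)."""
    return 100.0 / (1.0 + math.exp(-decision_value(w, b, x)))

# Hypothetical hyperplane in d=2 (a line): x1 - x2 = 0.
w, b = [1.0, -1.0], 0.0
for x in ([2.0, 0.0], [0.0, 0.0], [0.0, 3.0]):
    print(x, classify(w, b, x), round(score_0_100(w, b, x), 1))
```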

Improving the classifier
Once we have created the first classifier and used it to classify the rest of the available documents, we can use the classification results to suggest additional training documents.
Cycle: Suggestion -> Labeling -> Improved classifier
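One common way to suggest additional training documents is to pick the unlabeled documents whose scores lie closest to the decision threshold, since these are the ones the classifier is least sure about. The sketch below assumes this strategy; the slide does not state how suggestions are actually chosen, and the scores, the threshold of 50, and the function name are hypothetical.

```python
def suggest_for_labeling(scores, labeled, threshold=50.0, k=2):
    """Return the k unlabeled documents whose scores are nearest the threshold."""
    candidates = [
        (abs(score - threshold), doc)
        for doc, score in scores.items()
        if doc not in labeled
    ]
    candidates.sort()  # smallest distance to the threshold first
    return [doc for _, doc in candidates[:k]]

# Hypothetical scores from a first classifier; d5 is already labeled.
scores = {"d1": 95.0, "d2": 52.0, "d3": 10.0, "d4": 45.0, "d5": 49.0}
suggestions = suggest_for_labeling(scores, labeled={"d5"})
print(suggestions)
```

After an expert labels the suggested documents, they are added to the training set and the classifier is rebuilt.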

Control over the models
Figure axes: robustness (high to low) against quality of fit (low accuracy to high accuracy).
- Under-fit model: high robustness; training error = test error
- Robust model: low training error, low test error
- Over-fit model: low robustness; no training error, high test error
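The three cases on this slide can be told apart by comparing the error on the training data with the error on held-out test data. The sketch below is a rough illustration; the cut-off values (0.2 for a high error and 0.1 for a large gap) are arbitrary assumptions and not values from the slides.

```python
def error_rate(predicted, actual):
    """Fraction of predictions that differ from the known labels."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

def diagnose_fit(train_error, test_error, high=0.2, gap=0.1):
    """Label a model as over-fit, under-fit, or robust from its two error rates."""
    if test_error - train_error > gap:
        return "over-fit"    # much worse on unseen data than on training data
    if train_error > high:
        return "under-fit"   # similar errors, but both too high
    return "robust"          # low training error and low test error

print(diagnose_fit(0.0, 0.4))
print(diagnose_fit(0.3, 0.3))
print(diagnose_fit(0.05, 0.07))
```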

Concept detection using document classification
1. Visualization => multiple topic clusters
2. Select cluster => select documents with similar topics
3. Select training documents within the subcluster
4. Build classifier and classify
5. Rank documents => find set of documents with related concepts
6. Extract concepts
Extracting concepts in context from classified documents

Why is semantics important in Big Data Analytics
Semantics captures the meaning of terms through:
- Thesauri
- Taxonomies
- Ontologies
Semantics is required for meaningful and in-depth interpretation of patterns in the data:
- Capturing the precise meaning of terms is essential, because we can only build on pre-existing knowledge
- Better and more precise search results
- Efficient knowledge discovery
- This enables a more integrated search across multiple sources
Where is semantics applied?
- Data / text mining
- Data integration and information linkage
- Linking concepts over multiple data sources

Extending the query with special terms

Proportion  IPC classes               Automatically determined representative words
31.3%       F02C F01K F25B F22B B01D  steam cooling heat water air
26.4%       F02C F02K F28D F17C F01C  compressor air compressed fluid combustion
7.7%        F02C F01D F03B F23R F02K  edge blade trailing region rotor
7.4%        F02C F01K F01D F22B F04D  steam pressure blade cooling intermediate
5.9%        F02C C01B B01D C10J F23G  vocs carbon hydrogen process synthesis
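Representative words for a cluster can be found by asking which terms are more frequent inside the cluster than in the collection as a whole. The sketch below assumes this frequency-ratio approach; the slides do not describe how the representative words in the table are actually determined, and the two small clusters are made-up data.

```python
from collections import Counter
from fractions import Fraction

def representative_words(clusters, top_n=2):
    """clusters maps a cluster name to a list of documents (each a list of terms)."""
    overall = Counter(t for docs in clusters.values() for doc in docs for t in doc)
    total = sum(overall.values())
    result = {}
    for name, docs in clusters.items():
        local = Counter(t for doc in docs for t in doc)
        local_total = sum(local.values())
        # Ratio of in-cluster relative frequency to collection-wide relative frequency.
        lift = {t: Fraction(c * total, local_total * overall[t]) for t, c in local.items()}
        # Highest ratio first; ties broken by in-cluster count, then alphabetically.
        ranked = sorted(lift, key=lambda t: (-lift[t], -local[t], t))
        result[name] = ranked[:top_n]
    return result

# Hypothetical clusters of already preprocessed documents.
clusters = {
    "cluster 1": [["steam", "cooling", "air"], ["steam", "heat", "air"]],
    "cluster 2": [["compressor", "air", "fluid"], ["compressor", "air", "combustion"]],
}
words = representative_words(clusters)
print(words)
```

A word such as "air" that is common in every cluster gets a low ratio, so it does not rank as representative here even though it occurs often.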

Auto reporting from the context
- Priority Countries
- Priority Years
- Coverage Countries

KMX Technology overview
- Acquire documents
- Text Preprocessing and Indexing
- Clustering
- Classification
- Visualization
- Semantic Analysis: Taxonomies, Ontologies
- Result presentation

Clustering and point placement approaches

From text to image clustering

Clustering of a Medical Image Data Set

Advantages from machine learning classifiers
Better:
- Coverage. Relevance ranking allows a broader initial result set.
- Quality. High precision and recall.
- Seamless. Integrates into current processes.
Faster:
- Efficient. Only a fraction of the document set is studied by an expert.
- Reuse. Can be reapplied to new document sets.
- Sharing. Can be shared.
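Precision and recall, named above as the quality measures, can be computed from the set of documents a classifier marks as relevant and the set an expert judges relevant. The document identifiers below are hypothetical.

```python
def precision_recall(predicted_relevant, truly_relevant):
    """Precision: share of predicted documents that are relevant.
    Recall: share of relevant documents that were predicted."""
    predicted = set(predicted_relevant)
    relevant = set(truly_relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: 4 documents predicted relevant, 3 truly relevant, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(precision, recall)
```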

Conclusions
- Big Data Analytics: the data is growing in size and complexity, which calls for combined analysis of multiple data sets, from structured data (tables, images) to unstructured data (text).
- We need to find patterns in context from structured and unstructured data using:
  - Machine learning: use classification and clustering combined
  - Visualization: enable the user to explore the patterns in the data to make better decisions faster

TREPAREL
TRENDS - PATTERNS - RELATIONS
ENABLING YOU TO SEE MORE!