Big Data Text Mining and Visualization. Anton Heijs

Treparel Information Solutions BV, Delftechpark 26, Suite 2-26, 2628 XH Delft, Netherlands, info@treparel.com

Copyright 2007 by Treparel Information Solutions BV. Neither this report nor any part of it may be copied, circulated or quoted without prior written approval from Treparel.

Overview
- Challenges for Big Data analytics
- Machine learning
- Clustering
- Classification
- Visualization
- Analyze more data and capture its context

Big Data Analytics
Drivers for Big Data analytics:
- Data grows fast, and 80% of the data is text
- Fewer people with less time for in-depth analysis
- Growing need for data-driven decisions
- The meaning and implications of patterns in the data are key
Knowledge discovery from text data provides:
- More information in the long tail of big data (Zipf's law)
- Insight and discovery of relationships
- Combined analysis in context and with more depth, from research and patent data to news and legal data
- Meaningful analysis of the data by combining patterns in the text with semantic concepts extracted from the text

Big Data processing concepts
- Move processing to the data
- Process data sequentially, avoid random access
- Seamless scalability: scale out, not up
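
The processing concepts above can be sketched in a few lines. This is only an illustration under stated assumptions: the partitions are small in-memory lists, and the function names `stream_term_counts` and `merge_counts` are hypothetical, not part of any Treparel product.

```python
from collections import Counter
from typing import Iterable


def stream_term_counts(lines: Iterable[str]) -> Counter:
    """Count terms in one sequential pass, without random access to the data."""
    counts = Counter()
    for line in lines:  # a single forward pass over the partition
        counts.update(line.lower().split())
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Combine per-partition counts, mimicking the reduce step of a scale-out job."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two hypothetical partitions, each counted where its data lives.
partition_a = ["steam cooling heat", "steam pressure"]
partition_b = ["blade rotor", "steam blade"]
total = merge_counts(
    [stream_term_counts(partition_a), stream_term_counts(partition_b)]
)
```

Adding a partition adds one more independent `stream_term_counts` call, which is the sense in which such a job scales out rather than up.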

Types of Data Sets
- Records
  - Relational records
  - Data matrix, e.g., numerical matrix
  - Document data: text documents: term-frequency vector
  - Transaction data
- Graphs and networks
  - Web, social or information networks
  - Molecular structures
- Ordered data
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image and multimedia data
  - Spatial data: maps
  - Image data
  - Video data
- Data characteristics
  - Dimensionality
  - Resolution
  - Distribution

What does visualization provide
Purpose of visualization:
- Gain insight by mapping data onto graphical primitives
- Provide a qualitative overview of large data sets
- Explore patterns, trends, structure, irregularities and relationships among data
- Help find interesting regions and suitable parameters for further quantitative analysis
- Provide a visual proof of computer representations derived
(Data Mining: Concepts and Techniques, April 17, 2012)

Text Analytics examples
- Research papers (on Ebola)
- Wikileaks cables over time

Clustering of Chinese text using patents from the IFI Claims database

Cluster visualization of classified patents

Automatic annotation and zooming on the documents
[Figure: annotated document views at zoom levels 1, 2 and 3]

Automated Text Classification Explained
[Diagram: original data (training data, known output: Yes, No, Yes) feeds the text classifier; new data (test data, unknown output: ???) receives a predicted output (Yes, No, Yes)]
Pipeline: Text Data, Text Preprocessing, Text Classification, Presentation & Deployment

Building the feature vectors
- Original text: Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans
- Tokenization: sing; o; goddess; the; anger; of; achilles; son; of; peleus; that; brought; countless; ills; upon; the; achaeans
- Stopword removal: sing; goddess; anger; achilles; son; peleus; brought; countless; ills; achaeans
- Stemming: sing; god; anger; achilles; son; peleus; brin; count; ill; achae
- Vectorization: (0,0,1,0,1,0,0,0..)
Very high dimensional! (d 1000)
Very sparse!
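
A minimal sketch of the four steps on this slide, using only the Python standard library. The stopword list, the suffix rules and the two extra vocabulary terms are assumptions made for illustration. The toy stemmer is not the stemmer used on the slide, so its stems will not match the slide exactly.

```python
import re

STOPWORDS = {"o", "the", "of", "that", "upon"}  # tiny illustrative list (assumption)
SUFFIXES = ("less", "ess", "s")  # toy suffix rules, not a real stemming algorithm


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(terms, vocabulary):
    """Binary vector: 1 where a vocabulary term occurs in the document."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


text = ("Sing, O goddess, the anger of Achilles son of Peleus, "
        "that brought countless ills upon the Achaeans")
terms = [stem(t) for t in remove_stopwords(tokenize(text))]
# The vocabulary normally comes from the whole corpus; two extra
# hypothetical terms stand in for words from other documents.
vocabulary = sorted(set(terms) | {"ship", "war"})
vector = vectorize(terms, vocabulary)
```

With a realistic corpus the vocabulary holds thousands of terms while each document uses only a few of them, which is why the vectors are high dimensional and sparse.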

Building a classifier and doing the classification
[Diagram: acquire data, label a subset as relevant or irrelevant, obtain results ranked by relevance]

Using classification to generate a ranked list
[Diagram: a threshold applied to the ranked results splits them into Class A and Class B]
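
A short sketch of turning classifier scores into a ranked list and a two-class split. The scores, the document names, the threshold of 50 and the choice that documents at or above the threshold fall in Class A are all assumptions for illustration.

```python
def rank_and_split(scores, threshold=50.0):
    """Sort documents by descending score, then split them at the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    class_a = [doc for doc, score in ranked if score >= threshold]
    class_b = [doc for doc, score in ranked if score < threshold]
    return ranked, class_a, class_b


# Hypothetical classifier scores on a 0 to 100 scale.
scores = {"doc1": 12.0, "doc2": 87.5, "doc3": 50.0, "doc4": 63.0}
ranked, class_a, class_b = rank_and_split(scores)
```

Moving the threshold trades coverage against precision: a lower value places more documents in Class A.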

Multi-class classification
[Diagram: documents are assigned to Class 1, Class 2, ..., Class n]
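
The slide does not say which multi-class scheme is used. One common option, assumed here, is one-vs-rest: one linear scorer per class, with the document assigned to the class whose scorer gives the highest value. The weights below are made up for illustration, not trained.

```python
def linear_score(weights, bias, vector):
    """Signed score of a vector under one linear model: w . x + b."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def predict_class(models, vector):
    """One-vs-rest: score the vector with every class model, pick the best."""
    scores = {
        label: linear_score(weights, bias, vector)
        for label, (weights, bias) in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical, hand-picked models as (weights, bias) pairs.
models = {
    "Class 1": ([1.0, 0.0, 0.0], -0.5),
    "Class 2": ([0.0, 1.0, 0.0], -0.5),
    "Class n": ([0.0, 0.0, 1.0], -0.5),
}
label, scores = predict_class(models, [0.0, 1.0, 0.0])
```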

Classifying the vectors
- Classes are separated by a line (d=2), a plane (d=3) or a hyperplane (d>3).
- The Support Vector Machines (SVM) algorithm is used to determine the optimal separating (hyper-)plane.
- Unknown examples (red dot) are classified according to their position with respect to the hyperplane.
[Figure: vectors of documents of class A and vectors of documents not of class A on either side of the separating plane, with score levels 100, 50 and 0]
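
A sketch of classifying a vector by its position relative to a hyperplane that is already given. It does not train an SVM: the weights and bias are hand-picked. Mapping the signed value to a 0 to 100 score with a logistic function, so that a point on the hyperplane scores 50, is an assumption and not necessarily how the scores on the slide are computed.

```python
import math


def decision_value(weights, bias, vector):
    """Signed value of w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def score(weights, bias, vector):
    """Squash the decision value into the range 0 to 100 (assumed logistic mapping)."""
    return 100.0 / (1.0 + math.exp(-decision_value(weights, bias, vector)))


# Hypothetical separating line in d=2: x1 - x2 = 0.
weights, bias = [1.0, -1.0], 0.0

on_plane = score(weights, bias, [2.0, 2.0])   # lies on the line
in_class = score(weights, bias, [3.0, 0.0])   # positive side, class A
not_class = score(weights, bias, [0.0, 3.0])  # negative side, not class A
```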

Improving the classifier
Once we have created the first classifier and used it to classify the rest of the available documents, we can use the classification results to suggest additional training documents.
[Diagram: Suggestion, Labeling, Improved]
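
The slide does not state how the suggested documents are chosen. One common strategy, assumed here, is uncertainty sampling: suggest the unlabeled documents whose decision value lies closest to the separating hyperplane, since the classifier is least sure about them. The function name and the values are hypothetical.

```python
def suggest_for_labeling(decision_values, k=2):
    """Return the k documents with the smallest absolute decision value."""
    by_uncertainty = sorted(
        decision_values, key=lambda doc: abs(decision_values[doc])
    )
    return by_uncertainty[:k]


# Hypothetical signed distances of unlabeled documents to the hyperplane.
decision_values = {"d1": 2.1, "d2": -0.1, "d3": 0.4, "d4": -3.0}
suggestions = suggest_for_labeling(decision_values)
```

After an expert labels the suggested documents, they are added to the training set and the classifier is rebuilt.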

Control over the models
[Chart: robustness (high to low) against quality of fit (low accuracy to high accuracy)]
- Under-fit model: high robustness; training error = test error
- Robust model: low training error, low test error
- Over-fit model: low robustness; no training error, high test error
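
The three regimes on this slide can be told apart by comparing training error with test error. The sketch below is a rough heuristic. The cut-off values `high` and `gap` are arbitrary assumptions that would need tuning for a real data set.

```python
def diagnose_fit(train_error, test_error, high=0.2, gap=0.1):
    """Label a model as over-fit, under-fit or robust from its two error rates."""
    if test_error - train_error > gap:
        return "over-fit"    # fits the training data far better than new data
    if train_error > high:
        return "under-fit"   # errors are similar but both are high
    return "robust"          # low training error and low test error


under = diagnose_fit(train_error=0.30, test_error=0.30)
robust = diagnose_fit(train_error=0.05, test_error=0.07)
over = diagnose_fit(train_error=0.00, test_error=0.40)
```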

Concept detection using document classification
1. Visualization => multiple topic clusters
2. Select cluster => select documents with similar topics
3. Select training documents within the subcluster
4. Build classifier and classify
5. Rank documents => find set of documents with related concepts
6. Extract concepts
Extracting concepts in context from classified documents

Why is semantics important in Big Data Analytics?
Semantics captures the meaning of terms by means of:
- Thesauri
- Taxonomies
- Ontologies
Semantics is required for meaningful and in-depth interpretation of patterns in the data:
- Capture the precise meaning of terms, which is essential because we can only build on pre-existing knowledge
- Better and more precise search results
- Efficient knowledge discovery
- This enables searching multiple sources in an integrated way
Where is semantics applied?
- Data / text mining
- Data integration and information linkage
- Linking concepts over multiple data sources

Extending the query with special terms

Proportion  IPC classes               Automatically determined representative words
31.3%       F02C F01K F25B F22B B01D  steam cooling heat water air
26.4%       F02C F02K F28D F17C F01C  compressor air compressed fluid combustion
7.7%        F02C F01D F03B F23R F02K  edge blade trailing region rotor
7.4%        F02C F01K F01D F22B F04D  steam pressure blade cooling intermediate
5.9%        F02C C01B B01D C10J F23G  vocs carbon hydrogen process synthesis
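
The slide does not describe how the representative words are determined. One simple possibility, assumed here, is to weight each word's count inside a cluster by how few clusters contain it, in the spirit of TF-IDF. The cluster names and documents below are invented for illustration.

```python
import math
from collections import Counter


def representative_words(clusters, top_n=2):
    """Rank each cluster's words by in-cluster count times a rarity factor."""
    counts = {
        name: Counter(word for doc in docs for word in doc.split())
        for name, docs in clusters.items()
    }
    # Number of clusters in which each word occurs at least once.
    cluster_freq = Counter(word for c in counts.values() for word in c)
    n = len(clusters)
    result = {}
    for name, c in counts.items():
        weight = {
            word: tf * math.log((1 + n) / (1 + cluster_freq[word]))
            for word, tf in c.items()
        }
        # Highest weight first; ties are broken alphabetically.
        ordered = sorted(weight, key=lambda word: (-weight[word], word))
        result[name] = ordered[:top_n]
    return result


# Hypothetical clusters of already preprocessed documents.
clusters = {
    "steam": ["steam cooling heat air", "steam water air"],
    "compressor": ["compressor air compressed", "compressor fluid air"],
}
words = representative_words(clusters)
```

A word such as "air" that occurs in every cluster gets weight zero, so it is not picked as representative, and the selected words can then be added to the query.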

Auto reporting from the context
[Charts: priority countries, priority years, coverage countries]

KMX Technology overview
- Acquire documents
- Text preprocessing and indexing
- Clustering
- Classification
- Visualization
- Semantic analysis (taxonomies, ontologies)
- Result presentation

Clustering and point placement approaches

From text to image clustering

Clustering of a Medical Image Data Set

Advantages from machine learning classifiers
Better:
- Coverage: relevance ranking allows a broader initial result set.
- Quality: high precision and recall.
- Seamless: integrates into current processes.
Faster:
- Efficient: only a fraction of the document set is studied by an expert.
- Reuse: can be reapplied to new document sets.
- Sharing: can be shared.
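
Precision and recall, named under "Quality" above, are computed from the set of documents the classifier returns and the set that is actually relevant. The document identifiers below are hypothetical.

```python
def precision_recall(predicted, relevant):
    """Precision: share of returned documents that are relevant.
    Recall: share of relevant documents that were returned."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three truly relevant, two in common.
precision, recall = precision_recall(
    predicted={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
```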

Conclusions
Big Data analytics:
- The data is growing in size and complexity
- Combined analysis of multiple data sets, from structured data (tables, images) to unstructured data (text)
We need to find patterns in context from structured and unstructured data using:
- Machine learning: use classification and clustering combined
- Visualization: enable the user to explore the patterns in the data to make better decisions faster

TREPAREL: TRENDS - PATTERNS - RELATIONS
ENABLING YOU TO SEE MORE!