Real-Life Industrial Process Data Mining RISC-PROS. Activities of the RISC-PROS project




Real-Life Industrial Process Data Mining
Activities of the RISC-PROS project
Department of Mathematical Information Technology, University of Jyväskylä, Finland
Postgraduate Seminar in Information Technology, Jyväskylä, 26th Feb 2009

Outline
1 The RISC-PROS project: Data Mining; Clustering, Classification, Dimension Reduction; RISC-PROS; The Industrial Setting
2 Background 1/3: Robust Clustering; Background 2/3: Robust MLP Training; Background 3/3: Diffusion Geometries; Plans of Method Development within RISC-PROS
3 Project Status, Relation to Other Research; A Side Note About Openness of Algorithms

Data Mining Concepts
Data Mining (DM) is a broad field of goals, methods, and techniques for dealing with masses of numerical data.
The data in data mining can be anything that can be numerically encoded, usually measurements from a real-world system (such as a human heart, a lake ecosystem, or an industrial factory process).
The objective of DM is to extract useful knowledge or patterns from the numerical bulk.

Data: the Ore of Data Mining
In the most confined case, DM deals with dataset matrices, that is, sets of row vectors of variables measured from a real-life system (or produced by simulating one).
A very traditional example is the Iris flower dataset:
Botanic measurements:
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
4.6, 3.1, 1.5, 0.2
5.0, 3.6, 1.4, 0.2
5.4, 3.9, 1.7, 0.4
4.6, 3.4, 1.4, 0.3
5.0, 3.4, 1.5, 0.2
4.4, 2.9, 1.4, 0.2
...
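As a small illustration of the "dataset matrix" idea, the nine Iris rows quoted above can be held as an n-by-p array, one row per observation and one column per variable. This is only a sketch; the names X, n, p and column_means are chosen here for illustration and do not come from the slides.

```python
import numpy as np

# The nine Iris rows quoted on the slide: one row per flower,
# one column per measured variable.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
    [4.6, 3.4, 1.4, 0.3],
    [5.0, 3.4, 1.5, 0.2],
    [4.4, 2.9, 1.4, 0.2],
])

n, p = X.shape                   # n observations, p variables
column_means = X.mean(axis=0)    # one summary value per variable
print(n, p, column_means)
```

Clustering, classification and dimension reduction all take a matrix of this shape as their input.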

General Goals and Successes of DM
So, we can collect data from plants, animals, humans, genes, chemicals, traffic, web shopping carts, Facebook, ...
Successful DM can provide:
- A web search engine that quickly finds readings for you, based on search keywords.
- Protection against email spam or computer viruses.
- Automatic suggestions of which movies you'd like to watch.
- Targeted advertising, if you had something to sell.
- Automatic computation of populations based on digital photographs, if you were a biologist.
- Diagnoses, if you were a physician.
- ... and countless other things that improve life, humanity, or business.

General Challenges in Data Mining
- When there is little data, there's nothing to mine.
- When the size and dimension of the data matrix increase, DM becomes useful but challenging.
- Measurements are noisy.
- Measurements are wrong.
- Measurements are missing.
- Measurements, by themselves, don't tell what they can or should be used for.
- ...

Some Specific Data Mining Activities
Some examples of activities that could be considered DM are:
- Clustering: finding groups in existing data.
- Classification: finding a group label for new data, based on learning from existing labeled data.
- Dimension Reduction: mapping the data to a lower dimension in a suitable way.
These three tasks are of special interest to our ongoing research project.

Clustering
Figure: Clustering is the assignment of objects (observations) into categories (clusters) in a way that the objects in the same cluster are somehow similar to each other, and different from objects in other clusters.

Classification
Figure: Classification is the assignment of a new, never before seen, object (observation) into a category (class); this can be called recognition.

Dimension Reduction
Botanic measurements:
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
4.6, 3.1, 1.5, 0.2
5.0, 3.6, 1.4, 0.2
5.4, 3.9, 1.7, 0.4
4.6, 3.4, 1.4, 0.3
5.0, 3.4, 1.5, 0.2
4.4, 2.9, 1.4, 0.2
...
Figure: Dimension reduction is a mapping of high-dimensional data (observations with many variables) into a new dataset with fewer variables; a useful dimension reduction technique preserves indicative properties of the original observations.

Main Goals and Collaboration
We are now underway in a project called RISC-PROS.
- Longer title: Real Time Industrial Process and Sensor Network Management Using Data Mining and Dimension Reduction.
- Researchers from Jyväskylä and Tel Aviv (me, Neta Rabin, Tommi Kärkkäinen, Amir Averbuch).
- Companies from Finland: Metso Paper, Fluidhouse, Patria Aviation.
- Funding from the European Union, Regional Development Fund, via Tekes.
- Total 2 years.
- Goal: a business boost for the industry by applying and integrating novel data mining techniques in the products and development.

The Industrial Setting: Lines of Business
- Metso Paper develops paper machinery; we look for new ways to help the engineer in interpreting measurements from a prototype machine.
- Fluidhouse produces fluid automation systems, for example hydraulic units in factories and dockyards; we look for automatic prediction of maintenance schedules.
- Patria Aviation develops airborne systems for civil and defence industries; our research deals with airborne radars.

The Industrial Setting: Extents of Data
In the industrial cases of the project, so far we seem to deal with:
- Measurements of ca. 900 variables from a continuously evolving physical process (a sampling rate is involved).
- Database storage that records a point value only when it has changed by more than a threshold (interpolation is needed if we want a full data matrix).
- Processes spanning multiple scales of time: from instantaneous to days, weeks or months.
- Gigabytes of numbers.
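The remark about change-driven storage can be made concrete with a short sketch. It assumes each variable is stored as a list of (time, value) records and shows two common ways of filling a regular time grid: linear interpolation and holding the last stored value. The function name to_regular_grid and its parameters are hypothetical; the slides do not say which interpolation the project uses.

```python
import numpy as np

def to_regular_grid(times, values, step, hold=False):
    """Resample change-driven records onto a regular time grid.

    times, values: the stored records of one variable (times ascending).
    step: spacing of the regular grid, in the same unit as `times`.
    hold: if True, repeat the last stored value (zero-order hold);
          otherwise interpolate linearly between stored records.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.arange(times[0], times[-1] + step / 2, step)
    if hold:
        # Index of the latest record at or before each grid point.
        idx = np.searchsorted(times, grid, side="right") - 1
        return grid, values[idx]
    return grid, np.interp(grid, times, values)

# Three stored records of one variable, resampled at unit steps.
t = [0.0, 2.0, 5.0]
v = [1.0, 3.0, 0.0]
grid, lin = to_regular_grid(t, v, step=1.0)
_, held = to_regular_grid(t, v, step=1.0, hold=True)
print(grid, lin, held)
```

Repeating this for every variable on a shared grid yields one column each of a full data matrix. Holding the last value matches the storage rule more literally (no record means no change above the threshold), while linear interpolation gives a smoother signal.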

Research Background
In addition to seeking a business boost, RISC-PROS aims to continue the development of three computational methods that have been researched earlier at Jyväskylä and at Tel Aviv:
- Robust Clustering
- Robust MLP Training with Regularization
- Diffusion Geometry Framework

Statistical robustness
Define the l_q norm as follows:
\[ \|u\|_q = \Big( \sum_{i=1}^{p} |(u)_i|^q \Big)^{1/q}, \quad q < \infty. \tag{1} \]
The norm l_2 is the Euclidean distance of a point from the origin, l_1 is the city block distance.
A general K-means clustering algorithm would minimize
\[ \min_{c \in \mathbb{N}^n,\; m_k \in \mathbb{R}^p} J\big(c, \{m_k\}_{k=1}^{K}\big) = \sum_{i=1}^{n} \big\| P_i \big( x_i - m_{(c)_i} \big) \big\|_q^{\alpha} \tag{2} \]
subject to (c)_i \in \{1, ..., K\} for all i = 1, ..., n.
For notation key and details, please observe the public report of RISC-PROS, to be published on-line.
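The norm (1) and the cost (2) can be evaluated with a few lines of NumPy. The sketch below only computes the cost for given labels and prototypes and is not the project's clustering algorithm. It makes two assumptions that the slide leaves to the report: P_i is read as a projection that drops the missing components of x_i (encoded here as NaN), and labels are zero-based integers. The function names are illustrative.

```python
import numpy as np

def lq_norm(u, q):
    """l_q norm of a vector, as in equation (1), for finite q."""
    u = np.asarray(u, dtype=float)
    return float(np.sum(np.abs(u) ** q) ** (1.0 / q))

def clustering_cost(X, labels, prototypes, q=2, alpha=2):
    """Cost (2): sum over i of ||P_i (x_i - m_{c_i})||_q^alpha.

    Assumption: missing entries of X are NaN and P_i removes them
    from the norm (their contribution is set to zero).
    """
    X = np.asarray(X, dtype=float)
    M = np.asarray(prototypes, dtype=float)[np.asarray(labels)]
    diffs = np.where(np.isnan(X), 0.0, X - M)
    norms = np.sum(np.abs(diffs) ** q, axis=1) ** (1.0 / q)
    return float(np.sum(norms ** alpha))

# Two observations, the second one with a missing variable.
X = [[0.0, 0.0], [2.0, np.nan]]
labels = [0, 0]
prototypes = [[1.0, 1.0]]
print(lq_norm([3.0, 4.0], 2))                                # Euclidean length
print(clustering_cost(X, labels, prototypes, q=2, alpha=2))  # K-means style cost
print(clustering_cost(X, labels, prototypes, q=1, alpha=1))  # city block cost
```

With q = alpha = 2 this is the usual K-means sum of squared errors. Smaller exponents, such as q = alpha = 1, reduce the weight given to large deviations, which is why such costs are regarded as more robust to outliers.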

A Traditional Multi-Layered Perceptron
Figure: Neural architecture of a simple MLP neural network

Novel Ideas for Training the MLP
Define the l_q norm as follows:
\[ \|u\|_q = \Big( \sum_{i=1}^{p} |(u)_i|^q \Big)^{1/q}, \quad q < \infty. \tag{3} \]
We choose to train our MLP by minimizing
\[ L^{\alpha}_{q,\beta}\big(\{W^l\}\big) = \frac{1}{\alpha n} \sum_{i=1}^{n} \big\| \mathcal{N}\big(\{W^l\}\big)(x_i) - y_i \big\|_q^{\alpha} + \beta \sum_{l=1}^{L} \frac{1}{2 S_l} \sum_{(i,j) \in I_l} \big| W^l_{i,j} \big|^2 \tag{4} \]
for \beta \ge 0.
For notation key and details, please observe the public report of RISC-PROS, to be published on-line.
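To make the structure of cost (4) concrete, here is a minimal sketch that evaluates a data-fit term plus a weight penalty for a small MLP. Several details are assumptions, because the slide defers the notation to the report: the network uses tanh hidden units and a linear output, each weight matrix carries its bias in the first column, I_l is taken to be all non-bias weights of layer l, and S_l is taken to be the number of those weights. All function names are illustrative.

```python
import numpy as np

def mlp_forward(weights, x):
    """Forward pass of a simple MLP (assumed: tanh hidden, linear output)."""
    a = np.asarray(x, dtype=float)
    for l, W in enumerate(weights):
        z = W @ np.concatenate(([1.0], a))   # leading 1 feeds the bias column
        a = z if l == len(weights) - 1 else np.tanh(z)
    return a

def regularized_cost(weights, X, Y, q=2, alpha=2, beta=0.0):
    """Fit term plus beta times a squared-weight penalty, following (4)."""
    n = len(X)
    fit = 0.0
    for x, y in zip(X, Y):
        r = mlp_forward(weights, x) - np.asarray(y, dtype=float)
        fit += np.sum(np.abs(r) ** q) ** (alpha / q)   # ||r||_q^alpha
    fit /= alpha * n
    penalty = 0.0
    for W in weights:
        core = W[:, 1:]                                  # assumed I_l: non-bias weights
        penalty += np.sum(core ** 2) / (2 * core.size)   # assumed S_l = |I_l|
    return float(fit + beta * penalty)

# A one-layer "network" that simply copies its scalar input.
weights = [np.array([[0.0, 1.0]])]
X = [[1.0], [2.0]]
Y = [[0.0], [0.0]]
print(regularized_cost(weights, X, Y, beta=0.0))
print(regularized_cost(weights, X, Y, beta=0.1))
```

Setting beta = 0 leaves only the fit term; increasing beta trades training fit for smaller weights.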

The Problem of Overfitting
Figure: Progression of cost and accuracy with β = 0 (PenDigits 16-30-10).
When to stop training? As the cost function is minimized,

Heuristic Search for a Good Regularization
Basically subsampling and exhaustive search using a blind metric.
- First, use a subset of the training data, and a loose accuracy or a quick iteration deadline for the local search.
- Later, use the whole training data, and a tighter accuracy, but only once.
More details to be published in the Proceedings of ICANNGA '09 (LNCS series).

Heuristic Search for a Good Regularization
Figure: Progression of cost and accuracy with varying β.

Diffusion Geometry Framework
Construct a graph G = (X, W) on a dataset with a kernel
\[ W = w(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\varepsilon} \]
(which is only one of the possible choices). Then form a Markov matrix M = \{ m(x_i, x_j) \} as
\[ m(x_i, x_j) = \frac{w(x_i, x_j)}{d(x_i)}, \quad \text{where } d(x_i) = \sum_{x_j \in X} w(x_i, x_j). \]
By computing the eigenvalues of a conjugate matrix A given by
\[ a(x_i, x_j) = \sqrt{d(x_i)}\; m(x_i, x_j)\, \frac{1}{\sqrt{d(x_j)}} \]
and using a couple of further steps, one obtains an embedding of the form
\[ \Psi_t(x_i) = \big( \lambda_1^t \psi_1(x_i),\; \lambda_2^t \psi_2(x_i),\; \lambda_3^t \psi_3(x_i),\; \ldots \big), \tag{5} \]
that represents the dataset in a new Euclidean space.
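The construction above can be sketched in NumPy as follows. The "couple of further steps" are not spelled out on the slide, so this sketch fills them in with the standard diffusion-maps recipe as an assumption: eigenvectors of the symmetric matrix A are rescaled by 1/sqrt(d) to obtain eigenvectors of M, and the trivial constant eigenvector (eigenvalue 1) is dropped. The function name and parameters are illustrative.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2, t=1):
    """Embed the rows of X with a Gaussian-kernel diffusion map (sketch)."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared distances and the kernel w = exp(-||xi - xj||^2 / (2 eps)).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * eps))
    d = W.sum(axis=1)
    # Conjugate matrix a_ij = sqrt(d_i) m_ij / sqrt(d_j) = w_ij / sqrt(d_i d_j).
    A = W / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(A)              # A is symmetric
    order = np.argsort(lam)[::-1]           # largest eigenvalue first
    lam, V = lam[order], V[:, order]
    psi = V / np.sqrt(d)[:, None]           # eigenvectors of the Markov matrix M
    # Skip the trivial pair (lambda = 1, constant psi); scale by lambda^t.
    coords = (lam[1:n_coords + 1] ** t) * psi[:, 1:n_coords + 1]
    return coords, lam

# Two small groups of points, three units apart.
points = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
coords, lam = diffusion_map(points, eps=1.0, n_coords=2, t=1)
print(lam)
print(coords)
```

In this toy example the first embedding coordinate takes one sign on the first group and the opposite sign on the second, which shows how the embedding can expose cluster structure. Forming the full n-by-n kernel is only feasible for modest n; larger datasets would need a sparse or subsampled variant.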

Diffusion Geometry Framework
Figure: The original geometry is mapped as a butterfly-shaped set, in which the red and blue phases are organized according to the diffusion they generate: the chord length between two points in the diffusion space measures the quantity of heat that can travel between these points.

An Example Application
The case Wine from the UCI Machine Learning Repository: classification of wines into 3 categories based on 13 chemical measurements.
Figure: (a) ω = 1, (b) ω = 0.9. The 142 wine samples embedded using DM. The algorithm parameter ω controls the strength of labeled class separation in the embedding.

What we plan to do with the methods
- Hybridization/piping of the methods:
  - Visualize the datasets using DG projections.
  - Make the clustering and classification algorithms faster by pre-processing (to a lower dimension) with DG?
- Incorporate missing data handling into all the methods. (Now implemented only for clustering.)

Current State of Affairs in RISC-PROS
- First report written; waiting for comments from the companies; then to be published in the dept. report series.
- Datasets have been acquired from the companies. This tends to take time!
- As of yet, we don't have enough information about the datasets and the actual goals that the companies have (and that we can help with, using our methodology).
- The plan is now to just do some computations and show some visualized results to the company people; we assume that seeing at least something will make it easier to get ideas about what the ultimate goals are.

Collaboration Opportunities
Let me list some completely different topics that could be related to what is happening in RISC-PROS:
- We plan to try population-based optimization (genetic/memetic) with our cost function formulations in the future.
- Visualization methods developed originally for interactive multiobjective optimization (MOO) could be useful for examining the reduced-dimension data. (And vice versa: could suitable dimension reduction make MOO visualizations more intuitive?)
- We already have ideas of incorporating, in a way, a multi-grid methodology in the MLP training.

A Concern
What happens to the method implementations I create while doing my work in an industrially inclined project?
- AFAIK, the university harvests all intellectual property rights to whatever computer programs I write.
- It is simpler for those of us who are not controlled by outside funding; the researcher decides upon his/her code. But how many of us are? And does everybody who works here even know who owns their code?
- My fear is that after some idea has been tried a couple of times in a project, the implementation is forgotten and needs to be re-made by the next people (or the whole idea gets forgotten if the project didn't produce visible-enough reports and journal articles).

More Concerns
- Not all that is produced, especially in implementations, can be used in journals (we don't think source code is the research result to be published). But should a program be lost forever even if not published?
- Consider the research software packages WaveLab and WEKA (the first ones that come to my mind): would they be as popular and recognized (or even as good) as they currently are, if they were closed-source property of the researchers or universities?
- Also, "WaveLab and Reproducible Research" by Buckheit and Donoho was an inspiring read for me; I warmly recommend the paper, if you haven't read it already.

My Personal Salvation
For the RISC-PROS project:
- The agreement states that all implementations of universally applicable mathematical algorithms created as part of RISC-PROS shall be published publicly as open source under the MIT License. For example, the implementations of ideas published in the forthcoming ICANNGA '09 proceedings.
- Naturally, the agreement states that any code revealing industrial trade secrets shall not be made public.
- To make the University IPR people happy, we state that application code directly integrable in an industrial partner's system shall not be open source. (The last, unfortunate, pain that lessens my programming and overall motivation at work.)

My Personal Salvation
What do I hope to gain by open-sourcing the RISC-PROS method implementations?
- I assume it will add to the research group's publicity. Making something more public will definitely not have a negative effect on publicity!
- As for the concern about premature publishing: if there are mistakes, maybe someone will spot them, and they get corrected before our next paper. I can take the blow of having computed wrongly; I'm sure everybody does that sometimes by accident.
- The anticipation of publicity has already made the code a little better in quality than it would have been if it were just for the one project.

One Project Is Not a General Solution
- Note that, as of yet, the RISC-PROS source codes are not published. We are still thinking about a good format, a publishing forum, and the time for going public with the first version.
- It would be nice to have a general recommended practice about this, something that everybody knows about and uses daily.
- Well, now we are attempting to build one!

A More General Solution
We have initiated a loose group (managed by Tero Tuovinen) to discuss the following agenda:
- Create a common source code database (with version control tools) where the mit.jyu.fi department staff would store their research source codes, and share them with others, maybe even the whole world.
- A possibility, maybe a recommendation, not an obligation!
- We are also aware of some difficulties, such as issues with IPRs and licenses, and fears of premature publicity.
- At the moment, technical implementation issues are being investigated.
You're welcome to bring your input and ideas to our attention. More shall be announced.

Acknowledgments
- Former research on clustering by Sami Äyrämö.
- Research is guided by Tommi Kärkkäinen and Amir Averbuch.
- The RISC-PROS project is funded by the European Union, European Regional Development Fund via Tekes.

Summary
In this talk, I presented:
- The EU-funded project RISC-PROS that I work with.
- Some tasks in Data Mining, and specific methods being developed at our department.
- A stand on the barricades for open source software, and especially for the benefits of producing such software as part of scientific research.