Data Mining and Visualization



Similar documents
Get The Picture: Visualizing Financial Data part 1

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Data Exploration Data Visualization

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COC131 Data Mining - Clustering

Instructions for SPSS 21

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Diagrams and Graphs of Statistical Data

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Visualization Handbook

How To Use Statgraphics Centurion Xvii (Version 17) On A Computer Or A Computer (For Free)

Tutorial Segmentation and Classification

The Delicate Art of Flower Classification

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Step 3: Go to Column C. Use the function AVERAGE to calculate the mean values of n = 5. Column C is the column of the means.

Using Excel for descriptive statistics

Data Mining mit der JMSL Numerical Library for Java Applications

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Visualization Quick Guide

Exercise 1.12 (Pg )

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Summarizing and Displaying Categorical Data

Visualization methods for patent data

Neural Networks Lesson 5 - Cluster Analysis

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

An Introduction to Data Mining

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

MTH 140 Statistics Videos

Drawing a histogram using Excel

Data Visualization. or Graphical Data Presentation. Jerzy Stefanowski Instytut Informatyki

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Figure 1. An embedded chart on a worksheet.

Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs

An Overview of Knowledge Discovery Database and Data mining Techniques

IBM SPSS Direct Marketing 23

How To Cluster

A fast, powerful data mining workbench designed for small to midsize organizations

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA

Data representation and analysis in Excel

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

CHARTS AND GRAPHS INTRODUCTION USING SPSS TO DRAW GRAPHS SPSS GRAPH OPTIONS CAG08

IBM SPSS Direct Marketing 22

Data analysis process

Hierarchical Clustering Analysis

Using multiple models: Bagging, Boosting, Ensembles, Forests

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION


What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

Systat: Statistical Visualization Software

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

Statistical Distributions in Astronomy

Exploratory data analysis (Chapter 2) Fall 2011

Predictive Dynamix Inc

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Data exploration with Microsoft Excel: analysing more than one variable

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Excel Tutorial. Bio 150B Excel Tutorial 1

Microsoft Business Intelligence Visualization Comparisons by Tool

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Colour Image Segmentation Technique for Screen Printing

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Univariate Regression

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

3F3: Signal and Pattern Processing

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Azure Machine Learning, SQL Data Mining and R

Data Mining III: Numeric Estimation

Tutorial for proteome data analysis using the Perseus software platform

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

SPSS Manual for Introductory Applied Statistics: A Variable Approach

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Section 3 Part 1. Relationships between two numerical variables

Data Mining. Dr. Saed Sayad. University of Toronto

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Transcription:

Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford

Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research Statistics and visualization in Excel What s the problem? 2

Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research Statistics and visualization in Excel What s the problem? 3

NAG Data Mining Components (DMC) Data Cleaning Data imputation adding missing values Outlier detection finding suspect data records Data Transformation Scaling Data before distance computation Principal Component Analysis reducing # of variables Cluster Analysis k-means analyst decides # of clusters in data Hierarchical stepwise agglomeration of data 4

DMC: Classification techniques Classification Trees Two types of decision tree: binary (Gini index) n-ary (entropy-based) Generalized Linear Models Fitting of Binomial distribution (for binary classification tasks) Poisson distribution (for count data) k-nearest Neighbours Predict values using k most similar records in a training dataset Set prior probabilities for data classes Also used for regression (see below) 5

DMC: Regression techniques Regression Trees Minimise sum of squares about mean robust estimate of the mean, or sample average Linear Regression Automatic selection of model variables Multi-Layer Perceptron Neural Networks Flexible non-linear models Free parameters in MLP optimised using conjugate gradients Nearest Neighbours (see above) Radial Basis Function Models function of distance from centre location to data records 6

DMC: other techniques Association rules Determine relationships between nominal data values Utility functions Random number generators Rank ordering Sorting Mean and sum of squares updates Two-way classification comparison Save and load models 7

Example application: Quality control Detection of changes in sample due to e.g. heating Use circular dichroism spectroscopy measures difference in absorbance by left & right polarized light Generates spectrum for each sample Intensity vs wavelength Some spectra look similar, others don t How to classify them? 8

Spectra display 9

Classification 36 spectra, 152 intensity values each Read into 36 x 152 matrix Passed to hierarchical cluster analysis routines Euclidean distances between data points Average link distances between clusters Output displayed as dendogram tree plot showing merging of clusters with distance Introduce a cut-off to define natural clusters 10

Analysis Cut off gives seven natural clusters not v. sensitive to distance functions Some of the results can be understood w.r.t experimental conditions e.g. 030220a1 - concentrated sample (evaporation) e.g. 030319g5 to 030330f5 - repeatable experiment But there are some outliers e.g. 030326e1, 030326e2 in normals needs consultation with domain experts 12

Example application: Classification Fisher s iris dataset 4 measurements made on 50 iris specimens from each of three species petal length, petal width, sepal length, sepal width How to classify the species? 150 data points Each point has 4 independent variables belongs in one of three classes red, green, blue How to display dataset? 13

2D scatterplots

Scatterplots? In 2D, need 6 plots to show each variable vs every other one Need to consider them all at the same time Can reduce the number of variables using principal components analysis first three explain ~95% of the variance Scatterplot in 3D 15

3D scatterplot

3D scatterplot

3D scatterplot

3D scatterplot

Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research Statistics and visualization in Excel What s the problem? 20

Example: Visual datamining Dutch financial asset management company Keen to target products at entrepreneurs How does entrepreneurship relate to other customer characteristics? Marketing dataset 25,000 customers (sampled from full customer base) Each characterised by values for 100 variables Investment history = entrepreneurship Income Age 21

Select strongly correlated variables Correlation matrix 22

Select strongly correlated variables Correlation matrix Histogram from cluster analysis Select variables of interest 23

Select strongly correlated variables Scatter plot Correlation matrix Histogram from cluster analysis Select variables of interest 24

Lessons learnt 3D correlation landscape useful Identifying significant variables Focus on data distributions Select appropriate ranges for cluster analysis Cluster visualization helpful Non-linear relationships in data revealed 3D visualization combined with direct interaction Selection of correlated variables Binning and sorting Done with IRIS Explorer 25

Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research Statistics and visualization in Excel What s the problem? 26

Simple visualization: shoe size survey Shoe size 5 6 7 8 9 10 11 12 Frequency 3 5 14 13 19 17 5 1 Visualize using Excel Chart wizard creates visualization easily Interactive control over appearance Colours, line width, text, fonts, placement Data and visualization linked together 27

Second survey (now with half-sizes) Shoe size 5 6 7 8 9 10 11 12 5.5 7.5 9.5 Frequency 3 5 14 13 19 17 5 1 2 8 8 29

What s wrong with this picture? Ordering of values on X axis reflects order in spreadsheet not numerical order Spacing on X axis doesn t reflect difference between values spacing is everywhere the same Could pay close attention to labels But might be harder for more complex data Obscures missing data 32

Another example adsorption isotherm Measurement of fluid density inside porous solid as a function of fluid pressure Confined fluid can condense before saturation Capillary condensation Vertical jump in isotherm Plot data using Scatter plot Line graph 33

Scatter plot

Line graph

Another example derivative calculation Black-Scholes modelling of options on swap agreements ( swaptions ) Option volatility as a function of Swap maturity Option expiry time Display in Excel as a surface plot 36

Irregular spacing

What s wrong with this picture? See above Irregular spacing, discontinuities This is only one part of the dataset Volatility also depends on strike value Want plots at other strike values Want to see other relationships e.g. volatility ( strike ) = volatility smile Use IRIS Explorer 38

Irregular spacing

Two types of axis in Excel charts Value Data treated as continuously varying numerical values Marker placed at location reflecting its value Used in Excel Scatter plots Category Data treated as sequence of non-numerical text labels Marker location reflects position in sequence Points distributed evenly along axis Used in Excel Bar chart, Line chart, Surface plot 42

NAG Schools Excel Add-in (N-SEA) Supports instruction in statistics Functionality for Data sampling Frequency plots Box and whisker plots Histograms Continuous bar charts Allows (X/Y) ordering of data points in plotting and inclusion of points with zero Y values 43

Original

N-SEA

Conclusions Data mining components offer basic routines Developers can incorporate them into applications No wheel-reinvention, stone canoes, chocolate teapots cf NAG numerical library Visualization is crucial for analysis Integration of data mining & visualization is applicationdependent Interactivity important Problems with (even) well-known tools Be aware Work around 47