ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION




ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION CSE 537 Artificial Intelligence Professor Anita Wasilewska GROUP 2 TEAM MEMBERS: SAEED BOOR BOOR - 110564337 SHIH-YU TSAI - 110385129 HAN LI - 110168054

SOURCES Slides 9-10: Visualization and Visual Analytics, Fundamental Tasks, Klaus Muller, SBU Slides 13-17: Clustering, Eamonn Keogh, UC Riverside Slides 19-26: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9 Slides 28-30: Visualization and Visual Analytics, Fundamental Tasks, Klaus Muller, SBU Extra Sources: Lecture Notes by Anita Wasilewska, Chapter 2: Preprocessing, Chapter 6: Classification http://bl.ocks.org/ for types of visualizations https://www.youtube.com/watch?v=_awzggnrcic for the K-Means algorithm

STEPS (PRESENTATION OUTLINE) Data Reduction Types of Visualizations Data Mining Techniques Data Preparation (covered in class)

PROLOGUE Interaction Definition of Visual Analytics: Data Science Data Mining Data Analytics Visualization

PROLOGUE What: Analytical Techniques for Data Visualization Why: so much data => visualize it (make it perceivable) How: AI or machine learning methods to read, sample, and cluster K-means clustering algorithm / E-M algorithm / nearest neighbor Reducing dimensions down to 2D: Principal Component Analysis (PCA) Types of aesthetic visualizations

DATA CLEANING fill in missing values smooth noisy data identify or remove outliers resolve inconsistencies

MISSING VALUES Data is not always available e.g., many tuples have no recorded value for several attributes, such as customer income in sales data Ways to solve: Ignore the tuple Fill in manually Use a global constant Attribute mean Most probable value
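The attribute-mean strategy above can be sketched in a few lines of Python (a minimal illustration; the function name and the toy sales records are hypothetical, not from the slides):

```python
from statistics import mean

def fill_missing_with_mean(records, attribute):
    """Replace missing (None) values of one attribute with the attribute mean,
    one of the strategies listed on the slide. `records` is a list of dicts."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    fill = mean(observed)                      # mean of the recorded values
    for r in records:
        if r[attribute] is None:
            r[attribute] = fill
    return records

# Hypothetical sales data where customer income is sometimes unrecorded.
sales = [{"income": 40000}, {"income": None}, {"income": 60000}]
fill_missing_with_mean(sales, "income")
# The missing row now holds the mean of 40000 and 60000, i.e. 50000.
```

In practice the most-probable-value strategy (e.g., predicting the attribute from the other attributes) usually distorts the data less than a single global mean.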

NOISY DATA Why? faulty data collection instruments data entry problems data transmission problems technology limitations inconsistency in naming conventions What to do? Binning method Clustering Regression Semi-automatic: computer algorithm + human input

DATA MINING TECHNIQUES ANALYZE DATA CLASSIFICATION -> REGRESSION -> CLUSTERING -> SIMILARITY MATCHING -> LINK PREDICTION

TASK #1: CLASSIFICATION (COVERED IN CLASS) Predict which class a member of a certain population belongs to absolute or probabilistic Requires a classification model Supervised learning Unsupervised learning Scoring with a model: each population member gets a score for a particular class/category sort each class's or member's scores to assign them scoring and classification are related

TASK #2: REGRESSION Regression = value estimation Fit the data to a function often linear, but it does not have to be quality of fit is decisive Regression vs. classification: classification predicts that something will happen regression predicts how much of it will happen

TASK #3: CLUSTERING Group individuals in a population together by their similarity Cluster methods: K-means algorithm E-M algorithm

K-MEANS ALGORITHM Input: K, set of points x_1 ... x_n Place centroids c_1 ... c_K at random positions Repeat the following procedure until it converges: For each x_i: find the nearest centroid c_j and assign x_i to cluster j For each cluster j: the new c_j is the mean of all points x_i assigned to cluster j in the previous step Source: https://www.youtube.com/watch?v=_awzggnrcic
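The pseudocode above translates almost line for line into Python. This is a minimal sketch assuming 2-D points and squared Euclidean distance; the toy data set is hypothetical:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means following the slide: random centroids, then alternate
    (1) assign each point to its nearest centroid and
    (2) move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # place c_1..c_K at random positions
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                        # assignment step
            j = min(range(k), key=lambda c: (x[0] - centroids[c][0]) ** 2
                                            + (x[1] - centroids[c][1]) ** 2)
            clusters[j].append(x)
        for j, members in enumerate(clusters):  # update step
            if members:                         # keep old centroid if cluster is empty
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, clusters

# Two obvious 2-D clumps (hypothetical toy data).
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, k=2)
```

A fixed iteration count stands in for a real convergence test; stopping when assignments no longer change is the usual refinement.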

K-MEANS CLUSTERING: STEPS 1-5 Algorithm: K-means, K = 3, Distance Metric: Euclidean Distance [Five figure slides: 2-D scatter plots on 0-5 axes showing centroids C1, C2, C3 being reassigned and moved until the clusters converge]

K-MEANS - COMMENTS Strength Relatively efficient: O(tKn), where t is the number of iterations. Normally, K, t << n. Often terminates at a local optimum. Weakness Applicable only when a mean is defined; what about categorical data? Need to specify K, the number of clusters, in advance Unable to handle noisy data and outliers Not suitable for discovering clusters with non-convex shapes

E-M ALGORITHMS Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9 Sometimes K-means is not intuitive Horrible result from K-means

EM APPROACH -- MIXTURE MODEL A weighted sum of a number of pdfs, where the weights are determined by a distribution Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9
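Written out, the weighted sum the slide describes is (following Bishop, Chapter 9, with mixing weights $\pi_k$):

```latex
p(x) \;=\; \sum_{k=1}^{K} \pi_k \, p_k(x),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0
```

For the Gaussian mixture used on the following slides, each component density is $p_k(x) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$.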

EM APPROACH -- ADD HIDDEN VARIABLES Suppose we are told that the mixture model is... Observed data Hidden distribution Each point is, for example, a mixture of Gaussians Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9

EM APPROACH -- ADD HIDDEN VARIABLES Suppose we are told that the mixture model is... Log-likelihood of the whole data set, n points Target: use MLE to get the optimal solution But this is hard to calculate: it is neither concave nor convex Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9

WE KNOW: WE HAVE: JENSEN'S INEQUALITY When the ratio is constant, the equality holds and we get the bound Then optimize the bound: that is what we can do easily, take the derivative and set it equal to 0 Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9

TOO ABSTRACT... Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9

E-M ALGORITHMS Initialize K cluster centers Iterate between two steps Expectation step: assign points to clusters (Bayes): P(d_i ∈ c_k) = w_k Pr(d_i | c_k) / Σ_j w_j Pr(d_i | c_j), with w_k = Σ_i Pr(d_i | c_k) / N = probability of class c_k Maximization step: estimate model parameters (optimization): μ_k = (1/m) Σ_{i=1}^{m} d_i P(d_i ∈ c_k) / Σ_j P(d_i ∈ c_j), where P(d_i ∈ c_j) is the probability that d_i is in class c_j
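The two alternating steps above can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration, not the slide's exact derivation; the initialisation scheme and the sample data are hypothetical:

```python
import math

def em_gaussian_mixture(data, k=2, iters=50):
    """1-D Gaussian-mixture EM mirroring the slide:
    E-step computes P(c_k | d_i) by Bayes' rule;
    M-step re-estimates weights, means, and variances from those responsibilities."""
    n = len(data)
    # Crude initialisation: spread means over the data range, unit variance, equal weights.
    mu = [min(data) + (j + 0.5) * (max(data) - min(data)) / k for j in range(k)]
    var = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities r[i][j] = P(c_j | d_i)
        r = []
        for x in data:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(p)
            r.append([pj / s for pj in p])
        # M-step: update weight, mean, and variance of each component
        for j in range(k):
            nj = sum(r[i][j] for i in range(n))
            w[j] = nj / n
            mu[j] = sum(r[i][j] * data[i] for i in range(n)) / nj
            var[j] = max(sum(r[i][j] * (data[i] - mu[j]) ** 2 for i in range(n)) / nj, 1e-6)
    return w, mu, var

# Hypothetical sample: two clumps around 0 and 10.
sample = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
w, mu, var = em_gaussian_mixture(sample, k=2)
```

Unlike K-means, each point here belongs to every cluster with some probability, which is exactly what rescues the "horrible" K-means cases shown earlier.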

EM APPROACH -- ILLUSTRATION Source: Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 9

TASK #4: SIMILARITY MATCHING Picture from Klaus Muller

SIMILARITY MEASURE - DISTANCES Cosine Similarity

SIMILARITY MEASURES Jaccard Distance Picture from Klaus Muller
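The two measures on these slides are short enough to write out directly (a minimal sketch; the example vectors and sets are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_distance(s, t):
    """1 - |intersection| / |union| for two sets."""
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)

cosine_similarity([1, 0], [1, 1])         # 1/sqrt(2), about 0.707
jaccard_distance({"a", "b"}, {"b", "c"})  # 1 - 1/3, about 0.667
```

Cosine similarity compares direction and ignores magnitude, which is why it suits high-dimensional feature vectors; Jaccard distance compares set membership, which suits binary or tag-like attributes.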

TASK #5: LINK PREDICTION Predict connections between data items usually works within a graph predict missing links estimate link strength Applications in recommendation systems friend suggestion in Facebook (social graph) link suggestion in LinkedIn (professional graph) movie suggestion in Netflix (bipartite graph: people - movies)

DATA REDUCTION

DATA REDUCTION Why? Tuples are multi-dimensional Humans can see only 3 dimensions on the screen How? PCA - Principal Component Analysis

PCA - PRINCIPAL COMPONENT ANALYSIS Ultimate goal: find a coordinate system that can represent the variance in the data with as few axes as possible Steps: first characterize the distribution by the covariance matrix Cov or the correlation matrix Corr, let's call it C perform QR factorization or LU decomposition to get the eigenvectors of C now order the eigenvectors in terms of their eigenvalues drop the axes that have the least variance
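The steps above can be sketched with NumPy. This minimal version computes the eigenvectors via `numpy.linalg.eigh` (suited to the symmetric matrix C) rather than an explicit QR/LU factorization; the 3-D example data is hypothetical:

```python
import numpy as np

def pca_reduce(X, d=2):
    """PCA per the slide: covariance matrix C, eigendecomposition,
    keep the d eigenvectors with the largest eigenvalues, drop the rest."""
    Xc = X - X.mean(axis=0)               # center the data
    C = np.cov(Xc, rowvar=False)          # covariance matrix (one column per attribute)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort axes by decreasing variance
    W = eigvecs[:, order[:d]]             # top-d principal axes
    return Xc @ W                         # project onto them

# Hypothetical 3-D data that really varies along one direction.
X = np.array([[0, 0, 0], [1, 1, 0.1], [2, 2, 0.0], [3, 3, 0.1]], dtype=float)
Y = pca_reduce(X, d=2)   # 4 x 2: most of the variance lands in the first column
```

With d = 2 this is exactly the reduction to a screen-plottable projection that motivates the visualizations on the next slides.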

TYPES OF VISUALIZATION AESTHETIC, CLEAR, AND UNDERSTANDABLE GRAPHICS FOR THE REPRESENTATION OF DATA

COLLAPSIBLE TREE A standard tree, but one that is scalable to large hierarchies http://mbostock.github.io/d3/talk/20111018/tree.html

ZOOMABLE PARTITION LAYOUT A tree that is scalable and has partial partition of unity http://mbostock.github.io/d3/talk/20111018/partition.html

COLLAPSIBLE FORCE LAYOUT Relationships within organization members expressed as distance and proximity http://mbostock.github.io/d3/talk/20111116/force-collapsible.html

HIERARCHICAL EDGE BUNDLING Relationships of individual group members, also in terms of quantitative measures such as information flow http://mbostock.github.io/d3/talk/20111116/bundle.html

CIRCLE MAPPING Quantities and containment, but not partition of unity http://mbostock.github.io/d3/talk/20111116/pack-hierarchy.html

DEMO

THANK YOU! QUESTIONS AND ANSWERS