Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications


Data Mining In Modern Astronomy Sky Surveys: Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications Ching-Wa Yip cwyip@pha.jhu.edu; Bloomberg 518

Humans are Great Pattern Recognizers. Sensors: look, smell, touch, hear. Computation: 100 billion (10^11) neurons. Estimated 300 million pattern recognizers (How to Create a Mind, Kurzweil).

Machine Learning We want computers to perform tasks. It is difficult for computers to learn like humans do. We use algorithms: Supervised, e.g.: Classification, Regression. Unsupervised, e.g.: Density Estimation, Clustering, Dimension Reduction.

From Data to Information We don't just want data. We want information from the data. (Diagram: Sensors → Database → Machine Learning → Information; not data mining by a human.)

Experts can be Biased (Thinking, Fast and Slow, Kahneman)

Applications of Unsupervised Learning Consumer Clustering, Human Network Analysis, Galaxy Grouping, and more.

Basic Concepts in Machine Learning Labeled and Unlabeled data; Datasets: training set and test set; Feature space; Distance between points; Cost Function (also called Error Function); Shape of the data distribution; Outliers.

Feature Space The raw data may not be immediately suitable for pattern recognition. The feature space is spanned by the data that result from pre-processing the raw data. The data are usually N points (or vectors), each with M variables (or components). Example: a galaxy spectrum. The raw data contain systematics (cosmic rays, sky lines); we remove the systematics and resample the spectrum so that the number of wavelength bins (M) is 4000. (Figure: a galaxy spectrum before and after pre-processing.)
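A minimal sketch of this kind of pre-processing, assuming a hypothetical raw spectrum stored in NumPy arrays; the flagged pixels and the target grid of 4000 bins are illustrative, not the exact SDSS pipeline.

```python
import numpy as np

# Hypothetical raw spectrum: wavelength grid, fluxes, and a boolean mask
# flagging pixels affected by systematics (cosmic rays, sky lines).
wavelength = np.linspace(3800.0, 9200.0, 3000)        # Angstrom
flux = np.random.default_rng(0).normal(1.0, 0.1, 3000)
bad = np.zeros(3000, dtype=bool)
bad[[500, 501, 1700]] = True                           # e.g. cosmic-ray pixels

# Remove systematics by dropping the flagged pixels ...
good_wl, good_flux = wavelength[~bad], flux[~bad]

# ... then resample onto a common grid of M = 4000 wavelength bins,
# so every object becomes a vector with the same M components.
common_grid = np.linspace(3800.0, 9200.0, 4000)
feature_vector = np.interp(common_grid, good_wl, good_flux)

print(feature_vector.shape)   # (4000,)
```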

Intuition to High-Dimensional Data: N points, M components. (Figure: a cloud of N = 10 points with M = 3 components, plotted on X, Y, Z axes in R^3; in general, N points live in R^M.)

Unlabeled vs. Labeled Data Labeled data carry an extra label compared with unlabeled data. Data labeling can be expensive (e.g., it requires a human expert).

Labeled Data (Figure: labeled points, marked X, in the (x_1, x_2) feature plane.)

Unlabeled Data (Figure: unlabeled data points in the (x_1, x_2) feature plane.)

Galaxy Images are Unlabeled Data (well, before Labeling) (SDSS images of CALIFA DR1; Husemann et al. 2012, Sanchez et al. 2012, Walcher et al. in prep.)

Galaxy Zoo Project Web users classify galaxy morphologies. Use majority vote to decide on the answer.

Hubble's Classification of Galaxy Morphologies E0, E2, S0, etc. are the labels.

Galaxy Images are Unlabeled Data (well, before Labeling): after labeling, each image carries a morphology label such as S or E. (SDSS images of CALIFA DR1; Husemann et al. 2012, Sanchez et al. 2012, Walcher et al. in prep.)

Digit Recognition Labels: 0 1 2 3 4 5 6 7 8 9

Distance between 2 Points in Space The set of data points spans a space. We can measure the distance between two points in that space. E.g., chi-square measures the sum of squared distances between a data point and a model. If two points are close to each other, they are neighbors with similar features. The best definition of distance (to yield concepts like "very close" and "far away") is data-dependent.
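As a concrete illustration (not part of the original slides), here is a minimal sketch of two such measures: the Euclidean distance between two feature vectors, and the chi-square between a data vector and a model when per-component uncertainties are available. All arrays are made-up examples.

```python
import numpy as np

# Two points (feature vectors) in an M-dimensional feature space.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 1.8, 2.6])

# Euclidean distance: small values mean the points are neighbors
# with similar features.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Chi-square between a data vector and a model vector, weighting each
# component by its measurement uncertainty sigma (illustrative values).
data = np.array([10.2, 9.8, 10.5])
model = np.array([10.0, 10.0, 10.0])
sigma = np.array([0.3, 0.3, 0.3])
chi_square = np.sum(((data - model) / sigma) ** 2)

print(euclidean, chi_square)
```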

Cost Function In many machine learning algorithms, the idea is to find the model parameters θ which minimize the cost function J(θ): J(θ) = (1/m) Σ_{i=1}^{m} [model(data_i) − data_i]², where m is the size of the dataset or the training set. That is, we want the model to be as close to the data as possible. Note that the model depends on θ.
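A sketch of minimizing a cost function of this form for a simple straight-line model, using scipy.optimize.minimize; the dataset and starting point are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy dataset: y is roughly linear in x (invented for illustration).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)

def cost(theta):
    """J(theta) = (1/m) * sum_i (model(x_i; theta) - y_i)^2 for a line."""
    slope, intercept = theta
    residual = slope * x + intercept - y
    return np.mean(residual ** 2)

# Find the parameters theta that minimize the cost function.
result = minimize(cost, x0=[0.0, 0.0])
print(result.x)   # close to [2.0, 1.0]
```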

Recall: LSQ Fitting

Recall: LSQ Fitting Chi-square is one example of a cost function.

Distance between Points is Non-trivial: S Curve (Figure: the S-curve dataset in 3D space and its 2D embedding space; Vanderplas & Connolly 2009.)

Unsupervised vs. Supervised Learning Unsupervised: given data {x_1, x_2, x_3, …, x_n}, find patterns. The description of a pattern may come in the form of a function (say, g(x)). Supervised: given data {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, find a function such that f(x) = y. The y are the labels.

Some Areas in Unsupervised Learning Density Estimation: Kernel Density Estimation, Mixture of Gaussians. Clustering: K Nearest Neighbor. Dimension Reduction: Principal Component Analysis (a linear technique), Locally Linear Embedding (a non-linear technique).

Unsupervised Learning: Find Patterns in the Unlabeled Data (Figure: unlabeled points in the (x_1, x_2) plane.)

Basis Functions Basis functions allow us to parameterize the data in handy and/or meaningful ways. The decomposition of a data point into basis functions is usually quick. Examples: Gaussian functions, kernel functions, eigenfunctions (such as eigenspectra in astronomical spectroscopy).

Density Estimation: Kernel Density Estimation Approximate a distribution by the sum of kernels (K's) of various bandwidths (h). Properties of a kernel: K is a probability density function, K(u) ≥ 0 and ∫ K(u) du = 1. K is symmetric around zero: K(−u) = K(u). (Not always) K(u) = 0 for |u| > 1.

Example Kernel Functions (Figure: example kernel shapes K(u) plotted against u, vanishing for |u| > 1; Dekking et al. 2005.)

Constructing the Kernel Density Estimate Each sample contributes a scaled kernel (1/h) K(u/h). The estimate is f(u) = (1/(nh)) Σ_{i=1}^{n} K((u − x_i)/h). Note that ∫ f(u) du = 1.
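A minimal sketch of this estimator built by hand with a Gaussian kernel; the sample and bandwidth are arbitrary, and in practice one would likely reach for scipy.stats.gaussian_kde or sklearn.neighbors.KernelDensity instead.

```python
import numpy as np

def kernel(u):
    """Gaussian kernel: symmetric probability density, integral = 1."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(u_grid, samples, h):
    """f(u) = (1/(n*h)) * sum_i K((u - x_i) / h)."""
    n = samples.size
    # Broadcast: one kernel centered on each sample, evaluated on the grid.
    return kernel((u_grid[:, None] - samples[None, :]) / h).sum(axis=1) / (n * h)

samples = np.random.default_rng(2).normal(0.0, 1.0, 200)
grid = np.linspace(-4.0, 4.0, 400)
f_hat = kde(grid, samples, h=0.3)

# Sanity check: the estimate integrates to ~1.
print(np.trapz(f_hat, grid))
```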

Luminosity Function of Nearby SDSS Galaxies (Figure: spatial volume density, increasing upward, vs. galaxy magnitude, dimmer to the right; Blanton et al. 2003.) Notice that the heights of the kernels actually vary, which is slightly different from the kernel density estimation discussed above.

Color Distribution of Nearby Galaxies (Figure: number/total number vs. color, redder to the right; Yip, Connolly, Szalay, et al. 2004.)

Principal Component Analysis Perhaps the most widely used technique to parameterize high-dimensional data. Best for linear data distributions. Many applications in both academia and industry: dimension reduction, data parameterization, classification problems, image decomposition, audio signal separation, etc.

High-Dimensional Data may lie in a Lower-Dimensional Space (or Manifold) (Figure: N points with M components in R^M, with the 1st and 2nd principal components drawn through the point cloud.) PCA finds the orthogonal directions in the data space which encapsulate the maximum sample variances. A principal component is also called an eigenvector, or eigenfunction. In this example we reduce the dimension of the problem from M to 2.

Properties of PCA PCA decomposes the data into a set of eigenfunctions. The eigenfunctions are orthogonal (perpendicular) to each other: the dot product between two eigenvectors is zero. The set of eigenfunctions maximizes the sample variance of the data. Therefore, the data can be decomposed into a handful of eigenfunctions (for a linear data distribution). For non-linear data, PCA may fail (i.e., we need many orders of eigenfunctions). In galaxy spectroscopy, the basis functions are called eigenspectra (Connolly et al. 1995).
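A sketch of PCA on an N × M data matrix via the singular value decomposition; the random data here are stand-ins for, e.g., N galaxy spectra with M wavelength bins, generated to lie near a 2-D plane.

```python
import numpy as np

# Toy data matrix: N points (rows), M components (columns), nearly 2-D.
rng = np.random.default_rng(3)
N, M = 500, 50
X = rng.normal(size=(N, 2)) @ rng.normal(size=(2, M))
X += 0.01 * rng.normal(size=(N, M))      # small noise

# Center the data, then take the SVD; rows of Vt are the principal
# components (eigenvectors), mutually orthogonal by construction.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S ** 2 / (N - 1)
print(explained_variance[:4])            # variance drops sharply after 2 components

# Dimension reduction: project onto the first 2 principal components.
coefficients = Xc @ Vt[:2].T             # shape (N, 2)
reconstruction = coefficients @ Vt[:2] + X.mean(axis=0)
```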

PCA Eigenspectra (e_{iλ}) Representation of Galaxy Spectra f_λ = Σ_i a_i e_{iλ} (i runs from 1 to the number of eigenspectra). Minimize the reconstruction error with respect to the a_i's: χ² = Σ_λ w_λ (f_λ − Σ_i a_i e_{iλ})². Get the weights for each basis function: a_i = a_i(w_λ, e_{iλ}, f_λ). (Connolly & Szalay 1999)
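A sketch of recovering the expansion weights a_i for one spectrum given a set of eigenspectra and per-pixel weights w_λ, by solving the weighted least-squares problem above; the orthonormal "eigenspectra" here are random stand-ins, not the real SDSS eigenspectra.

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 4000, 5                          # wavelength bins, number of eigenspectra kept

# Stand-in orthonormal eigenspectra e_{i,lambda} (one per row) and one spectrum
# built from them plus noise; true_a are the coefficients we hope to recover.
Q, _ = np.linalg.qr(rng.normal(size=(M, K)))
eig = Q.T                                # shape (K, M)
true_a = np.array([3.0, 1.0, 0.5, 0.2, 0.1])
flux = true_a @ eig + 0.01 * rng.normal(size=M)
w = np.ones(M)                           # per-pixel weights w_lambda, e.g. 1/sigma^2

# Minimize chi^2 = sum_lambda w_lambda (f_lambda - sum_i a_i e_{i,lambda})^2:
# a weighted linear least-squares problem with normal equations
# (E^T W E) a = E^T W f, where E[lambda, i] = e_{i,lambda}.
E = eig.T                                # design matrix, shape (M, K)
a = np.linalg.solve(E.T @ (w[:, None] * E), E.T @ (w * flux))
print(a)                                 # close to true_a
```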

Eigenspectra of Nearby Galaxies (SDSS: N = 170,000; M = 4000) (Figure: flux in erg/s/cm²/Å vs. wavelength in Å.)

PCA as a Way for Signal Separation The basis functions have physical meanings. Hence each order (also called a "mode") of the functions can be considered as a signal. In PCA, the basis functions are orthogonal: they point in different directions in the data space and are uncorrelated.

Eigenspectra of Half a Million Galaxy Spectra 2nd mode: galaxy type (steepness of the spectral slope). (Figure: the 1st and 2nd eigenspectra.)

Eigenspectra of Half a Million Galaxy Spectra 3rd mode: post-starburst activity (the stronger the absorptions, the weaker the Hα). (Figure: the 1st and 3rd eigenspectra, with the absorption features and Hα marked.)

Image Reconstruction (Figure: an original image and its reconstruction using 50 eigenfaces; Everson & Sirovich 1995.)

Matrix Representation of Data Many datasets are made up of N objects and M variables. A matrix provides a handy way to represent the data. An added advantage is that many algorithms can be expressed conveniently in matrix form.

Example: Size M × N Data Matrix for Galaxy Spectra A_ij = flux at the ith wavelength for the jth galaxy (rows run over wavelength, columns over galaxy ID):

A = [ A_11  A_12  A_13  ...  A_1N ]
    [ A_21  A_22  A_23  ...  A_2N ]
    [  ...   ...   ...        ... ]
    [ A_M1  A_M2  A_M3  ...  A_MN ]

* PCA in matrix notation: see handout.

Compressing the Object Space: CUR Matrix Decomposition In PCA, the number of components in the data vectors remains intact. CUR Matrix Decomposition provides a dramatically different way to compress big data. This approach compresses the variable space: the number of components in the data vectors decreases.

Some Wavelengths are More Informative: Leverage Score per Variable (Figure: the variables Flux(λ_1) through Flux(λ_6), illustrating sample variance and sparsity.)

CUR Matrix Decomposition (Mahoney & Drineas 2009) A ≈ C U R, where C is built from selected columns (galaxy IDs) and R from selected rows (wavelengths) of the data matrix A. CUR approximates the data matrix: min ‖A − CUR‖_F.

Frobenius Norm A matrix norm is a number representing the amplitude of a matrix. The Frobenius norm is defined as ‖A‖_F = sqrt( Σ_{i=1}^{M} Σ_{j=1}^{N} A_ij² ). In CUR Matrix Decomposition, the matrix A in question is the difference between the data matrix and its approximation. The Frobenius norm therefore measures the distance between the two matrices.
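A rough sketch of a CUR-style approximation in the spirit of Mahoney & Drineas (2009): leverage scores from the top-k singular vectors guide which columns and rows to keep, and the Frobenius norm of A − CUR measures the approximation error. This is a simplified illustration under those assumptions, not the exact algorithm from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, k = 200, 100, 5                    # rows (wavelengths), columns (galaxies), rank

# Toy, nearly low-rank data matrix A (stand-in for the galaxy-spectrum matrix).
A = rng.normal(size=(M, k)) @ rng.normal(size=(k, N)) + 0.01 * rng.normal(size=(M, N))

# Leverage scores from the top-k singular vectors: columns use the right
# singular vectors, rows use the left singular vectors.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
col_lev = np.sum(Vt[:k] ** 2, axis=0) / k        # length N, sums to 1
row_lev = np.sum(U[:, :k] ** 2, axis=1) / k      # length M, sums to 1

# Keep the columns (galaxies) and rows (wavelengths) with the highest scores.
cols = np.argsort(col_lev)[-20:]
rows = np.argsort(row_lev)[-40:]
C, R = A[:, cols], A[rows, :]

# Middle matrix chosen so that C @ U_mid @ R is as close to A as possible.
U_mid = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

# Frobenius norm of the difference measures the quality of the approximation.
error = np.linalg.norm(A - C @ U_mid @ R, ord='fro')
print(error / np.linalg.norm(A, ord='fro'))
```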

Find Important Regions in Multi-Dimensional Data: Galaxy Spectra

Discussion of Jan 7 Homework Total number of pixels in the CCD = 1024 × 1024 pixels = 1,048,576 pixels ~ 1 Mpixel. Gaia needs 10^9 / (60 × 60 × 24 × 365) years = 31.7 years = 31 years 8 months ~ 32 years.
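The same arithmetic in a few lines of Python, assuming the 10^9 figure from the homework is a number of seconds:

```python
pixels = 1024 * 1024                 # CCD size
print(pixels)                        # 1,048,576 pixels, ~1 Mpixel

seconds = 1e9                        # assumed: 10^9 seconds from the homework
years = seconds / (60 * 60 * 24 * 365)
print(years)                         # ~31.7 years, i.e. about 31 years 8 months
```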