Data, Measurements, Features



Similar documents
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Statistics Graduate Courses

Learning outcomes. Knowledge and understanding. Competence and skills

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Data Mining Part 5. Prediction

Unsupervised and supervised dimension reduction: Algorithms and connections

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

The Scientific Data Mining Process

Statistical Models in Data Mining

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Principles of Data Mining by Hand&Mannila&Smyth

Learning is a very general term denoting the way in which agents:

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Why do statisticians "hate" us?

Data Mining - Evaluation of Classifiers

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

ANALYTICS IN BIG DATA ERA

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Mining and Machine Learning in Bioinformatics

Introduction to Data Mining

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Data Mining: Overview. What is Data Mining?

ADVANCED MACHINE LEARNING. Introduction

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Social Media Mining. Data Mining Essentials

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

II. DISTRIBUTIONS distribution normal distribution. standard scores

Supervised and unsupervised learning - 1

Introduction to Pattern Recognition

How To Understand The Theory Of Probability

Customer Classification And Prediction Based On Data Mining Technique

MA2823: Foundations of Machine Learning

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

A Review of Data Mining Techniques

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Data Mining: An Overview. David Madigan

Statistical Challenges with Big Data in Management Science

Statistics for BIG data

Information Management course

Introduction. A. Bellaachia Page: 1

How To Cluster

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

Supervised Feature Selection & Unsupervised Dimensionality Reduction

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])

Final Project Report

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering & Visualization

: Introduction to Machine Learning Dr. Rita Osadchy

Unsupervised Data Mining (Clustering)

Introduction to Data Mining

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Department of Economics

Steven C.H. Hoi School of Information Systems Singapore Management University

Exploratory Data Analysis

Additional sources Compilation of sources:

Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/ / 34

Predict the Popularity of YouTube Videos Using Early View Data

OUTLIER ANALYSIS. Data Mining 1

Machine Learning Introduction

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Azure Machine Learning, SQL Data Mining and R

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

An Overview of Knowledge Discovery Database and Data mining Techniques

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Introduction to Machine Learning Using Python. Vikram Kamath

A Statistical Text Mining Method for Patent Analysis

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Bayesian networks - Time-series models - Apache Spark & Scala

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Data Mining for Business Analytics

Fairfield Public Schools

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Machine Learning for Data Science (CS4786) Lecture 1

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

CS Introduction to Data Mining Instructor: Abdullah Mueen

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Environmental Remote Sensing GEOG 2021

MACHINE LEARNING IN HIGH ENERGY PHYSICS

E3: PROBABILITY AND STATISTICS lecture notes

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Multivariate Normal Distribution

Tutorial for proteome data analysis using the Perseus software platform

Transcription:

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay

What do you think of when someone says Data? We might abstract the idea that data are information not yet in the form we want it, and therefore needing nontrivial processing. Moreover, the information is incomplete, through errors or lack of some measurements, so probable reconstructions of the incomplete parts are also desired.

In general, data consists of propositions that reflect reality. A large class of practically important propositions are measurements or observations of a variable. Such propositions may comprise numbers, words, or images.

Features Objects: physical entities (images, patients, clients, molecules, cars, signal samples, software pieces), or states of physical entities (board states, patient states etc). Features: measurements or evaluation of some object properties. Example: Are pixel intensities good features? No - not invariant to translation/scaling/rotation. Better: type of connections, type of lines, number of lines, etc. Selecting good features, transforming raw measurements that are collected is very important.

Measurements are no Features 16 x 16 = 256 measurements describe the object Number of features (e.g. moments, endpoints, strokes, holes) may be much smaller. They represent the object. (Duin)

Data representation Traditional algorithms work on vectors. Images can be represented as matrices or vectors. Abstract data Graphs Sequences 3D structures

Possible Object Representations Duin etc.

Feature space representation Representation: mapping objects into vectors, {O i } => X(O i ), with X j (O i ) being j-th attribute of object O i attribute and feature are used as synonyms Types of features. Categorical: symbolic or discrete may be nominal (unordered), like sweet, salty, sour, or ordinal (can be ordered), like colors or small < medium < large (drink). Continuous: numerical values. Vector X =(x 1,x 2,x 3... x p ), or a p-dimensional point in the feature space. x 3 x x 2 x 1

Features representing the same entity are combined in the feature vector v = [v 1 v 2 v p ] T p very important since it determines the computational effort has a strong impact on the requirements on the learning set size (also statistical significance)

v points to one certain point in the p- dimensional space V Every point in measurement space V corresponds to one of the possible constellations of the data

Different type of features may be arbitrarily mixed. Ordering scheme of the components of v is arbitrary but must be fixed

Which features to choose -> design decision It should be decided by human insight into the field of application Features should have the potential of giving hints as to which class the observed event belongs How to extract powerful (?) feature sets from larger sets of feature candidates? Discriminative power of a feature set can be improved by properly chosen transformations, e.g. Normalization Remark measurements are just a set of samples: formed as a result of random selection of some representatives of the set

Methods

Statistics Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. Patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics can be considered part of applied statistics. There is also a discipline of mathematical statistics, which is concerned with the theoretical basis of the subject. The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data, as in employment statistics, accident statistics, etc.

Statistics In applying statistics to a problem, one begins with a process or population to be studied. This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period. It may instead be a process observed at various times; data collected about this kind of "population" constitute what is called a time series. For practical reasons, rather than compiling data about an entire population, one usually instead studies a chosen subset of the population, called a sample. Data are collected about the sample in an observational or experimental setting. The data are then subjected to statistical analysis, which serves two related purposes: description and inference.

Statistics Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs. Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), forecasting of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.

Univariate multivariate

Summary of a multivariate data set Summaries for each of the variables separately Summaries for the relationships between (pair of) variables

Mean Variance Summaries for each of the variables separately

Summaries for the relationships between (pair of) variables Variance Covariance Correlation

Analysis of Data Exploratory Analysis exploration: attempts to recognize any non-random pattern or structure mining: generates possible interesting hypotheses for further study Confirmatory Analysis After well-defined hypothesis in mind Some type of (well-known) significance test

Search for structure or pattern in the data If pattern arises from the fact that we have measurements on similar group of subjects unsupervised pattern recognition or unsupervised learning But What are these groups? How many groups are there? Which subject belongs to which group?

Other motivations Find latent variables Supervised learning Regression But, as usual no systematic methodology

What is machine learning? Machine learning is the study of computer systems that improve their performance through experience. Learn existing and known structures and rules. Discover new findings and structures. Face recognition Bioinformatics Supervised learning vs. unsupervised learning Semi-supervised learning

Duin etc.

Compactness Hypothesis

Machine learning applications Bioinformatics: Hugh amount of biological data from the human genome project and human proteomics initiative. Goal: Understanding of biological systems at the molecular level from diverse sources of biological data. Challenge: Scalability, multiple sources, abstract data. Applications: Microarray data analysis, Protein classification, Mass spectrometry data analysis, Protein-protein interaction. Others: Computer vision, information retrieval, image processing, text mining, web mining, etc.

Supervised vs. unsupervised learning Although unsupervised learning methods may appear to have limited capabilities, there are several reasons that make them extremely useful Labeling large data sets can be a costly procedure (i.e., speech recognition) Class labels may not be known beforehand (i.e., data mining) Large datasets can be compressed by finding a small set of prototypes (knn) The supervised and unsupervised paradigms comprise the vast majority of pattern recognition problems A third approach, known as reinforcement learning, uses a reward signal (realvalued or binary) to tell the learning system how well it is performing In reinforcement learning, the goal of the learning system (or agent) is to learn a mapping from states onto actions (an action policy) that maximizes the total reward

Curse of dimensionality: The problem occurs when searching in or estimating density on high-dimensional spaces. 1. Computation: The complexity grows exponentially with the dimension, rapidly outstripping the computational and memory storage capabilities of computers. 2. Estimation: The problem of estimating a density function on a high-dimensional space may be seen as determining the density at each cell in a multidimensional grid. Given a fixed number of K grid lines per dimension, the number of independent cells grows as KP where P is the dimension.

Curse of dimensionality Large sample size is required for high-dimensional data. Query accuracy and efficiency degrade rapidly as the dimension increases. Strategies Feature reduction Feature selection Manifold learning Kernel learning