MS1b Statistical Data Mining



MS1b Statistical Data Mining
Yee Whye Teh
Department of Statistics, Oxford
http://www.stats.ox.ac.uk/~teh/datamining.html

Outline
Administrivia and Introduction
- Course Structure
- Syllabus
- Introduction to Data Mining
Dimensionality Reduction
- Introduction
- Principal Components Analysis
- Singular Value Decomposition
- Multidimensional Scaling
- Isomap
Clustering
- Introduction
- Hierarchical Clustering
- K-means
- Vector Quantisation
- Probabilistic Methods

Course Structure
Lectures
- Wednesdays 1100-1200, Weeks 1-8.
- Thursdays 1100-1200, Weeks 1, 3, 5, 7.
Problem Sheets
- 7 problem sheets: due Mondays at noon, Weeks 2-8.
Part C students
- Practical classes: Thursdays 1100-1200, Weeks 2, 4, 6, 8.
- Problem classes: Wednesdays, time to be decided, Weeks 2-8.
MSc students
- Miniproject: over Easter break.

Syllabus I
Part I: Dimensionality Reduction
- Principal Components Analysis
- Multidimensional Scaling
- Isomap
Part II: Clustering
- Hierarchical Clustering
- K-means
- Vector Quantisation
- Mixture Models
- Probabilistic Latent Variable Models and EM algorithm
Part III: Classification and Regression
- Empirical Risk Minimization
- Nearest Neighbours, Prototype Based Methods
- Classification and Regression Trees
- Linear Regression

Syllabus II
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Naive Bayes
- Bayesian Methods
- Logistic Regression
- Neural Networks
Part IV: Ensemble Methods
- Bootstrap, Bagging
- Random Forests
- Boosting
R
- Learning how to use R for Data Mining

What is Data Mining?
Traditional Problems in Applied Statistics
- Well-formulated question that we would like to answer.
- Expensive to gather data and/or expensive to do computation.
- Create specially designed experiments to collect high-quality data.
Current Situation: Information Revolution
- Improvements in data storage devices (both larger and cheaper).
- Powerful data capturing devices (bioassays, microphones, cameras, satellites).
- Lots of data with potentially valuable information available.
Big Data...

What is Data Mining?
To gain insight from data.
- Often working with huge datasets.
- Typically many variables (up to thousands or millions).
- Often, but not always, many observations (dozens to millions).
- Secondary data sources, possibly collected for other purposes.
- Uncurated data, missing data, unstructured data, multi-aspect data.
- Gain understanding without specific goals.

Applications of Data Mining
Pattern Recognition
- Sorting Cheques
- Reading License Plates
- Sorting Envelopes
- Eye/Face/Fingerprint Recognition

Applications of Data Mining
Business applications
- Help companies intelligently find information
- Credit scoring
- Predict which products people are going to buy
- Recommender systems
- Autonomous trading
Scientific applications
- Predict cancer occurrence/type and health of patients/personalized health
- Make sense of complex physical, biological, ecological, sociological models
...it is just a nice name for multivariate statistics (minus model checking).

NY Times: Data Mining in Walmart (URL)

NY Times: Career in Statistics (URL)

NY Times: R (URL)

Types of Data Mining
Unsupervised Learning
Unclassified data from which we would like to uncover hidden structure or groupings.
- Given detailed phone usage from many people, find interesting groups of people with similar behaviour.
- Shopping habits for people using loyalty cards: find groups of similar shoppers.
- Given expression measurements of 1000s of genes for 100s of patients, find groups of functionally similar genes.
Goal: Hypothesis generation, visualization.
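
As a rough illustration of the unsupervised setting, the sketch below groups a handful of phone users with K-means (listed in the syllabus under Clustering). It is only a minimal sketch: the course itself uses R, whereas this uses plain Python, and the data, the `kmeans` helper, and the choice of two starting centres are all assumptions made for the example.

```python
import math

def kmeans(points, centres, n_iter=10):
    """Plain K-means (Lloyd's algorithm) on 2-D points; `centres` are initial guesses."""
    for _ in range(n_iter):
        # Assignment step: each point joins the group of its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group leaves its centre where it was).
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[k]
            for k, g in enumerate(groups)
        ]
    return centres, groups

# Hypothetical data: (calls per day, data usage in GB) for six phone users.
usage = [(1.0, 0.5), (1.5, 0.4), (1.2, 0.6), (8.0, 5.0), (9.0, 5.5), (8.5, 4.8)]

# No labels are supplied: the algorithm only sees the measurements.
centres, groups = kmeans(usage, centres=[usage[0], usage[3]])
print(groups)
```

No labels appear anywhere in the input; the two groups (light and heavy users in this made-up data) come out of the measurements alone, which is what makes the result a hypothesis to investigate and not a prediction.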

Types of Data Mining
Supervised Learning
A database of classified examples with predefined groupings.
- Given detailed phone usage of many users along with their historic churn, predict when/if people are going to change contracts again.
- Given expression measurements of 1000s of genes for 100s of patients along with a binary variable indicating absence or presence of a specific cancer, predict if the cancer is present for a new patient.
- Given expression measurements of 1000s of genes for 100s of patients along with survival length, predict survival time.
Goal: Prediction.
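
For contrast with the unsupervised case, here is a minimal sketch of supervised prediction using nearest neighbours (listed in the syllabus under Classification and Regression). Again this is plain Python and not the course's R, and the two-measurement "expression" data, the labels, and the `knn_predict` helper are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training examples."""
    # Rank training examples by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    # Count the labels of the k closest examples and return the most frequent one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled data: two gene-expression measurements per patient,
# with label 1 if the cancer is present and 0 otherwise.
expression = [(0.2, 1.1), (0.4, 0.9), (0.3, 1.3), (2.1, 3.0), (2.4, 2.8), (1.9, 3.2)]
cancer = [0, 0, 0, 1, 1, 1]

new_patient = (2.0, 2.9)
print(knn_predict(expression, cancer, new_patient))
```

The difference from clustering is the `cancer` list: every training example carries a known outcome, and the task is to carry that outcome over to a new, unlabelled patient.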

Further Readings
- Leo Breiman: Statistical Modeling: The Two Cultures (URL)
- NY Times: Big Data's Impact in the World (URL)
- Economist: Data, Data Everywhere (URL)
- McKinsey: Big data: The Next Frontier for Competition (URL)
Other recent news on Big Data, Data Mining, Machine Learning:
- New York Times: Sure, Big Data Is Great. But So Is Intuition (URL)
- New York Times: How Many Computers to Identify a Cat? 16,000 (URL)
- New York Times: Scientists See Promise in Deep-Learning Programs (URL)
- New Yorker: Is Deep Learning a Revolution in Artificial Intelligence? (URL)