Introduction to data mining



Similar documents
Introduction to predictive modeling and data mining

IN THE CITY OF NEW YORK Decision Risk and Operations. Advanced Business Analytics Fall 2015

COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK DEPARTMENT OF INDUSTRIAL ENGINEERING AND OPERATIONS RESEARCH

Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015

Statistics W4240: Data Mining Columbia University Spring, 2014

Data Mining Carnegie Mellon University Mini 2, Fall Syllabus

: Introduction to Machine Learning Dr. Rita Osadchy

Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

BA 561: Database Design and Applications Acct 565: Advanced Accounting Information Systems Syllabus Spring 2015

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

Business Analytics Syllabus

DATA MINING FOR BUSINESS ANALYTICS

Spring 2013 CS 6930 Advanced Topics in Web Security and Privacy - 3 Credit Hours Syllabus and Course Policies

CS Data Science and Visualization Spring 2016

INTRODUCTION to PSYCHOLOGY INTRODUCTORY STATISTICS A Presidential Scholars Interdisciplinary Seminar

ECE 156A - Syllabus. Lecture 0 ECE 156A 1

Office: LSK 5045 Begin subject: [ISOM3360]...

QMB Business Analytics CRN Fall 2015 T & R 9.30 to AM -- Lutgert Hall 2209

Preliminary Syllabus for the course of Data Science for Business Analytics

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

Google Apps for Education (GAFE) Basics

Course Description This course will change the way you think about data and its role in business.

MATH 205 STATISTICAL METHODS

BUAD : Corporate Finance - Spring

Data Science Certificate General Information About Completion

Faculty of Science School of Mathematics and Statistics

BUSA 501: Introduction to Business Analytics

Brown University Department of Economics Spring 2015 ECON 1620-S01 Introduction to Econometrics Course Syllabus

FREEMAN SCHOOL OF BUSINESS. FINE International Finance Spring 2012

PSYCHOLOGY OF PERSONALITY

AMIS 7640 Data Mining for Business Intelligence

Cognitive Science. Summer 2013

QMB Business Analytics CRN Fall 2015 W 6:30-9:15 PM -- Lutgert Hall 2209

Shaw University Online Courses FAQ

CSE 5392 Sensor Network Security

COURSE SYLLABUS. ESE 544/444 Project Management

LSC 740 Database Management Syllabus. Description

Financial Optimization ISE 347/447. Preliminaries. Dr. Ted Ralphs

Introduction to Psychology

BUSINESS INTELLIGENCE WITH DATA MINING FALL 2012 PROFESSOR MAYTAL SAAR-TSECHANSKY

1. Strengthen your knowledge of the biological principals governing the nervous system.

ECON 351: Microeconomics for Business

ECE 3200 Electronics I First Summer Session 2015 (Online) Syllabus

Introduction to Psychology 2101

CSci 538 Articial Intelligence (Machine Learning and Data Analysis)

MGT 5309 FALL 07 LOGISTICS AND SUPPLY CHAIN MANAGEMENT SYLLABUS

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

MyMathLab User Guide

Blackboard Learning System: Student Instructional Guide

BIO 226: APPLIED LONGITUDINAL ANALYSIS COURSE SYLLABUS. Spring 2015

Online Basic Statistics

Information Systems and Technology in Healthcare

Introduction to Data Science: CptS Syllabus First Offering: Fall 2015

Instructional Development Unit Quality and Efficiency

Machine Learning Capacity and Performance Analysis and R

If you need support with Blackboard, your college , and/or. MySite, go to these sites:

BIO 315 Human Genetics - Online

IT 201 Information Design Techniques

ENVS 298: Environmental Accountability and Data Visualization

CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS

or simply Google John Penn WVU and take the top hit. Useful Websites to Help the Organic Chemistry Class

AMS 5 Statistics. Instructor: Bruno Mendes mendes@ams.ucsc.edu, Office 141 Baskin Engineering. July 11, 2008

Computer Engineering ECSE-322B

Getting Started with WebCT

CS 5890: Introduction to Data Science Syllabus, Utah State University, Fall

MARK 7377 Customer Relationship Management / Database Marketing. Spring Last Updated: Jan 12, Rex Yuxing Du

MATH 205 STATISTICAL METHODS

Semester/Year: Spring, 2016

Categorical Data Analysis

2015 Workshops for Professors

Decision Sciences Department Business Analytics Program. Decision Sciences 6290: Introduction to Business Analytics (1.

CSE 517A MACHINE LEARNING INTRODUCTION

MATHEMATICS 152, FALL 2004 METHODS OF DISCRETE MATHEMATICS

Applied mathematics and mathematical statistics

L&I SCI 410: Database Information Retrieval Systems

Probability Concepts Probability Distributions. Margaret Priest Nokuthaba Sibanda

ANALYTICS CENTER LEARNING PROGRAM

Course Syllabus. Purposes of Course:

Scope of this Course. Database System Environment. CSC 440 Database Management Systems Section 1

Math 35 Section Spring Class meetings: 6 Saturdays 9:00AM-11:30AM (on the following dates: 2/22, 3/8, 3/29, 5/3, 5/24, 6/7)

Project Management Syllabus

BUAD 310 Applied Business Statistics. Syllabus Fall 2013

Finance 3503 Corporate Finance II Spring 2012

What Makes a Website Credible?

Online Class Orientation

MAC 2233, STA 2023, and junior standing

How To Build A Predictive Model In Insurance

ENTC 219 Digital Electronics Fall 2015 TR 11:10 12:25 Thompson 121

Transcription:

Introduction to data mining Ryan Tibshirani Data Mining: 36-462/36-662 January 15 2013 1

Logistics Course website (syllabus, lectures slides, homeworks, etc.): http://www.stat.cmu.edu/~ryantibs/datamining We will use blackboard site for email list, grades 6 homeworks, top 5 will count 2 in-class exams, final project Final project in groups of 2-3, will be fun! 2

Prerequisites: Only formal one is 36-401 Assuming you know basic probability and statistics, linear algebra, R programming (see syllabus for topics list) Textbooks: Course textbook Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. Get it online at http://www-bcf.usc.edu/~gareth/isl, use the login name: StatLearn, password: book. Not yet published, do not distribute! Optional, more advanced textbook: Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Also available online at http://www-stat.stanford.edu/elemstatlearn 3

Course staff: Instructor: Ryan Tibshirani (you can call me Ryan or Professor Tibshirani, please not Professor ) TAs: Li Liu, Cong Lu, Jack Rae, Michael Vespe Why are you here? Because you love the subject, because it s required, because you eventually want to make $$$... No matter the reason, everyone can get something out of the course Work hard and have fun! 4

What is data mining? Data mining is the science of discovering structure and making predictions in (large) data sets Unsupervised learning: discovering structure E.g., given measurements X 1,... X n, learn some underlying group structure based on similarity Supervised learning: making predictions I.e., given measurements (X 1, Y 1 ),... (X n, Y n ), learn a model to predict Y i from X i 5

Google Search Ads Gmail Chrome 6

Facebook People you may know 7

Netflix $1M prize! 8

eharmony Falling in love with statistics 9

10 FICO An algorithm that could cause a lot of grief

11 FlightCaster Apparently it s even used by airlines themselves

12 IBM s Watson A combination of many things, including data mining

13 Handwritten postal codes (From ESL p. 404) We could have robot mailmen someday

14 Subtypes of breast cancer Subtypes of breastcancer based on wound response

Predicting Alzheimer s disease (From Raji et al. (2009), Age, Alzheimer s disease, and brain structure ) Can we predict Alzheimer s disease years in advance? 15

16 Banff 2010 challenge Find the Higgs boson particle and win a Nobel prize! Will it be found by a statistician?

17 What to expect Expect both applied and theoretical perspectives... not just a course where we open up R and go from there, we also rigorously investigate the topics we learn Why? Because success in data mining comes from a synergy between practice and theory You can t always open up R, download a package, and get a reasonable answer Real data is messy and always presents new complications Understanding why and how things work is a necessary precursor to figuring out what to do

18 Reoccuring themes Exact approach versus approximation: often when we can t do something exactly, we ll settle for an approximation. Can perform well, and scales well computationally to work for large problems Bias-variance tradeoff: nearly every modeling decision is a tradeoff between bias and variance. Higher model complexity means lower bias and higher variance Interpretability versus predictive performance: there is also usually a tradeoff between a model that is interpretable and one that predicts well under general circumstances

19 There s not a universal recipe book Unfortunately, there s no universal recipe book for when and in what situations you should apply certain data mining methods Statistics doesn t work like that. Sometimes there s a clear approach; sometimes there is a good amount of uncertainty in what route should be taken. That s what makes it so hard, and so fun This is true even at the expert level (and there are even larger philosophical disagreements spanning whole classes of problems) The best you can do is try to understand the problem, understand the proposed methods and what assumptions they are making, and find some way to evaluate their performances

20 Next time: information retrieval but cool dude party michelangelo raphael rude... doc 1 19 0 0 0 4 24 0 doc 2 8 1 0 0 7 45 1 doc 3 7 0 4 3 77 23 0 doc 4 2 0 0 0 4 11 0 doc 5 17 0 0 0 9 6 0 doc 6 36 0 0 0 17 101 0 doc 7 10 0 0 0 159 2 0 doc 8 2 0 0 0 0 0 0 query 1 1 1 1 1 1 1