Introduction to predictive modeling and data mining




Rebecca C. Steorts
Predictive Modeling and Data Mining: STA 521
August 25, 2015

Today's Menu
1. Brief history of data science (from slides of Bin Yu).
2. Motivation of this course.
3. What is predictive modeling and data science?
4. The boring bits (but they're important).

Data science [Bin Yu, IMS Presidential Address, 2014]


More about data science and how it's relevant today...

Data science, today [Credit: Jenny Bryan]


So what is the class all about?


What is data mining?
Data mining is the science of discovering structure and making predictions in (large) data sets.
- Unsupervised learning: discovering structure. E.g., given measurements $X_1, \ldots, X_n$, learn some underlying group structure based on similarity.
- Supervised learning: making predictions. I.e., given measurements $(X_1, Y_1), \ldots, (X_n, Y_n)$, learn a model to predict $Y_i$ from $X_i$.
Note: Hidden underneath is the idea of prediction (which we will get into). As we have discussed, terms like "data science" and "data mining" are just sexier labels.
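As a rough illustration of the two settings described above, the sketch below contrasts them on made-up data. It is written in Python with only the standard library (the course itself uses R). The methods are stand-ins chosen for brevity, not methods named in these slides: a two-center k-means for the unsupervised case and a least-squares line for the supervised case. The function names `two_means_1d`, `fit_line`, and `simulate`, and all the simulated numbers, are hypothetical.

```python
import random
import statistics


def two_means_1d(xs, iters=20):
    """Unsupervised: group 1-D measurements by similarity (k-means with k = 2)."""
    c_lo, c_hi = min(xs), max(xs)  # start the two centers at the extremes
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the centers.
        lo = [x for x in xs if abs(x - c_lo) <= abs(x - c_hi)]
        hi = [x for x in xs if abs(x - c_lo) > abs(x - c_hi)]
        c_lo, c_hi = statistics.mean(lo), statistics.mean(hi)
    return c_lo, c_hi


def fit_line(xs, ys):
    """Supervised: least-squares fit of y = a + b * x from labeled pairs."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def simulate(seed=1):
    """Made-up data: two hidden groups for clustering, a noisy line for regression."""
    rng = random.Random(seed)
    groups = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(10, 1) for _ in range(50)]
    xs = [rng.uniform(0, 10) for _ in range(100)]
    ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
    return groups, xs, ys


if __name__ == "__main__":
    groups, xs, ys = simulate()
    # Unsupervised: only the measurements are given; the group structure is learned.
    print("estimated group centers:", two_means_1d(groups))
    # Supervised: (x, y) pairs are given; the fitted model predicts y from a new x.
    a, b = fit_line(xs, ys)
    print("fitted line: y = %.2f + %.2f * x" % (a, b))
    print("prediction at x = 4:", a + b * 4)
```

The point of the contrast is the input: `two_means_1d` never sees a response variable, while `fit_line` needs one for every observation.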

Google: Search, Ads, Gmail, Chrome

Facebook: People you may know

Netflix: $1M prize!

eharmony: Falling in love with statistics

FICO: An algorithm that could cause a lot of grief

FlightCaster: Apparently it's even used by airlines themselves

IBM's Watson: A combination of many things, including data mining

Handwritten postal codes (from ESL p. 404): We could have robot mailmen someday

Subtypes of breast cancer: Subtypes of breast cancer based on wound response

Predicting Alzheimer's disease (from Raji et al. (2009), "Age, Alzheimer's disease, and brain structure"): Can we predict Alzheimer's disease years in advance?

Kaggle 2015 Challenge: Competition to find interesting information in Census data and maps.

What to expect
Expect to be able to deal with messy data and to write code that is reproducible.
Why?
- Real applied problems are messy (and for others to understand how you attacked something, it's important that your method, process, and code be accessible, well documented, and reproducible).
- You can't always open up R, download a package, and get a reasonable answer.
- Real data is messy and always presents new complications.
- Understanding why and how things work is a necessary precursor to figuring out what to do.

Recurring themes
- Exact approach versus approximation: often when we can't do something exactly, we'll settle for an approximation. An approximation can perform well, and it scales well computationally to large problems.
- Bias-variance tradeoff: nearly every modeling decision is a tradeoff between bias and variance. Higher model complexity means lower bias and higher variance.
- Interpretability versus predictive performance: there is also usually a tradeoff between a model that is interpretable and one that predicts well under general circumstances.
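The bias-variance tradeoff can be seen in a small simulation. The sketch below is in Python with only the standard library (the course itself uses R). It uses k-nearest-neighbor regression purely as a convenient stand-in, since the slides do not tie the tradeoff to any particular method. The true function, noise level, test point, and the name `bias_variance` are all assumptions made for illustration. A small k gives a flexible, high-complexity fit and a large k gives a smooth, low-complexity one.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed 'true' regression function for the simulation."""
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance(k, x0=0.25, n=50, noise_sd=0.3, reps=500, seed=0):
    """Estimate squared bias and variance of the k-NN prediction at x0
    by refitting on many simulated training sets."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # fixed design on [0, 1]
    preds = []
    for _ in range(reps):
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    bias_sq = (statistics.mean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


if __name__ == "__main__":
    for k in (1, 5, 25):
        b, v = bias_variance(k)
        print(f"k={k:2d}  bias^2={b:.4f}  variance={v:.4f}")
```

Under these assumptions, k = 1 tracks the noise in each training set (low bias, high variance), while k = 25 averages over half the range and flattens the peak of the curve (high bias, low variance).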

There's not a universal recipe book
- Unfortunately, there's no universal recipe book for when and in what situations you should apply certain data mining methods.
- Statistics doesn't work like that. Sometimes there's a clear approach; sometimes there is a good amount of uncertainty in what route should be taken. That's what makes it so hard, and so fun.
- This is true even at the expert level (and there are even larger philosophical disagreements spanning whole classes of problems).
- The best you can do is try to understand the problem, understand the proposed methods and what assumptions they are making, and find some way to evaluate their performances.

Hopefully you're still awake
What do I need to know about the course?

Course staff:
- Instructor: Rebecca Steorts (you can call me Beka or Professor Steorts, please not Professor)
- TAs: Abbas Zaidi and Yikun (Joey) Zhou
Why are you here?
- Because you love the subject, because it's required, because you eventually want to make $$$...
- No matter the reason, everyone can get something out of the course.
- Work hard and have fun!

Culture of the class
- Teaching you to fish (versus giving you one). It's amazing what a determined individual can learn from documentation, small learning examples, and... gasp... Googling. And also Stack Overflow.
- Rewarding engagement, intellectual generosity, and curiosity. Speaking up, sharing success OR failure, and showing some interest in something will earn marks.
- Zero tolerance of plagiarism! Generating your own ideas, your own code, and finding your own way is a big reason you're here. The process is much more important than simply getting to the end point or product.

Logistics/Grading
- Two lectures a week: concepts, methods, examples
- Lab to try stuff out and get fast feedback (10%)
- Participation, creativity in new ideas, sharing in your successes and failures, etc. (5%)
- HW weekly to do longer and more complex things (35%)
- Mid-term exam (25%)
- Final project in groups of 2-3, will be fun! (25%)

Prerequisites: Assuming you know basic probability and statistics, linear algebra, and R programming (see syllabus for topics list).
Textbooks:
- Course textbook: Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. Get it online at http://www-bcf.usc.edu/~gareth/isl.
- Bayesian Essentials with R by Marin and Robert. Please order this one (we won't need it for about two weeks).
- More advanced textbook: Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Also available online at http://www-stat.stanford.edu/elemstatlearn

Markdown, RStudio, and LaTeX
You must type all assignments and take-home exams in Markdown (we will talk about submissions in class). All code must be written in RStudio.
- https://guides.github.com/features/mastering-markdown/
- https://rstudio-pubs-static.s3.amazonaws.com/18858_0c289c260a574ea08c0f10b944abc883.html

Turning in Your Own Work
You may work together on homework. In fact, you should. You'll learn a great deal. But...
- All code, write-ups, etc. must be your own work and not shared or copied in any manner.
- All take-home exams or projects that are not collaborative are to be your own work. You may not work together on them.
- If I find out that any assignment is not your own, you will receive a 0 on the assignment and you will be reported as per the university's policy on cheating and plagiarism.
You will submit all work online, which Abbas will explain.

Setting up for success
1. R or RStudio
2. Intro to RStudio and Markdown: first lab.
3. git and bitbucket

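Quick intro to git
Download git and set up a Bitbucket account.
1. git init (create a new repository in the current directory)
2. git add file.txt (stage file.txt for the next commit)
3. git log (show the commit history)
4. git status (show which files are modified or staged)
5. git commit -a -m "here are my changes" (commit all changes to tracked files, with a message)
6. git push (send your commits to the remote repository)
You will learn more complicated git operations in your lab (branching, merging, uploading a file, etc.).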

Who am I
- Assistant prof and affiliated faculty at SSRI and iid.
- Specialize in record linkage and dimension reduction methods and algorithms for applications in human rights conflicts, official statistics, social networks, medical databases, and many others.
- Methods I work on focus on Bayesian methods, machine learning, and scalable algorithms (intensive computing).
- PhD in 2012 from University of Florida and finished Visiting Assistant Professorship at CMU in 2015.
- First semester at Duke and second time teaching the course, but changing many things!
- Very excited to be here.

Next time: RStudio, Markdown, and git.