Statistics W4240: Data Mining Columbia University Spring, 2014



Version: January 30, 2014. The syllabus is subject to change, so look for the version with the most recent date.

Course Description

Massive data collection and storage capacities have led to new statistical questions:

Amazon collects purchase histories and item ratings from millions of its users. How can it use these to predict which items users are likely to purchase and like?

Yahoo news acts as a clearinghouse for news stories and collects user click-through data on those stories. How should it organize the stories based on the click-through data and the text of each story?

Advances in molecular biology have allowed scientists to gather massive amounts of genomic data. How can this be used to predict gene interactions?

Large medical labs can receive thousands of tissue and cell samples per day. How can they automatically screen cancerous specimens from non-cancerous ones, preferably with a higher accuracy than doctors?

Facebook gets millions of photographs annotated by its users. How can it use this data to automatically detect who is in newly uploaded photos?

Many new problems in science, industry, arts and entertainment require traditional and nontraditional forms of data analysis. In this course, you will learn how to use a set of methods for modern data mining: how to use each method, the assumptions, computational costs, how to implement it, and when not to use it. Most importantly, you will learn how to think about and model data analysis problems.

Administrative

Lecture (Note that seating will be very tight, so it is critical that you go to your assigned section)

Section 1 (Hannah): Tuesday and Thursday, 6:10PM-7:25PM
Location: 614 Schermerhorn

Section 2 (Cunningham): Monday and Wednesday, 6:10PM-7:25PM
Location: 329 Pupin Laboratories

Instructors

Lauren Hannah
Office: Department of Statistics, 1009 School of Social Work, 1255 Amsterdam
Email: lah2178@columbia.edu

John Cunningham
Office: Department of Statistics, 1026 School of Social Work, 1255 Amsterdam
Email: jpc2181@columbia.edu

Teaching Assistants

Phyllis Wan
Office Hours: Mondays 2:40pm - 4:40pm
Office: Department of Statistics, 10th Floor School of Social Work, 1255 Amsterdam
Email: pw2348@columbia.edu

Feihan Lu
Office Hours: Tuesdays 3:30pm - 5:30pm
Office: Department of Statistics, 10th Floor School of Social Work, 1255 Amsterdam
Email: fl2238@columbia.edu

Virtual Office Hours

Owing to student schedules and the subsequent challenges of finding mutually suitable office hours, we will use a virtual platform. Piazza is a highly regarded forum for students to discuss class questions, homework problems, and more. Discussing problems is encouraged, but full solutions should not be posted (see section on academic integrity). The tool can be found at: https://piazza.com/class/hq76cfqlygz6no?cid=6. Many Columbia classes find this to be a much quicker and more effective tool than traditional office hours, and we encourage students to use it both to ask questions and to improve their own understanding by posting answers and comments.

Prerequisites

A previous course in statistics, elementary probability, multivariate calculus, linear algebra and ability to do moderate coding in R. A quiz will be given during the first class to allow students to self-assess their preparedness for this course.

Grading and Academic Integrity

We take the honor code very seriously; students caught cheating or otherwise in violation will face disciplinary action. Please note the Barnard honor code text:

We... resolve to uphold the honor of the College by refraining from every form of dishonesty in our academic life.
We consider it dishonest to ask for, give, or receive help in examinations or quizzes, to use any papers or books not authorized by the instructor in examinations, or to present oral work or written work which is not entirely our own, unless otherwise approved by the instructor... We pledge to do all that is in our power to create a spirit of honesty and honor for its own sake.

http://barnard.edu/node/2875
https://www.college.columbia.edu/academics/academicintegrity

Your grade will be determined by three different components:

Homework (30%). Homework will contain both written and R data analysis elements. This is due online by the beginning of class on the due date.

Midterm Exam (30%). This will be given in class during midterm week. You will be permitted to use one handwritten page, front and back, of notes.

Final Exam (40%). This will be given in class during the finals period. You will be permitted to use one handwritten page, front and back, of notes.

Failure to complete any of these components may result in a D or F. Grades may be adjusted upward or downward to reflect trajectory.

Late Work and Regrading Policy: No late work or requests for regrades are accepted.

Homework: Students are encouraged to work together, but homework write-ups must be done individually and must be entirely the author's own work. Homework is due at the beginning of the class for which it is due. Late homework will not be accepted under any circumstances. To receive full credit, students must thoroughly explain how they arrived at their solutions and include the following information on their homeworks: name, UNI, homework number (e.g., HW03), class (STAT W4240), and section number. All homework must be turned in online through Courseworks in two parts: 1) the written part of submitted homework must be in PDF format, have a .pdf extension (lowercase!), and be less than 4MB; and 2) the code portion of submitted homework must be in R and have a .R extension (uppercase!). Homeworks not adhering to these requirements will receive no credit.

Syllabus and Readings

Readings come from the following texts:

James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2014.
Torgo, L. Data Mining with R. CRC Press, 2011.
Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition. Springer, 2009.

The readings will cover more than we will in class.

Other useful books:

Adler, J. R in a Nutshell: A Desktop Quick Reference. O'Reilly Media, 2010.
Bishop, C. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
Witten, I. H., Frank, E. and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2011.
Wu, X. and Kumar, V., eds. Top Ten Algorithms in Data Mining. CRC Press, 2009.

Table 1: Section 001

#   Class      Subject                       Reading                     Other
01  T Jan 21   (cancelled)
02  Th Jan 23  Intro
03  T Jan 28   Intro to R                    James Ch. 2.3
04  Th Jan 30  Probability Models            Torgo Ch. 1
05  T Feb 4    Dimension Reduction I         James Ch. 10.1, 10.2, 10.4  HW 1 Due
06  Th Feb 6   Dimension Reduction II
07  T Feb 11   Supervised Learning I         James Ch. 2.1
08  Th Feb 13  Supervised Learning II        James Ch. 2.2
09  T Feb 18   Linear Models I               James Ch. 3.1-3.3.1         HW 2 Due
10  Th Feb 20  Linear Models II              James Ch. 3.3.2-3.6
11  T Feb 25   Logistic Regression           James Ch. 4.1-4.3
12  Th Feb 27  Linear Discriminant Analysis  James Ch. 4.4-4.6
13  T Mar 4    Naive Bayes                   Hastie Ch. 6.6              HW 3 Due
14  Th Mar 6   Cross Validation              James Ch. 5.1
15  T Mar 11   Midterm
16  Th Mar 13  The Bootstrap                 James Ch. 5.2-5.3
--  T Mar 18   No Class (Spring Break)
--  Th Mar 20  No Class (Spring Break)
17  T Mar 25   Subset Selection              James Ch. 6.1, 6.5
18  Th Mar 27  Shrinkage                     James Ch. 6.2-6.4, 6.6
19  T Apr 1    Trees I                       James Ch. 8.1               HW 4 Due
20  Th Apr 3   Trees II
21  T Apr 8    Boosting                      Hastie Ch. 10.1-10.9
22  Th Apr 10  Bagging and Random Forests    James Ch. 8.2-8.3
23  T Apr 15   SVMs I                        James Ch. 9.1-9.2           HW 5 Due
24  Th Apr 17  SVMs II                       James Ch. 9.3-9.6
25  T Apr 22   SVMs III
26  Th Apr 24  Clustering I                  James Ch. 10.3, 10.5
27  T Apr 29   Clustering II
28  Th May 1   A Priori                      Hastie Ch. 14.1-14.2        HW 6 Due

Table 2: Section 002

#   Class      Subject                       Reading                     Other
01  W Jan 22   Intro
02  M Jan 27   Intro to R                    James Ch. 2.3
03  W Jan 29   Probability Models            Torgo Ch. 1
04  M Feb 3    Data Cleaning
05  W Feb 5    Dimension Reduction I         James Ch. 10.1, 10.2, 10.4  HW 1 Due
06  M Feb 10   Dimension Reduction II
07  W Feb 12   Supervised Learning I         James Ch. 2.1
08  M Feb 17   Supervised Learning II        James Ch. 2.2
09  W Feb 19   Linear Models I               James Ch. 3.1-3.3.1         HW 2 Due
10  M Feb 24   Linear Models II              James Ch. 3.3.2-3.6
11  W Feb 26   Logistic Regression           James Ch. 4.1-4.3
12  M Mar 3    Linear Discriminant Analysis  James Ch. 4.4-4.6
13  W Mar 5    Naive Bayes                   Hastie Ch. 6.6              HW 3 Due
14  M Mar 10   Cross Validation              James Ch. 5.1
15  W Mar 12   Midterm
--  M Mar 17   No Class (Spring Break)
--  W Mar 19   No Class (Spring Break)
16  M Mar 24   The Bootstrap                 James Ch. 5.2-5.3
17  W Mar 26   Subset Selection              James Ch. 6.1, 6.5
18  M Mar 31   Shrinkage                     James Ch. 6.2-6.4, 6.6
19  W Apr 2    Trees I                       James Ch. 8.1               HW 4 Due
20  M Apr 7    Trees II
21  W Apr 9    Boosting                      Hastie Ch. 10.1-10.9
22  M Apr 14   Bagging and Random Forests    James Ch. 8.2-8.3
23  W Apr 16   SVMs I                        James Ch. 9.1-9.2           HW 5 Due
24  M Apr 21   SVMs II                       James Ch. 9.3-9.6
25  W Apr 23   SVMs III
26  M Apr 28   Clustering I                  James Ch. 10.3, 10.5
27  W Apr 30   Clustering II
28  M May 5    A Priori                      Hastie Ch. 14.1-14.2        HW 6 Due