95-791 Data Mining Carnegie Mellon University Mini 2, Fall 2015. Syllabus




Instructor
Dr. Artur Dubrawski
awd@cs.cmu.edu, Newell-Simon Hall 3121
Mondays, 4:45pm-5:55pm (advance notice please).

Head Teaching Assistants
Benedikt Boecking, boecking@andrew.cmu.edu, NSH 3122
Jessie Chen, jieshic+@andrew.cmu.edu, NSH 3123
Dr. Mathieu Guillame-Bert, mathieug@andrew.cmu.edu, NSH 3123

Teaching Assistants
Lujie (Karen) Chen, lujiec@andrew.cmu.edu
Shuai Cao, shuaic@andrew.cmu.edu
Mahesh Gajwani, mgajwani@andrew.cmu.edu
Aleksandr Glumov, aglumov@andrew.cmu.edu

Meetings
Mondays, 6:00pm-8:50pm, Hamburg Hall 1000.
Fridays, 9:00am-10:20am, Hamburg Hall 1000.

Prerequisites
95-796 Statistics for IT Managers or instructor's permission based on the student's knowledge of fundamentals of probability and statistics. Previous experience with data analysis will be considered a plus, although it is not absolutely necessary.

Introduction
Data mining, the intelligent analysis of information stored in data sets, has gained a substantial interest among practitioners in a variety of fields and industries. Nowadays, almost every organization collects data, which can be analyzed in order to support making better decisions, improving policies, discovering computer network intrusion patterns, designing new drugs, detecting credit fraud, making accurate medical diagnoses, predicting imminent occurrences of important events, monitoring and evaluation of reliability to preempt failures of complex systems, etc.

About the Instructor: Artur Dubrawski is a scientist and a practitioner. He has been researching machine intelligence and its applications for twenty years. In the past, he has been affiliated with an advanced data mining firm, Schenley Park Research, and served as Chief Technical Officer at Aethon, a local high-tech company making autonomous delivery robots. Currently Dr. Dubrawski is a faculty member at the CMU Robotics Institute, where he directs the Auton Lab: a data mining and machine learning research group.

Auton Lab's work has yielded multiple deployments of analytic solutions and software in various government and industrial applications.

Course Objectives
This course will provide participants with an understanding of fundamental data mining methodologies and with the ability to formulate and solve problems with them. Particular attention will be paid to practical, efficient and statistically sound techniques, capable of providing not only the requested discoveries, but also estimates of their utility. The lectures will be complemented with hands-on experience with data mining software, primarily R, to allow development of basic execution skills. The scope of the course will cover the following groups of topics.

Foundations. How to make data mining practical? (approximately 40% of class time)
- Learning from data: why, what and how? Fundamental tasks, issues and paradigms of learning models from data.
- Real world data is noisy and uncertain. How much can we trust the results of our analyses? Model selection.
- Reduction of dimensionality and data engineering.
- Measures of association between data attributes: information theoretic, correlational.

Pragmatic methodologies for mining data (approximately 60% of class time)
- Predictive analytics: classification and regression.
- Cost-sensitive model selection using ROC approach.
- Compression of data and models for improved reliability, understandability, and tractability of large sets of highly dimensional data.
- Association rule learning and decision list learning, decision trees.
- Introduction to density estimation, anomaly detection, and clustering.
- Overview of mining complex types of data.
- Illustrative examples of real-world applications.

Registration
Pass-fail registration is not allowed in this course. Audit requests will be denied except for extraordinary circumstances.

Academic Integrity and Classroom Habits
Students are expected to strictly follow Carnegie Mellon University rules of academic integrity in this course. This means in particular that quizzes, examinations and homework are to be the work of the individual student using only permitted material and without any cooperation of other students or third parties. It also means that usage of work by others is only permitted in the form of quotations and any such quotation must be distinctively marked to enable identification of the student's own work and own ideas. All external sources used must be properly cited, including author name(s), publication title, year of publication, and a complete reference needed for retrieval. Regarding the group projects, the work should be the work of only the group members. In all their work students should not in any way rely on solutions to problems distributed in prior years or on the work of prior students or other current students. Violations will be penalized to the full extent mandated by the CMU policies. There will be no exceptions.

Usage of electronic equipment such as portable computers or telephones during lectures is very strongly discouraged, except for meetings specifically designated for hands-on software demonstrations, and the beginning of each lecture when a quiz is to be administered.

No student may record or tape any classroom activity without the express written consent of the instructor. If a student believes that he/she is disabled and needs to record or tape classroom activities, he/she should contact the CMU Office of Disability Resources to request an appropriate accommodation.

Lecture Notes
Hard copies of the lecture notes will not be distributed. The notes will be available for download from the course Blackboard site at least 12 hours before each lecture. If the course Blackboard site is not activated in time for the first class meeting, the notes will be emailed to all registered students. Email will also be used to deliver course material to those on the waiting list. Students are encouraged to bring their printed copies of the notes to class.

Reading Material
Unfortunately, the ideal textbook for this course does not exist. Instead, we will use a selection of readings excerpted from a variety of sources. These readings are intended to complement the material presented in class. Selected issues covered by the required readings will become topics of graded assignments and the final examination. Some of the required readings might be handed out in hard copies, but most of the required material will be distributed electronically, either via email or through the course Blackboard site, or as pointers to resources available on the internet for free download.

Note that many of the readings are protected under copyright law. In order to use them in this course it was necessary to purchase official permissions from the copyright holders. Each enrolled student could have their HUB account charged with an equal share of the copyright fees. Although the exact amount of the individual share is not known at the moment of writing this document, it is estimated to not exceed $30.00. Please note that it is illegal to distribute copies of the copyrighted materials without obtaining permissions from their legal owners.

Interested students are welcome to go beyond the scope of the required readings. In particular, the following books are recommended, but not required (listed in no particular order):
1. Hand, Mannila and Smyth: Principles of Data Mining, MIT Press, 2001.
2. Witten and Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000 (with newer editions available).
3. Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, Springer, 2001.
4. Mitchell: Machine Learning, McGraw-Hill, 1997.

Software and Hands-on Exercises
We will primarily rely on R, which is free software, to demonstrate and operationalize concepts presented during lectures. Students are expected to download and install the software, as well as learn basic usage skills on their own using tutorials available online. Appropriate resources will be recommended during the first lecture.

Recitations will review concepts taught in lectures and connect them to homework problems through examples. Recitation sessions in which software tools are introduced (R and Tableau) will provide an opportunity for hands-on experience: the students will be asked to follow the presenter using their laptops and they will work on assigned exercises while in session.

Distribution of Assignments and Submission of Reports
All assignments will be distributed electronically either through the course Blackboard site or via email. All reports (including homework) must be submitted electronically through the course Blackboard site. During the first week of the course, assignments will be emailed to students who are on the wait list. They will be required to email their reports in electronic form to the TAs and the instructor by the designated deadlines, unless they receive access to the course Blackboard site after being moved to the regular roster.

Grading
Grades will be based upon the results of four homework assignments, one analytical project, six in-classroom quizzes, and a written, in-class final examination.

Homework assignments are due at the beginning of class on the due date. Late homework will be accepted until 24 hours past the deadline, but it will be subject to an automatic 50% grade reduction.

Open-book quizzes will be administered at the beginning of every Monday class starting from the second week of classes. They will measure the level of comprehension of the key concepts covered in the lectures and obligatory readings.

The analytical projects will be conducted in small groups of students. Each team will analyze specific real-world data. The project will be graded based on preparedness for review meetings, two written progress reports and one final report, and a recording of an oral presentation of the results.

Class participation is considered to be a valuable criterion in grading performance. Empirically, the students who take an active part in lectures and those who frequent recitations and office hours tend to grasp the taught concepts more effectively than those who prefer a passive approach. The activity component of the grade will consider records of attendance in lectures and recitations, and the quality (not just the quantity) of the student's contributions to classroom discussions.

The final grade for this course will be the result of a combination of elementary evaluations of a spectrum of skills specifically looked for by top employers. They include: the ability to solve problems individually given an abundance of resources (homework); the ability to perform individually under stress and time constraints with (quizzes) and without access to external sources of information (final exam); teamwork (group projects); and the capability of focused verbal interaction (active participation).
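As an illustration only, the weighting scheme for this course (homework 24%, quizzes 12%, analytical project 30%, active participation 6%, final exam 28%) and the 50% late-homework reduction can be sketched in a few lines of Python. The function names, the 0-100 score scale, and the idea of one aggregate score per component are assumptions made for this sketch; the syllabus does not specify how individual scores are aggregated or how the total maps to a letter grade.

```python
from typing import Mapping

# Component weights in percent, as listed in the syllabus.
WEIGHTS = {
    "homework": 24,       # 4 assignments at 6% each
    "quizzes": 12,        # 6 quizzes at 2% each
    "project": 30,        # analytical project (in teams)
    "participation": 6,   # active participation
    "final_exam": 28,
}

# Late homework (up to 24 hours past the deadline) loses 50% of its grade.
LATE_PENALTY = 0.5


def homework_score(raw: float, late: bool = False) -> float:
    """Return one homework score after the late reduction, if any."""
    return raw * (1 - LATE_PENALTY) if late else raw


def final_grade(scores: Mapping[str, float]) -> float:
    """Combine per-component scores (assumed 0-100 scale) into a weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student: the third homework was submitted late.
homeworks = [
    homework_score(90),
    homework_score(80),
    homework_score(100, late=True),
    homework_score(70),
]
example = final_grade({
    "homework": sum(homeworks) / len(homeworks),  # simple average (assumption)
    "quizzes": 85,
    "project": 88,
    "participation": 100,
    "final_exam": 75,
})
```

Weights are kept as integer percentages and divided by 100 once at the end, which avoids accumulating small floating-point errors from fractional weights.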
Homework (4 times 6%)            24%
Quizzes (6 times 2%)             12%
Analytical project (in teams)    30%
Active participation              6%
Final exam                       28%
Total                           100%

Timetable and Schedule of Lecture Topics (subject to possible revisions)
Note: Details of the recitation sessions will be provided separately.

Monday, October 26th
Lecture 1. Course introduction. Learning from data: Why, What and How?
Lecture 2. How to identify reliable models in Data Mining?
Homework #1 handed out; reports due at 6pm on Monday, November 2nd.

Friday, October 30th
Lecture 3. Predictive analytics: Classification.

Monday, November 2nd
Homework #1 due.
Recitations 1 and 2.
Homework #2 handed out; reports due at 6pm on Monday, November 9th.
Project teams formed and project assignments distributed electronically.

Friday, November 6th
Lecture 4. Cost-aware analysis of classifiers.

Monday, November 9th
Homework #2 due.
Lecture 5. Preprocessing of data. Reduction of dimensionality.

Friday, November 13th
Lecture 6. Discovering structural relationships in data: Rules and trees (Part 1).

Monday, November 16th
Analytical project first milestone reports due.
Recitations 3 and 4.
Homework #3 handed out; reports due at 6pm on Monday, November 23rd.

Friday, November 20th
Lecture 6. Discovering structural relationships in data: Rules and trees (Part 2).

Monday, November 23rd
Homework #3 due.
Lecture 7. Descriptive analytics: Density estimation, anomaly detection, and clustering.

Friday, November 27th
No class. Thanksgiving break.

Monday, November 30th
Analytical project second milestone reports due.
Lecture 8. Predictive analytics: Regression.
Lecture 9. Estimation of significance.
Homework #4 handed out; reports due at 6pm on Monday, December 7th.

Friday, December 4th
Recitation 5.

Monday, December 7th
Homework #4 due.
Lecture 10. Overview of mining complex types of data.
Lecture 11. Final Review.

Friday, December 11th
Recitation 6.

Monday, December 14th (tentatively at the regular class time and place)
Project final reports and presentations due.
Final examination.