CS 207 - Data Science and Visualization Spring 2016




Professor: Sorelle Friedler
sorelle@cs.haverford.edu

An introduction to techniques for the automated and human-assisted analysis of data sets. These big data techniques are applied to data sets from multiple disciplines and include cluster, network, and other analytical methods paired with appropriate visualizations. Course cap: 24 students. Includes a required lab section.

Textbooks

Selections from the following textbooks will be used. They are all available either for free or in the science library.

Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman. Available free at: http://infolab.stanford.edu/~ullman/mmds/booka.pdf. We'll call this book Data Mining.

Visualization Design and Analysis: Abstractions, Principles, and Methods by Tamara Munzner. We'll call this book Visualization.

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Available free at: http://www-stat.stanford.edu/~tibs/elemstatlearn/. We'll call this book Statistical Learning.

Interactive Data Visualization for the Web by Scott Murray. Available free at: http://chimera.labs.oreilly.com/books/1230000000345/index.html. We'll call this book D3.

Prerequisites

All three of the following categories satisfied, or permission of the instructor.

    CS 105 with a grade of 2.0 or better
    CS 106 with a grade of 2.0 or better
    CS 231 with a grade of 2.0 or better

Topics

There will be approximately one lab per main topic listed below, on which students will apply the analysis techniques from that section to a given data set.

1. Statistical Background (2 weeks)
    probability distributions
    random numbers
    regression analysis

2. Visualization Background (1 week)
    data abstractions
    visual encoding principles

3. Cluster Analysis (5 weeks)
    nearest neighbors
    hierarchical clustering
    centroid-based clustering (k-means and variants)
    visualization - hierarchical visualizations, heat maps, matrix views

4. Network Analysis (3 weeks)
    graph theory basics (matrix representation, etc.)
    centrality
    PageRank
    visualization - layout options, force-directed placement, color

5. Supervised Learning (2 weeks)
    Model evaluation
    Decision trees
    Overfitting
    Ensembles and boosting
    Naive Bayes

Labs

0. HTML / CSS basics. Make a personal website using bootstrap. You may pass out of this part of the lab by sending a link to a previously created website. Set up d3.
1. Data cleaning.
2. d3 visualization basics via linear regression graphing.
3. Nearest neighbor searching.
4. Hierarchical clustering.
5. k-means clustering and finding k.
6. PageRank.

Schedule - week by week

This schedule is tentative. Students should expect at least 10 hours of work each week. For the most up-to-date dates and deadlines see the course Google Calendar. Reading should be done during or before the week in which it is listed. The topics of the week will be based on, but not exclusive to, the reading.

1. Introduction to data science and visualization. Introduction to specific data sets (with visits by some professors).
    Reading: Chapter 1 in Statistical Learning, Chapters 1-4 in D3, Chapter 1 in Visualization, Chapter 1 in Data Mining
    Lab work (lab 0): Make a personal website. Set up d3.

2. Probability basics, Gaussians, Linear Regression.
    Reading: Chapter sections 2.1, 2.2, and 2.3.1 in Statistical Learning (linear regression), Chapters 3-6 in D3
    Lab 0 Due: HTML / CSS basics with bootstrap.
    Due Thursday: an email to Sorelle listing (in order) your top three data set preferences. If you are a scientific computing concentrator, note this in your email.
    Lab work: More D3 basics.

3. Visualization frameworks
    Reading: Chapters 2 and 3 in Visualization, Chapters 3-6 in D3.
    Lab work: Work on lab 1.
    Lab 1 Due: data cleaning.

4. Nearest neighbors and kth nearest neighbors
    Reading: Chapter 3.1 (nearest neighbors) and 3.5 (distance measures) in Data Mining, Chapter sections 2.3.2, 2.3.3 (nearest neighbors), and 13.3 (kth nearest neighbors) in Statistical Learning. For information about nearest neighbors also see this book (especially the introduction): http://tripod.brynmawr.edu.ezproxy.haverford.edu/find/record/.b3282286

5. Hierarchical clustering and visualization
    Reading: Chapter 7.2 in Data Mining, Chapter 9.2 in Visualization
    Lab 2 Due: d3 basics and linear regression graphing.

6. Clustering overview, some clustering via proximity (k-center, etc.), k-means clustering, Lloyd's algorithm
    Reading: Chapter 7.1 in Data Mining, Chapter 13.1 in Statistical Learning, 7.3 in Data Mining, Suresh's clustering series posts 2-5 (http://geomblog.blogspot.com/p/conceptual-view-of-clustering.html)

7. Midterm exam week
    Lab 3 Due: Clustering lab 1 - nearest neighbor searching
    Wednesday, March 2nd - Midterm exam.

8. SPRING BREAK!

9. Choosing k and visualization with filtering, force-directed layout, correlation clustering
    Reading: Chapter 10 in Visualization, correlation clustering Wikipedia page and this blog post: http://blog.computationalcomplexity.org/2006/08/correlation-clustering.html

10. Network analysis intro: adjacency matrices, adjacency lists, graph theory intro, network visualizations, color, network analysis basics. Dijkstra's algorithm.
    Reading: 7.2 (link marks) and 7.3 (color) in Visualization
    Lab 4 Due: Clustering lab 2 - hierarchical clustering.

11. Betweenness centrality and PageRank.
    Reading: Chapter 5.1 and 5.2 (PageRank) in Data Mining and 14.10 (PageRank) in Statistical Learning

12. Supervised learning: Evaluating training vs. test data. Linear regression as a model. Decision trees. Overfitting. Ensembles. Boosting.
    Lab 5 Due: Clustering lab 3 - k-means clustering and finding k.

13. Naive Bayes. Visualization case studies (from the NY Times). Memory limitations.

14. Advanced topics. Use both class and lab time to work on your posters and your lab.
    Lab 6 Due: Network analysis lab - PageRank

15. Data set discussions and poster session. Poster printing appointments at the KINSC office
    Due Wednesday: Poster session during class time.

Total grade breakdown

    Participation and Attendance    5%
    Labs                           35%
    Midterm                        20%
    Final Project                  40%

Grades will be awarded based on the number of points earned and according to the percentage breakdowns shown. Students will not be graded on a curve.

Final Project

Students will work to analyze a data set throughout the semester. They will be responsible for choosing an appropriate analysis method and creating an associated visualization. Based on their findings, they will write a research paper including a description of their methods and the analysis performed, an explanation of their findings, and the visualization produced. See the separate project details description for more information.

Late work policy

All extensions must be requested at least 24 hours in advance of the deadline. Extensions will be granted based on individual circumstances. Work handed in late without a previously granted extension may not be accepted.

Rules and Pet Peeves

    Be on time. This includes class, lab, office hours, and appointments.
    No computer use in class without approval. Computers should only be used in class to take notes. This includes laptops, tablets, phones, etc.
    Expect 24 hours before an email response and read all emails within 24 hours.
    Attend all classes and labs.

Collaboration

You are encouraged to discuss the lecture material and the labs and problems with other students, subject to the following restriction: the only product of your discussion should be your memory/understanding of it - you may not write up solutions together, or exchange written work or computer files. Any group projects are the only exception to this - in these cases, these collaboration rules apply to students outside of your group and you may freely work closely with students within your group. Collaboration is not allowed on examinations or quizzes. As usual, anything taken from outside sources should be cited. Code should not be copied without permission from the instructor. If permission is given, code should be cited at the location it is used with a comment.

Learning Accommodations

Haverford College is committed to supporting the learning process for all students. Please contact me as soon as possible if you are having difficulties in the course. There are also many resources on campus available to you as a student, including the Office of Academic Resources (https://www.haverford.edu/oar/) and the Office of Access and Disability Services (https://www.haverford.edu/access-and-disability-services/). If you think you may need accommodations because of a disability, you should contact Access and Disability Services at hc-ads@haverford.edu. If you have already been approved to receive academic accommodations and would like to request accommodations in this course because of a disability, please meet with me privately at the beginning of the semester (ideally within the first two weeks) with your verification letter.