Preliminary Syllabus for the course of
Data Science for Business Analytics

Miguel Godinho de Matos 1,2 and Pedro Ferreira 2,3

1 Católica-Lisbon, School of Business and Economics
2 Heinz College, Carnegie Mellon University
3 Department of Engineering and Public Policy, Carnegie Mellon University

miguel.godinhomatos@clsbe.lisboa.ucp.pt, pedrof@cmu.edu

2015/2016

1 Course overview

Firms create massive amounts of data as by-products of their activity. The volume and speed with which such data is created make it increasingly necessary for managers to rely on intelligent systems capable of processing large volumes of information in real time to improve decision making.

In this course we will study how business experimentation and data analysis technologies can be used to improve business knowledge and decision making. We will learn about fundamental principles and techniques of predictive modeling, data analysis, and causal inference. We will examine real-world examples and cases of the application of such tools. We will work hands-on with state-of-the-art data analysis software.

After taking this course, students should:

• Have hands-on experience with data analytics.
• Be able to think systematically about how and when data can improve decision making in contexts such as management and marketing investments.
• Be able to understand and discuss topics of data analysis for business intelligence. In particular, know the basic principles and algorithms of data mining well enough to interact with data analytics professionals.
• Be able to design simple experiments to improve business knowledge and decision making.
2 Course Participation Rules

Lectures will cover examples of the fundamental principles and uses of data analytics and data mining. This is not a data mining algorithms course, but we will discuss the mechanics of how these methods work.

Class meetings will be a combination of lectures on fundamental material, case discussions, and student exercises. Reading assignments will cover the core material, and we expect students to be prepared for class discussions.

Students should attend every class session. Failure to do so will have a direct impact on the class grade.

I will check my email at least once a day during the week (Monday through Friday). Please use the special tag [ 2014 - Business Analytics ] in the subject header of the e-mail. I use this tag to make sure I process class email first. If you fail to include the special tag, I may not read the email for a long time.

3 Course Readings

The mandatory textbook for the class will be: Data Science for Business: Fundamental principles of data mining and data analytic thinking, Provost and Fawcett (2013).

We will complement the book with discussions of applications, cases, and demonstrations. Whenever relevant, we will hand out lecture notes. We expect you to ask questions about any material in the notes that remains unclear after the corresponding class and after reading the book.

Depending on the direction our class discussion takes, we may not cover all the material initially planned for any particular session. If the notes and the book are not adequate to explain a topic that we skip, you should ask about it by e-mail. I will be happy to follow up and provide you with additional references.

4 Grading

The grade breakdown is as follows:

• Participation - 10%
• Homework - 40%
• Final Exam - 50%
4.1 Participation

You are expected to attend every class session, to arrive on time, to remain for the entire class, and to follow basic classroom etiquette. Basic classroom etiquette includes disconnecting all electronic devices for the duration of the class (unless otherwise noted). You are expected to participate in class discussions and to understand the material presented in previous lectures.

4.2 Homework

Each homework will comprise questions to be answered and/or hands-on tasks. Except as explicitly noted otherwise, you are expected to complete your assignments on your own.

The hands-on tasks will be based on data that we will provide. You will mine the data to get hands-on experience in formulating problems and using the various techniques discussed in class. You will use these data to build and evaluate predictive models.

For the hands-on assignments we will use the R statistical language (http://cran.r-project.org/). We also recommend that you use the open source version of RStudio (http://www.rstudio.com/) as your development environment.

In order to use R, you must have access to a computer where you can install software. If you do not have such a computer, please see me immediately so we can make alternative arrangements. You should bring your computer to class. We will help you install and configure the software in the first class.

4.3 Final Exam

The subject matter covered and the exact dates will be discussed in class.

5 Class Contents

1. Introduction to data mining and business analytics
   (a) Data Analytics Thinking
   (b) From Big Data 1.0 to Big Data 2.0
   (c) From Business Problems to Data Mining
   (d) Supervised vs. Unsupervised Data Analysis
   (e) The Process of Data Mining
2. Introduction to predictive modeling
   (a) Finding informative attributes
   (b) Tree induction
   (c) Probability estimation
3. Model fit and model overfit
   (a) Finding "optimal" model parameters based on data
   (b) Choosing the goal for data mining
   (c) Objective functions
   (d) Loss functions
   (e) Generalization
   (f) Fitting and overfitting
   (g) Complexity control
4. Model quality and performance evaluation
   (a) Evaluating classifiers
   (b) Expected value as key evaluation framework
   (c) Visualizing model performance (ROC, Lift curve, Cumulative response, Profit curve)
5. Introduction to the paradigm of causal inference
   (a) Limits of data mining
   (b) Correlation versus causation
   (c) Treatment, control, outcomes and randomized experiments
   (d) Power and sample size
6. Randomized experiments in the wild
   (a) Several case discussions (Microsoft, Google, Bing, Facebook, our own work, etc.)
6 Class Schedule

Class Number  Instructor  Topics                                              Readings  Deliverables
1             MGM         Introduction to data mining and business analytics  Chp 1, 2  Info Sheet (in class)
2             MGM         Introduction to predictive modeling                 Chp 3     Homework 1 due
3             MGM         Model fit and model overfit                         Chp 4, 5  Homework 2 due
4             MGM         Model quality and performance evaluation            Chp 7, 8  Homework 3 due
5             PF          Introduction to the paradigm of causal inference    Notes     Homework 4 due
6             PF          Randomized experiments in the wild                  Notes

7 Instructor Bios

Miguel Godinho de Matos (MGM) is a visiting assistant professor of Information Systems and Management at Católica Lisbon School of Business and Economics. He is also a visiting scholar at the Heinz College, Carnegie Mellon University. He received a Ph.D. in Telecommunications Policy and Management and an M.Sc. in Engineering and Public Policy from Carnegie Mellon University. Miguel's research interests focus on the analysis of social networks and peer influence on consumer behavior, and on the impact of digitization on consumer search and choice. Miguel has published his work in top journals and top peer-reviewed research conferences such as Management Information Systems Quarterly, the International Conference of Information Systems, the IEEE Conference on Social Computing, and the Economics of Digitization Seminar Series of the National Bureau of Economic Research.

Pedro Ferreira (PF) is an assistant professor of Information Systems and Management at the Heinz College, Carnegie Mellon University. He received a Ph.D. in Telecommunications Policy from CMU and an M.Sc. in Electrical Engineering and Computer Science from MIT. Pedro's research interests lie in two major domains: identifying causal effects in dense network settings, with direct application to understanding the future of the digital media industry, and the evolving role of technology in the economics of education.
Currently, he is working on a series of large-scale randomized experiments in network settings that aim to identify the role of peer influence in the consumption of media. Pedro has published in top journals and top peer-reviewed research conferences such as Management Science, Management Information Systems Quarterly, and the IEEE Conference on Social Computing.
8 Office Hours

Miguel Godinho de Matos' office hours will be detailed in the first lecture of the course.

Pedro Ferreira will be on campus only for the last sessions of the course. He will not have office hours. Pedro will be available to meet by appointment during his stay at Católica Lisbon School of Business and Economics. Details will be provided in class.