IST687 Applied Data Science

Course: IST687 Applied Data Science
Instructor: Gary Krudys
Semester: Spring 2015
E-Mail: gekrudys@syr.edu
Office: 114 Hinds Hall
Phone: 315-857-7243 (cell)
Office hours:
Meeting place: Online only
Meeting time: Online

Catalog Description
Introduces fundamentals about data and the standards, technologies, and methods for organizing, managing, curating, preserving, and using data. Discusses broader issues relating to data management and use as well as quality control and publication of data.

Course Description
Data scientists play important roles in the four A's of data: data architecture, data acquisition, data analysis and data archiving. A data scientist provides input to system architects on how data needs to be routed and organized to support analysis, visualization and presentation of the data to the appropriate people. Data acquisition, from a data scientist's standpoint, is a critical phase in which data are obtained in the right format, linked to other relevant data and screened to ensure appropriateness for analysis. Analysis is where data scientists are most heavily involved: summarizing the data, using portions of the data (samples) to make inferences about the larger context, and visualizing the data by presenting it in tables, graphs and even animations. Finally, the data scientist must become involved in the archiving of the data. Preservation of collected data in a form that makes it highly reusable (what you might think of as "data curation") is a difficult challenge because it is so hard to anticipate all of the future uses of the data. This course provides an overview of all four areas, as well as the opportunity to ramp up on R, the popular open source statistical analysis and visualization system.
R is considered by many to be the most popular choice among data analysts worldwide; knowledge of and skill in using it is a valuable and marketable job skill for most data scientists. The course will include applied examples of data collection, processing, transformation, management, and analysis to provide students with hands-on experience. Students will have opportunities to practice the methods and tools for planning a data analysis project, from data collection through data quality management and data product development to the final data products and reports. Each student will be evaluated on exercises, reports, class discussion, and a course project.

Learning Objectives
At the end of the course, students are expected to understand:
- Essential concepts and characteristics of data
- Scripting/code development for data management
- Principles and practices in data screening, cleaning, and linking
- Sampler of technologies used to manage, manipulate, and analyze data
- Communication of results to decision makers

At the end of the course, students are expected to be able to:
- Identify a problem and the data needed for addressing the problem
- Perform basic computational scripting using R and other optional tools
- Transform data through processing, linking, aggregation, summarization, and searching
- Organize and manage data at various stages of a project lifecycle

What does it take to succeed in the course?
- An interest in and passion for a data science career in the corporate, academic, or government sector
- Curiosity about natural and social phenomena, and basic computer skills
- Close familiarity with algebra, geometry, and trigonometry
- Basic understanding of descriptive statistics
- Motivation to learn and excel

Textbooks:
Introduction to Data Science (2013) V3, by Jeffrey M. Stanton. Available for free in the iTunes bookstore or as a PDF download at http://jsresearch.net.
The instructor will provide additional and supplemental readings in Blackboard as electronic documents for downloading and printing. Students are expected to read the assigned materials for discussions and coursework.

Contributions to Grade
The work for this class will involve a mixture of individual assignments, reports, and a final project.
- Exercises (11 x 4% = 44%) are designed for you to practice the necessary skills in carrying out data processing, analysis, and management tasks.
  o 4 = Solid / no mistakes (or really minor), well commented/documented
  o 3 = Good / some mistakes
  o 2 = Fair / some major conceptual errors
  o 1 = Poor / did not finish
  o 0 = Did not participate / did not hand in
- Mid-term quiz (15%) is designed to evaluate your mastery of concepts, methods, and tools in data analysis and management. Note: quiz questions are graded out of 100; your score is then weighted at 15%.
  o Content Coverage: Data Science ebook, Overview through Chapter 8
  o Format: multiple choice
- Final group project (30%): For the final project you will identify a data set or a data source, transform the data as needed, and provide an analysis with visualizations. The final group project framework and deliverables are discussed in detail on the course Blackboard site.
- Weekly class discussions (11%) include participation in weekly class discussions on Blackboard, both as discussion leader and as participant. Discussion questions should be relevant to the chapter material. Contributions should be germane to the discussion question, well thought out, substantive, clear, and concise, and will be assessed on these criteria. Discussion leaders are responsible for identifying and posting a chapter-relevant discussion question. Discussion leaders are also responsible for monitoring and responding to postings in their respective discussion thread.

Grading Policy
Each assignment will be graded on the scale specified for its component, and the component scores will be summed at the end of the semester. Grade levels follow the scale below:
A = 95-100
A- = 90-94.9
B+ = 85-89.9
B = 80-84.9
B- = 75-79.9
C+ = 70-74.9
C = 65-69.9
C- = 60-64.9
F = below 60

An incomplete grade, I, can be given only if the circumstances preventing the on-time completion of all course requirements were clearly unforeseeable and uncontrollable. If an incomplete is required, a written contract must be completed which specifies the nature of the missing work, the date it will be completed, and the default grade that will be given if that deadline is missed.

It is unethical to allow some students additional opportunities, such as extra credit assignments, without allowing the same options to all students.

To discuss a grade, arrange for a private meeting in which you identify the sources of your concern. It is important to bring with you to that meeting the relevant materials (e.g., marked papers).
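
The component weights under Contributions to Grade and the letter scale under Grading Policy can be combined as in the following sketch. It is an illustration only, not an official grade calculator. It assumes that each exercise point counts as one percentage point (4 points = 4%), that the final project and discussions are scored out of 100 like the quiz (the syllabus states this only for the quiz), and that each letter band includes its lower bound. The names `final_percentage` and `letter_grade` are hypothetical.

```python
# Component weights, in percent of the final grade, as listed in the syllabus.
WEIGHTS = {"midterm": 15.0, "final_project": 30.0, "discussions": 11.0}
MAX_EXERCISES = 11       # 11 exercises x 4% = 44%
MAX_EXERCISE_SCORE = 4   # each exercise is scored on the 0-4 scale

# Lower bound of each letter-grade band from the Grading Policy.
LETTER_CUTOFFS = [
    (95.0, "A"), (90.0, "A-"), (85.0, "B+"), (80.0, "B"),
    (75.0, "B-"), (70.0, "C+"), (65.0, "C"), (60.0, "C-"),
]


def final_percentage(exercise_scores, midterm, final_project, discussions):
    """Combine component scores into a 0-100 course percentage.

    exercise_scores: list of up to 11 scores on the 0-4 scale.
    midterm, final_project, discussions: scores out of 100 (an assumption
    for the project and discussions).
    """
    if len(exercise_scores) > MAX_EXERCISES:
        raise ValueError("at most 11 exercises are graded")
    if any(not 0 <= s <= MAX_EXERCISE_SCORE for s in exercise_scores):
        raise ValueError("exercise scores must be between 0 and 4")
    components = {
        "midterm": midterm,
        "final_project": final_project,
        "discussions": discussions,
    }
    if any(not 0 <= v <= 100 for v in components.values()):
        raise ValueError("component scores must be between 0 and 100")
    # Assumption: one exercise point equals one percentage point.
    total = float(sum(exercise_scores))
    for name, score in components.items():
        total += score * WEIGHTS[name] / 100.0
    return total


def letter_grade(percentage):
    """Map a course percentage to a letter using the band lower bounds."""
    for cutoff, letter in LETTER_CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    pct = final_percentage([3] * 11, midterm=80, final_project=90, discussions=100)
    print(pct, letter_grade(pct))
```

For example, eleven exercises scored 3, a quiz score of 80, a project score of 90, and full discussion credit give 33 + 12 + 27 + 11 = 83, which falls in the B band.
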
Except for extraordinary circumstances, no appeal for an individual assignment or project will be considered later than two weeks after the graded assignment was returned. For final grades, no appeal will be considered later than one week after the final project submission date.

Preferred contact methods:
If you have a question that you don't mind sharing with the class, post it in the "Questions to the Professor" discussion area. If you have a more private matter, you can email me at my syr.edu address: gekrudys@syr.edu
Expect responses within 48 hours, Monday through Friday, primarily before 8:00a EST and after 5:00p EST. Occasionally I will be able to respond during normal business hours of 8:00a to 5:00p EST. My response hours may seem a bit odd, primarily because I work full time as a Sr. Business Intelligence Analyst for a Regional Health Information Organization.

Academic Integrity
The academic community of Syracuse University and of the School of Information Studies requires the highest standards of professional ethics and personal integrity from all members of the community.
Violations of these standards are violations of a mutual obligation characterized by trust, honesty, and personal honor. As a community, we commit ourselves to standards of academic conduct, impose sanctions against those who violate these standards, and keep appropriate records of violations. The academic integrity statement can be found at: http://www.ist.syr.edu/courses/advising/integrity.asp

Students with Disabilities
In compliance with section 504 of the Americans with Disabilities Act (ADA), Syracuse University is committed to ensuring that no otherwise qualified individual with a disability shall, solely by reason of disability, be excluded from participation in, be denied the benefits of, or be subjected to discrimination under any program or activity. If you feel that you are a student who may need academic accommodations due to a disability, you should immediately register with the Office of Disability Services (ODS) at 804 University Avenue, Room 308, 3rd Floor, 315.443.4498 or 315.443.1371 (TTD only). ODS is the Syracuse University office that authorizes special accommodations for students with disabilities.

IST687 Spring 2015 schedule as of 11.17.14

Unit/Week 0 (1/5/15)
  Topics: Course Introduction; Syllabus Walk Through (Course navigation, Subject content, Exercises/Assignments, Discussion threads, Grading, Course communication)
  Readings: Introduction Lecture; Blackboard
  Activities/Assignments: Complete and post Business Profile. Access the R-Bloggers site (http://www.r-bloggers.com/) and follow the instructions to subscribe to the daily newsletter.

Unit/Week 1 (1/12/15)
  Topics: Data Science Overview
  Readings: Overview to Data Science, pp. ii-7, Section 1: Data Science, Many Skills
  Activities/Assignments: Install the R open source software package on your computer. Submit a screen shot of R to the instructor.

Unit/Weeks 2 and 3 (1/19/15 and 1/26/15)
  Topics: Essential concepts of data (Week 2); Using R to manipulate data (Week 3)
  Readings: Intro to DS, pp. 8-13, Chapter 1: About Data (Week 2); Intro to DS, pp. 14-23, Chapter 2: Identifying Data Problems; Chapter 3: Getting Started with R (Week 3)
  Activities/Assignments: Exercise 1: Boolean logic. Install RStudio. Submit a screen shot of RStudio control panels.

Unit/Week 4 (2/2/15)
  Topics: Data frames and other data structures
  Readings: Intro to DS, pp. 24-36, Chapter 4: Follow the Data; Chapter 5: Rows and Columns
  Activities/Assignments: Exercise 2: myfamily data structure. Exercise 3: Manipulating data frames.

Unit/Week 5 (2/9/15)
  Topics: Descriptive Statistics, Distributions, and Histograms
  Readings: Intro to DS, pp. 37-48, Chapter 6: Beer, Farms, and Peas
  Activities/Assignments: Exercise 4: Generating distributions

Unit/Week 6 (2/16/15)
  Topics: Introduction to sampling and inferential statistics
  Readings: Intro to DS, pp. 49-59, Chapter 7: Sample in a Jar
  Activities/Assignments: Exercise 5: Decision Making Thresholds

Unit/Week 7 (2/23/15)
  Topics: Identifying a data source for your final project
  Readings: Intro to DS, pp. 60-64, Chapter 8: Big Data: Big Deal
  Activities/Assignments: Midterm Quiz

Unit/Week 8 (3/2/15)
  Topics: Code creation with RStudio
  Readings: Intro to DS, pp. 65-75, Chapter 9: Onward with R Studio
  Activities/Assignments: Exercise 6: Code your own MySamplingDistribution()

Semester Break (3/9/15)

Unit/Week 9 (3/16/15)
  Topics: Connecting with external data sources
  Readings: Intro to DS, pp. 132-147, Chapter 14: Storage Wars
  Activities/Assignments: Exercise 7: Running SQL queries from R. Submit 1 page proposal for the dataset or data source you plan to use for your final project.

Unit/Week 10 (3/23/15)
  Topics: Working with map data
  Readings: Intro to DS, pp. 148-159, Chapter 15: Map Mash-Up
  Activities/Assignments: Exercise 8: Color code map points

Unit/Week 11 (3/30/15)
  Topics: Linear modeling
  Readings: Intro to DS, pp. 160-169, Chapter 16: Line Up, Please
  Activities/Assignments: Exercise 9: Evaluating linear models. Submit 2 page proposal outlining data analysis plan.

Unit/Week 12 (4/6/15)
  Topics: Association Rules Mining
  Readings: Intro to DS, pp. 170-182, Chapter 17: Hi Ho, Hi Ho - Data Mining We Go
  Activities/Assignments: Exercise 10: Apply association rules mining to a novel data set

Unit/Week 13 (4/13/15)
  Topics: Support Vector Machines
  Readings: Intro to DS, pp. 183-193, Chapter 18: What's your vector, Victor?
  Activities/Assignments: Exercise 11: Apply support vector analysis to a novel data set. Submit 3 page project report describing results of data screening, cleaning, and linking.

Unit/Week 14 (4/20/15)
  Topics: Final project problem diagnosis and support
  Readings: No reading
  Activities/Assignments: No exercise

Unit/Week 15 (4/27/15)
  Topics: Final project problem diagnosis and support
  Readings: No reading
  Activities/Assignments: Final project submissions due.