DATA ANALYTICS USING R

Duration: 90 Hours

Intended audience and scope:

The course is targeted at fresh engineers, practicing engineers and scientists who are interested in learning and understanding data analytics in sufficient depth and breadth. The course will provide an overview of how to pose meaningful data analytic problems in a commercial setting. By the end of the course, participants will have developed a structured thinking approach for moving from data to problem definition. R, an open-source tool for data analytics, will be introduced in depth. Important and commonly used data analytic and machine learning algorithms will be described in detail. Laboratory sessions will be conducted as part of each module, with the expectation that the participants will be able to apply these algorithms to their own data. A capstone case study tailored to the participants' field of interest will be solved at the end of the course.

Objectives:

- Introduce the participants to the field of data analytics: background and key concepts
- Introduce the participants to problem types in the area of data analytics: a possible problem formulation framework
- Introduce the participants to R: an easy-to-use tool for high-level data analytics
- Introduce the participants to a comprehensive overview of linear algebra and statistics concepts: concepts critical for understanding data analytic algorithms
- Introduce the participants to an in-depth explanation of the most used data analytic algorithms, supported by hands-on work in R from an application viewpoint
- Introduce the participants to a real-life application of data analytics: a case study approach.
  The case study can be chosen by the participants based on their field of interest.

Modules:

Module 1: Data science - Introduction
Module 2: Data science - Class of problems
Module 3: R programming
Module 4: Statistical modelling
Module 5: Data inter-relationships - Basics of linear algebra and a brief introduction to nonlinear equations
Module 6: Data preparation
Module 7: Predictive modelling
Module 8: Machine learning techniques
Module 9: Introduction to text mining and big data
Module 10: Case study
Optional modules

Pre-requisite: Bachelor's degree with understanding of basic statistics, matrix algebra and probability

Module 1: Data science - Introduction (3 hours)

1. Participants will get a bird's-eye view of the field of data analytics

- Introduction to big data and data science (the Vs of big data and the mathematical concepts that are building blocks for this field)
- Analytics as a pervasive solution approach cross-cutting disparate problem domains

Module 2: Data science - Class of problems (3 hours)

1. Participants will learn to apply structured thinking to unstructured problems
2. Participants will be able to categorize and understand various data types
3. Participants will be able to convert imprecise business-relevant problem statements to precise data analytic problems
4. Participants will learn the importance of visualization in the data analytics solution process

- Structured thinking and how it can help
- Conceptual understanding of data types
- Importance of quality of data
- Conceptual understanding of solution typology
- Introduction to problem formulation framework
- Impact of visualization
- Business-relevant problem statements

Module 3: R programming (6 hours)

1. Participants will be introduced to the basics of R programming
2. Participants will be able to write their own programs and will learn to use the already existing data analytics modules in R
3. Participants will be able to import data from Excel, SQL, etc. to the R platform

- RStudio and its GUI
- Data types
- Importing and exporting data in R
- Data preprocessing
- Matrix algebra
- Built-in functions
- Programming in R
- Data visualization (ggplot)
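
As a rough illustration of the Module 3 topics listed above (data types, importing and exporting, preprocessing, matrix algebra and built-in functions), a minimal base-R sketch might look like the following. The `sales` data frame, its column names and all values are made up for this example and are not course material. Plotting with ggplot is left out so that the sketch runs without extra packages.

```r
# Illustrative data only: a small, made-up data frame.
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  units  = c(120L, 95L, 143L, 110L),
  price  = c(9.5, 10.0, 8.75, 9.25),
  stringsAsFactors = FALSE
)

# Data types: inspect the class of each column.
col_types <- sapply(sales, class)

# Exporting and importing: round-trip through a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(sales, csv_path, row.names = FALSE)
sales_in <- read.csv(csv_path, stringsAsFactors = FALSE)

# Data preprocessing: add a derived column and filter rows.
sales_in$revenue <- sales_in$units * sales_in$price
high_volume <- sales_in[sales_in$units > 100, ]

# Matrix algebra: solve the linear system A x = b.
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(3, 5)
x <- solve(A, b)

# Built-in functions: simple summaries.
mean_units <- mean(sales_in$units)

print(col_types)
print(high_volume)
print(x)
print(mean_units)
```

Here `solve(A, b)` is used instead of forming the inverse explicitly, which is the usual choice in R when only the solution vector is needed.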

Module 4: Statistical modelling (10 hours)

1. Participants will become well versed in basic probability and statistics concepts
2. Participants will be able to set up hypothesis testing protocols
3. Participants will be able to interpret hypothesis test results

- Probability
- Principle of counting
- Conditional probability, Bayes' theorem, independent events
- Random variables, expectation
- Continuous and discrete random variables and their distributions (Poisson, Binomial, Normal and its derivatives), statistical intervals
- Descriptive statistics
- Hypothesis testing
- Introduction to elements of hypothesis testing
- General procedure for hypothesis testing
- One-sided and two-sided tests
- p-values, Type I and Type II errors
- Z, T, F and Chi-squared tests for hypothesis testing
- Introduction to Bayesian inference
- Hands-on session in R through examples

Module 5: Data inter-relationships - Basics of linear algebra and a brief introduction to nonlinear equations (12 hours)

1. Participants will be able to identify relationships between variables in large datasets
2. Participants will be able to identify information sufficiency in terms of both equations and variables
3. Participants will be able to understand the basic linear algebra concepts that underlie complicated data analytics algorithms
4. Participants will be able to understand and interpret solutions to simultaneous nonlinear equations

- Solving simultaneous linear equations
  o Independence of the given equations: redundant equations, inconsistent equations
  o Thinking about an ordered set of variables as vectors
  o Constructing matrices out of linear equations
  o Conditions for existence of solutions
  o Conditions for uniqueness of solutions
  o Connections between solutions when there are multiple solutions
  o Elimination approach to solving simultaneous equations

- Introduction of the notion of distance
- Perpendicular vectors: the notion of orthogonality
- Converting vectors to an orthonormal basis
- Understanding solutions when the numbers of variables and equations are different
- Introduction to simultaneous nonlinear equations
- Newton-Raphson method for solving nonlinear equations
- Hands-on session in R through examples

Module 6: Data preparation (6 hours)

1. Participants will be able to set up appropriate sampling techniques
2. Participants will be able to apply techniques to address outliers and missing values

- Sampling techniques
  o Probability sampling
  o Non-probability sampling
- Stratified vs. cluster sampling
- Treating outliers and missing values
- Design of Experiments
  o Single factor
  o Multiple factors
- Hands-on session in R through examples

Module 7: Predictive modelling (16 hours)

1. Participants will be able to identify relationships between variables through correlation analysis
2. Participants will be able to develop predictive models between variables
3. Participants will be able to rationalize and assess the fidelity of the models that are built

- Correlation
  o Pearson's correlation
  o Kendall rank correlation
  o Spearman rank correlation
- Regression
  o Types of regression
  o Fitting a function: criterion for best fit
  o Least squares

- Correlation vs. regression
- Simple regression
- Multiple regression
- Diagnostics and ANOVA
- Model assessment and validation
- Non-parametric testing
- Hands-on session in R through examples

Module 8: Machine learning techniques (24 hours)

1. Participants will be able to understand and develop algorithms for classification problems
2. Participants will be able to understand and develop algorithms for function approximation problems
3. Participants will be able to conceptualize novel algorithms

- Dimensionality reduction methods
  o Principal component analysis and its variants
  o Multidimensional scaling
- Multivariate regression
  o Ridge regression
  o Principal component regression
  o Logistic regression
  o LASSO
- Classification methods
  o Linear discriminant analysis
  o Quadratic discriminant analysis
  o K-neighborhood
  o Naïve Bayes classifier
- Clustering methods
  o K-means clustering
  o Fuzzy C-means clustering
  o Hierarchical clustering
- Hands-on session in R through examples

Module 9: Introduction to text mining and big data (3 hours)

Module 10: Case study (4 hours)

1. Participants will learn to solve data analytics problems from conceptualization to the final solution and concomitant visualization of the solution

Participants can choose from one of the following domains:

- Case Study 1: Marketing and sales
- Case Study 2: Accounting
- Case Study 3: Supply chain management
- Case Study 4: Financial
- Case Study 5: Process productivity improvement

Case study format:

- Introduction to the problem (0.5 hours)
- Participants to identify the problem statement (0.5 hours)
- Presentation of the problem statement by the participants: identification of variables, listing down key assumptions, understanding the data, data preparation requirements, contours of problem solution, visualization specifications (1 hour)
- Presentation of the problem statement by the instructor (1 hour)
- Solution for the problem statement in R, results and visualization (1 hour)

Optional modules (10 hours for each module)

1. Time series
2. Neural networks
3. Decision trees
4. Natural language processing
5. Multivariate data analysis methods
6. Deep learning (pre-requisite: neural networks)
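
To give a flavour of the hands-on sessions in R that the modules refer to, the sketch below illustrates two of the listed topics: the Newton-Raphson method (Module 5) and least-squares regression with correlation (Module 7). It is an illustrative example only, not the course's own lab material. The function name `newton_raphson` and its parameters are hypothetical, the method is shown for a single equation rather than a simultaneous system, and the regression uses R's built-in `cars` dataset purely for convenience.

```r
# Newton-Raphson for one nonlinear equation f(x) = 0.
# Hypothetical helper: f is the function, f_prime its derivative.
newton_raphson <- function(f, f_prime, x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    step <- f(x) / f_prime(x)   # Newton step
    x <- x - step
    if (abs(step) < tol) {
      return(x)
    }
  }
  stop("Newton-Raphson did not converge within max_iter iterations")
}

# Example: the positive root of x^2 - 2 = 0, starting from x0 = 1.
root <- newton_raphson(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)

# Least squares: simple regression of stopping distance on speed.
fit <- lm(dist ~ speed, data = cars)

# The same coefficients from the normal equations (X'X) beta = X'y.
X <- cbind(1, cars$speed)
beta <- solve(t(X) %*% X, t(X) %*% cars$dist)

# Correlation measures listed in Module 7.
r_pearson  <- cor(cars$speed, cars$dist, method = "pearson")
r_spearman <- cor(cars$speed, cars$dist, method = "spearman")
r_kendall  <- cor(cars$speed, cars$dist, method = "kendall")

print(root)
print(coef(fit))
print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

For simple regression with an intercept, the R-squared reported by `summary(fit)` equals the square of the Pearson correlation, which gives a quick consistency check between the two analyses.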

