Exploratory Data Analysis with R. @matthewrenze #codemash



Similar documents
d3.js Data-Driven Documents Scott Murray, Jerome Cukier & Jeffrey Heer VisWeek 2012 Tutorial

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Introduction to Data Visualization

?????? Data Analytics

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Data Analysis and Statistical Software Workshop. Ted Kasha, B.S. Kimberly Galt, Pharm.D., Ph.D.(c) May 14, 2009

11/17/2015. Learning Objectives. What Is Data Mining? Presentation. At the conclusion of this presentation, the learner will be able to:

Strategies For Setting Up Your Organisation For Success With Big Data. Kevin Long Business Development Director Teradata

SkySpark Tools for Visualizing and Understanding Your Data

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance

Additional sources Compilation of sources:

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Marketing Research Core Body Knowledge (MRCBOK ) Learning Objectives

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

GETTING STARTED WITH R AND DATA ANALYSIS

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS

Introduction to Agile and Scrum

Microsoft Services Exceed your business with Microsoft SharePoint Server 2010

Week 1. Exploratory Data Analysis

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Information Visualization WS 2013/14 11 Visual Analytics

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Course Descriptions: Undergraduate/Graduate Certificate Program in Data Visualization and Analysis

Cleaned Data. Recommendations

Geostatistics Exploratory Analysis

2015 Workshops for Professors

Creating Page Layouts using SharePoint Designer or Visual Studio

Exploratory Data Analysis for Ecological Modelling and Decision Support

Data Exploration Data Visualization

Microsoft Azure Machine learning Algorithms

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

03 The full syllabus. 03 The full syllabus continued. For more information visit PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Analytics on Big Data

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

2. A typical business process

An introduction to using Microsoft Excel for quantitative data analysis

an introduction to VISUALIZING DATA by joel laumans

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Diagrams and Graphs of Statistical Data

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Anomaly detection. Problem motivation. Machine Learning

Essentials of Marketing Research

Data exploration with Microsoft Excel: analysing more than one variable

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

The Need for Training in Big Data: Experiences and Case Studies

System Management Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

ADD-INS: ENHANCING EXCEL

Data exploration with Microsoft Excel: univariate analysis

Introduction Course in SPSS - Evening 1

Analytics: The Future of Security

Dimensionality Reduction: Principal Components Analysis

By Ken Thompson, ServQ Alliance

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Software/Applications Programmer Technical Writer E-Commerce Manager. Computer and Electronics Repair Interactive Media Developer

Final Project Report

Data, Measurements, Features

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

January 26, 2009 The Faculty Center for Teaching and Learning

Capturing Meaningful Competitive Intelligence from the Social Media Movement

Customer Analytics. Turn Big Data into Big Value

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA

Data Science and Business Analytics Certificate Data Science and Business Intelligence Certificate

Chapter 4 Displaying and Describing Categorical Data

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

Analyzing and interpreting data Evaluation resources from Wilder Research

COGNOS 8 Business Intelligence

Foundation of Quantitative Data Analysis

MultiAlign Software. Windows GUI. Console Application. MultiAlign Software Website. Test Data

Business Statistics MBA Course Outline

This chapter reviews the general issues involving data analysis and introduces

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

White Paper April 2006

Data Analytics in Organisations and Business

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Machine Learning using MapReduce

Text Mining - Scope and Applications

Republic Polytechnic School of Information and Communications Technology C355 Business Intelligence. Module Curriculum

Introduction to Longitudinal Data Analysis

All Visualizations Documentation

Exploratory data analysis (Chapter 2) Fall 2011

EXPLORATORY DATA ANALYSIS

INTRODUCING DATA ANALYSIS IN A STATISTICS COURSE IN ENVIRONMENTAL SCIENCE STUDIES

ANALYTICS IN ACTION WINNING IN ASSET MANAGEMENT DISTRIBUTION WITH PREDICTIVE SALES INTELLIGENCE WHITEPAPER

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Transcription:

Exploratory Data Analysis with R @matthewrenze #codemash

Motivation The ability to take data to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it that s going to be a hugely important skill in the next decades, because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it. Hal Varian, Google s Chief Economist The McKinsey Quarterly, Jan 2009

Source: Dice 2014 Tech Salary Survey Results

A Flood of Data is Coming... Source: http://www.dot.gov.nt.ca/ Source: Wikipedia Sink or Swim

Overview Introduction to R Data Munging Descriptive Statistics Data Visualization Beyond R and EDA

How Does This Apply to Me? As a software developer, I often: Perform log file analysis Analyze software performance Analyze code metrics for code quality Detect anomalies in source data Transform or clean data files to make them usable Help decision makers make decisions based on data

About Me Independent software consultant Education B.S. in Computer Science B.A. in Philosophy Community Pluralsight Author ASPInsider Public Speaker Open-Source Software

Introduction to R

What is R? Open source Language and environment Numerical and graphical analysis Cross platform

What is R? Active development Large user community Modular and extensible 6700+ extensions and best of all

FREE

FREE

Source: http://redmonk.com/sogrady/2014/06/13/language-rankings-6-14/

Source: www.rstudio.com/ide/

Code Demo

Data Munging

Data Munging Transforming data Raw data to usable data Data must be cleaned first

Data Munging Tasks Renaming variables Data type conversion Encoding, decoding, or recoding data Merging data sets Transforming data Handling missing data (imputing) Handling anomalous values

Loading Data in R File-based data Web-based data Databases Statistical data And many more CSV XML

Cleaning Data This step is often the: Most difficult Most time consuming TIP: Record all steps

Column with wrong name Rows with missing values Runtime column has units Revenue in multiple scales Wrong file format

Code Demo

Descriptive Statistics

Descriptive Statistics Describe data Provides a summary aka: Summary statistics Movie Runtime Statistic Value (minutes) Minimum 38 1 st Quartile 93 Median 101 Mean 104 3 rd Quartile 113 Maximum 219

Statistical Terms Observations Variables Qualitative variable Quantitative variable ID Date Customer Product Quantity 1 2015-08-27 John Pizza 2 2 2015-08-27 John Soda 2 3 2015-08-27 Jill Salad 1 4 2015-08-27 Jill Milk 1 5 2015-08-28 Miko Pizza 3 6 2015-08-28 Miko Soda 2 7 2015-08-28 Sam Pizza 1 8 2015-08-28 Sam Milk 1

Number of Variables Types of Analysis Number of variables Univariate Bivariate Multivariate Qualitative Univariate Analysis Quantitative Univariate Analysis Type of variables Qualitative Quantitative Qualitative Bivariate Analysis Qualitative & Quantitative Bivariate Analysis Quantitative Bivariate Analysis Multivariate Analysis Type of Variable(s)

Univariate Analysis One variable Qualitative Frequency Quantitative Central tendency Dispersion

Bivariate Analysis Qualitative Joint frequency Quantitative Two variables Predictor Outcome Measures Covariance Correlation

Cowboys & Space Invaders: The Musical Extended Edition

Code Demo

Cowboys & Space Invaders: The Musical Extended Edition

Data Visualization

Data Visualization Visual data representation For human pattern recognition Map dimensions to visual characteristics

Data Visualization ID Date Customer Product Quantity 1 2015-08-27 John Pizza 2 2 2015-08-27 John Soda 2 3 2015-08-27 Jill Salad 1 4 2015-08-27 Jill Milk 1 5 2015-08-28 Miko Pizza 3 6 2015-08-28 Miko Soda 2 7 2015-08-28 Sam Pizza 1 8 2015-08-28 Sam Milk 1

Number of Variables Types of Data Visualizations Number of variables Univariate Bivariate Multivariate Qualitative Univariate Analysis Quantitative Univariate Analysis Type of variable(s) Qualitative Quantitative Qualitative Bivariate Analysis Qualitative & Quantitative Bivariate Analysis Quantitative Bivariate Analysis Multivariate Analysis Type of Variable(s)

Cowboys & Space Invaders: The Musical Extended Edition

Code Demo

The Warlord of the Rings : An Unexpected Adventure Feature Length PG

Beyond R and EDA

This is just the tip of the iceberg! This is just the tip of the iceberg!

Advanced Data Analysis with R Cluster Analysis Statistical Modeling Dimensionality Reduction Analysis of Variance (ANOVA)

Source: Nathan Yau (www.flowingdata.com)

Data Mining and Machine Learning with R

Alternatives to R for EDA

Source: Microsoft

Where to Go Next R website: http://www.cran.r-project.org R Studio: http://www.rstudio.com Revolutions: http://blog.revolutionanalytics.com Flowing Data: http://flowingdata.com R-Blogger: http://www.r-bloggers.com

www.pluralsight.com/courses/r-data-analysis

Conclusion

Conclusion Introduction to R Data munging Descriptive statistics Data visualization Beyond R & EDA

Feedback Feedback is very important to me! One thing you liked? One thing I could improve?

Contact Info Matthew Renze Twitter: @matthewrenze Email: matthew@renzeconsulting.com Website: www.matthewrenze.com