Chapter 2 Exploratory Data Analysis
|
|
|
- Basil Wheeler
- 10 years ago
- Views:
Transcription
1 Chapter 2 Exploratory Data Analysis 2.1 Objectives Nowadays, most ecological research is done with hypothesis testing and modelling in mind. However, Exploratory Data Analysis (EDA), which uses visualization tools and computes synthetic descriptors, is still required at the beginning of the statistical analysis of multidimensional data, in order to: Get an overview of the data Transform or recode some variables Orient further analyses As a worked example, we explore a classical dataset to introduce some techniques of EDA using R functions found in standard packages. In this chapter, you will: Learn or revise some bases of the R language Learn some EDA techniques applied to multidimensional ecological data Explore the Doubs dataset in hydrobiology as a first worked example 2.2 Data Exploration Data Extraction The Doubs dataset used here is available in the form of three comma separated values (CSV) files along with the rest of the material (see Chap. 1). D. Borcard et al., Numerical Ecology with R, Use R, DOI 1.17/ _2, Springer Science+Business Media, LLC 211 9
2 1 2 Exploratory Data Analysis Hints At the beginning of a session, make sure to place all necessary data files and scripts in a single folder and define this folder as your working directory, either through the menu or by using function setwd(). If you are uncertain of class(object_name). the class of an object, type Species Data: First Contact We can start data exploration, which first focuses on the community data (object spe created above). Verneaux used a semi-quantitative, species-specific, abundance scale ( 5) so that comparisons between species abundances make sense. However, species-specific codes cannot be understood as unbiased estimates of the true abundances (number or density of individuals) or biomasses at the sites. We first apply some basic R functions and draw a barplot (Fig. 2.1):
3 2.2 Data Exploration 11 4 Frequency Fig. 2.1 Barplot of abundance classes Abundance class 4 5
4 12 2 Exploratory Data Analysis Species Data: A Closer Look The commands above give an idea about the data structure. But codes and numbers are not very attractive or inspiring, so let us illustrate some features. We first create a map of the sites (Fig. 2.2): Site Locations Downstream 1 1 y coordinate (km) Upstream x coordinate (km) Fig. 2.2 Map of the 3 sampling sites along the Doubs River
5 2.2 Data Exploration 13 Now, the river looks more real, but where are the fish? To show the distribution and abundance of the four species used to characterize ecological zones in European rivers (Fig. 2.3), one can type: Hint Note the use of the cex argument in the plot() function: cex is used to define the size of an item in a graph. Here its value is a vector of the spe data frame, i.e. the abundances of a given species (e.g. cex=spe$tru). The result is a series of bubbles whose diameter at each site is proportional to the species abundance. Also, since the object spa contains only two variables x and y, the formula has been simplified by replacing the two first arguments for horizontal and vertical axes by the name of the data frame.
6 14 2 Exploratory Data Analysis x coordinate (km) Barbel Common bream 15 1 y coordinate (km) x coordinate (km) y coordinate (km) y coordinate (km) 2 Grayling y coordinate (km) Brown trout x coordinate (km) x coordinate (km) Fig. 2.3 Bubble maps of the abundance of four fish species At how many sites does each species occur? Calculate the relative frequencies of species (proportion of the number of sites) and plot histograms (Fig. 2.4):
7 2.2 Data Exploration 15 Hint Examine the use of the apply() function, applied here to the columns of the data frame spe. Note that the first part of the function call (spe > ) evaluates the values in the data frame to TRUE/ FALSE, and the number of TRUE cases per column is counted by summing. Species Relative Frequencies Number of species Number of species Species Occurrences Number of occurrences Frequency of occurrences (%) Fig. 2.4 Frequency histograms: species occurrences and relative frequencies in the 3 sites
8 16 2 Exploratory Data Analysis Now that we have seen at how many sites each species is present, we may want to know how many species are present at each site (species richness, Fig. 2.5): Hint Observe the use of the type="s" argument of the plot() function to draw steps between values.
9 2.2 Data Exploration 17 Species Richness vs. Upstream-Downstream Gradient y coordinate (km) Species richness Map of Species Richness Positions of sites along the river x coordinate (km) Fig. 2.5 Species richness along the river Finally, one can easily compute classical diversity indices from the data. Let us do it with the function diversity() of the vegan package. Hint Note the special use of function rowsums() for the computation of species richness N. Normally, rowsums(array) computes the sums of the rows in that array. Here, argument spe > calls for the sum of the cases where the value is greater than.
10 18 2 Exploratory Data Analysis Hill s numbers (N), which are all expressed in the same units, and ratios (E) derived from these numbers, can be used to compute diversity indices instead of popular formulae for Shannon entropy (H) and Pielou evenness (J). Note that there are other ways of estimating diversity while taking into account unobserved species (e.g. Chao and Shen 23) Species Data Transformation There are instances where one needs to transform the data prior to analysis. The main reasons are given below with examples of transformations: Make descriptors that have been measured in different units comparable (ranging, standardization to z-scores, i.e. centring and reduction, also called scaling) Make the variables normal and stabilize their variances (e.g. square root, fourth root, log transformations) Make the relationships among variables linear (e.g. log transformation of response variable if the relationship is exponential) Modify the weights of the variables or objects (e.g. give the same length (or norm) to all object vectors) Code categorical variables into dummy binary variables or Helmert contrasts Species abundances are dimensionally homogenous (expressed in the same p hysical units), quantitative (count, density, cover, biovolume, biomass, frequency, etc.) or semi-quantitative (classes) variables and restricted to positive or null values (zero meaning absence). For these, simple transformations may be used to reduce the importance of observations with very high values: sqrt() (square root), sqrt(sqrt()) (fourth root), or log1p() (natural logarithm of abundance + 1 to keep absence as zero) are commonly applied R functions. In extreme cases, to give the same weight to all positive abundances irrespective of their values, the data can be transformed to binary 1- form (presence absence). The decostand() function of the vegan package provides many options for common standardization of ecological data. In this function, standardization, as contrasted with simple transformation (such as square root, log or presence absence), means that the values are not transformed individually but relative to other values in the data table. Standardization can be done relative to sites (site profiles), species (species profiles), or both (double profiles), depending on the focus of the analysis. Here are some examples illustrated by boxplots (Fig. 2.6):
11 2.2 Data Exploration 19
12 2 2 Exploratory Data Analysis
13 2.2 Data Exploration Hint Take a look at the line: norm <- function(x) sqrt(x%*%x). It is an example of a small function built on the fly to fill a gap in the standard R packages: this function computes the norm (length) of a vector using a matrix algebraic form of Pythagora s theorem. For more matrix algebra, visit the Code It Yourself corners. 21
14 22 2 Exploratory Data Analysis Simple transformation Standardization by species raw data sqrt log max Standardization by sites total Double standardization Hellinger total norm Chi-square Wisconsin Fig. 2.6 Boxplots of transformed abundances of a common species, Nemacheilus barbatulus (stone loach) Another way to compare the effects of transformations on species profiles is to plot them along the river course:
15 2.2 Data Exploration 23 The Code It Yourself corner #1 Write a function to compute the Shannon Weaver entropy for a site vector containing species abundances. The formula is: H = [pi log( pi )] where pi = ni / N and ni = abundance of species i and N = total abundance of all species. After that, display the code of vegan s function diversity() to see how it has been coded among other indices by Jari Oksanen and Bob O Hara. Nice and compact, isn t it?
16 24 2 Exploratory Data Analysis Environmental Data Now that we are acquainted with the species data, let us turn to the environmental data (object env). First, go back to Sect and apply the basic functions presented there to env. While examining the summary(), note how the variables differ from the species data in values and spatial distributions. Draw maps of some of the environmental variables, first in the form of bubble maps (Fig. 2.7): Hint See how the cex argument is used to make the size of the bubbles comparable among plots. Play with these values to see the changes in the graphical output.
17 2.2 Data Exploration 25 Discharge y y Altitude x x Oxygen Nitrate y y x x Fig. 2.7 Bubble maps of environmental variables Now, examine the variation of some descriptors along the stream (Fig. 2.8):
18 26 2 Exploratory Data Analysis Fig. 2.8 Line plots of environmental variables To explore graphically the bivariate relationships among the environmental v ariables, we can use the powerful pairs() graphical function, which draws a matrix of scatter plots (Fig. 2.9).
19 2.2 Data Exploration 27 Moreover, we can add a LOWESS smoother to each bivariate plot and draw histograms in the diagonal plots, showing the frequency distribution of each variable, using external functions of the panelutils.r script. Hint Each scatterplot shows the relationship between two variables identified on the diagonal. The abscissa of the scatterplot is the variable above or under it, and the ordinate is the variable to its left or right. Simple transformations, such as the log transformation, can be used to improve the distributions of some variables (make it closer to the normal distribution). Furthermore, because environmental variables are dimensionally heterogeneous (expressed in different units and scales), many statistical analyses require their standardization to zero mean and unit variance. These centred and scaled variables are called z-scores. We can now illustrate transformations and standardization with our example data (Fig. 2.1).
20 28 2 Exploratory Data Analysis Bivariate Plots with Histograms and Smooth Curves das alt 3 pen deb ph 4 9 dur 2 4 pho 3 6 nit. 1.5 amm dbo Fig. 2.9 Scatter plots between all pairs of environmental variables with LOWESS smoothers oxy
21 2.2 Data Exploration 29 Hint Normality of a vector can be tested by using the Shapiro Wilk test, available through function shapiro.test(). Histogram of ln(env$pen) Frequency Frequency 3 Histogram of env$pen env$pen log(env$pen) Boxplot of ln(env$pen) log(env$pen) 1 env$pen Boxplot of env$pen Fig. 2.1 Histograms and boxplots of the untransformed (left) and log-transformed pen variable (slope)
22 3 2 Exploratory Data Analysis 2.3 Conclusion The tools presented in this chapter allow researchers to obtain a general impression of their data. Although you see much more elaborate analyses in the next chapters, keep in mind that a first exploratory look at the data can tell much about them. Information about simple parameters and distributions of variables is important to consider in order to choose more advanced analyses correctly. Graphical representations like bubble maps are useful to reveal how the variables are spatially organized; they may help generate hypotheses about the processes acting behind the scene. Boxplots and simple statistics may be necessary to reveal unusual or aberrant values. EDA is often neglected by people who are eager to jump to more sophisticated analyses. We hope to have convinced you that it should have an important place in the toolbox of ecologists.
23
Lecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
Diagrams and Graphs of Statistical Data
Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in
Fairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
Getting Started with R and RStudio 1
Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following
Data Exploration Data Visualization
Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select
Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering
Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques
Exploratory Data Analysis
Exploratory Data Analysis Johannes Schauer [email protected] Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA
Paper 156-2010 The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA Abstract JMP has a rich set of visual displays that can help you see the information
Introduction Course in SPSS - Evening 1
ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/
Data exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
Introduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model
Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity
Dimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.
1 Commands in JMP and Statcrunch Below are a set of commands in JMP and Statcrunch which facilitate a basic statistical analysis. The first part concerns commands in JMP, the second part is for analysis
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
How To Run Statistical Tests in Excel
How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting
Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.
Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
TIPS FOR DOING STATISTICS IN EXCEL
TIPS FOR DOING STATISTICS IN EXCEL Before you begin, make sure that you have the DATA ANALYSIS pack running on your machine. It comes with Excel. Here s how to check if you have it, and what to do if you
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
Algebra I Vocabulary Cards
Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression
Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013
Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives
Mini-project in TSRT04: Cell Phone Coverage
Mini-project in TSRT04: Cell hone Coverage 19 August 2015 1 roblem Formulation According to the study Swedes and Internet 2013 (Stiftelsen för Internetinfrastruktur), 99% of all Swedes in the age 12-45
STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.
STATGRAPHICS Online Statistical Analysis and Data Visualization System Revised 6/21/2012 Copyright 2012 by StatPoint Technologies, Inc. All rights reserved. Table of Contents Introduction... 1 Chapter
CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA
We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical
How To Use Statgraphics Centurion Xvii (Version 17) On A Computer Or A Computer (For Free)
Statgraphics Centurion XVII (currently in beta test) is a major upgrade to Statpoint's flagship data analysis and visualization product. It contains 32 new statistical procedures and significant upgrades
Algebra 1 Course Information
Course Information Course Description: Students will study patterns, relations, and functions, and focus on the use of mathematical models to understand and analyze quantitative relationships. Through
Common Core Unit Summary Grades 6 to 8
Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations
Straightening Data in a Scatterplot Selecting a Good Re-Expression Model
Straightening Data in a Scatterplot Selecting a Good Re-Expression What Is All This Stuff? Here s what is included: Page 3: Graphs of the three main patterns of data points that the student is likely to
Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures
Introductory Statistics Lectures Visualizing Data Descriptive Statistics I Department of Mathematics Pima Community College Redistribution of this material is prohibited without written permission of the
IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
Geostatistics Exploratory Analysis
Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras [email protected]
MTH 140 Statistics Videos
MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
The Dummy s Guide to Data Analysis Using SPSS
The Dummy s Guide to Data Analysis Using SPSS Mathematics 57 Scripps College Amy Gamble April, 2001 Amy Gamble 4/30/01 All Rights Rerserved TABLE OF CONTENTS PAGE Helpful Hints for All Tests...1 Tests
KaleidaGraph Quick Start Guide
KaleidaGraph Quick Start Guide This document is a hands-on guide that walks you through the use of KaleidaGraph. You will probably want to print this guide and then start your exploration of the product.
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
5 Correlation and Data Exploration
5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both
Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010
Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different
Big Ideas in Mathematics
Big Ideas in Mathematics which are important to all mathematics learning. (Adapted from the NCTM Curriculum Focal Points, 2006) The Mathematics Big Ideas are organized using the PA Mathematics Standards
STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables
Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Discrete vs. continuous random variables Examples of continuous distributions o Uniform o Exponential o Normal Recall: A random
Module 5: Statistical Analysis
Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
Exploratory Data Analysis for Ecological Modelling and Decision Support
Exploratory Data Analysis for Ecological Modelling and Decision Support Gennady Andrienko & Natalia Andrienko Fraunhofer Institute AIS Sankt Augustin Germany http://www.ais.fraunhofer.de/and 5th ECEM conference,
Indiana State Core Curriculum Standards updated 2009 Algebra I
Indiana State Core Curriculum Standards updated 2009 Algebra I Strand Description Boardworks High School Algebra presentations Operations With Real Numbers Linear Equations and A1.1 Students simplify and
SPSS Explore procedure
SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,
Getting started with qplot
Chapter 2 Getting started with qplot 2.1 Introduction In this chapter, you will learn to make a wide variety of plots with your first ggplot2 function, qplot(), short for quick plot. qplot makes it easy
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................
Principles of Data Visualization for Exploratory Data Analysis. Renee M. P. Teate. SYS 6023 Cognitive Systems Engineering April 28, 2015
Principles of Data Visualization for Exploratory Data Analysis Renee M. P. Teate SYS 6023 Cognitive Systems Engineering April 28, 2015 Introduction Exploratory Data Analysis (EDA) is the phase of analysis
12: Analysis of Variance. Introduction
1: Analysis of Variance Introduction EDA Hypothesis Test Introduction In Chapter 8 and again in Chapter 11 we compared means from two independent groups. In this chapter we extend the procedure to consider
January 26, 2009 The Faculty Center for Teaching and Learning
THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i
CHAPTER TWELVE TABLES, CHARTS, AND GRAPHS
TABLES, CHARTS, AND GRAPHS / 75 CHAPTER TWELVE TABLES, CHARTS, AND GRAPHS Tables, charts, and graphs are frequently used in statistics to visually communicate data. Such illustrations are also a frequent
Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller
Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller Most of you want to use R to analyze data. However, while R does have a data editor, other programs such as excel are often better
Module 3: Correlation and Covariance
Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis
Scatter Plots with Error Bars
Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
Part II Chapter 9 Chapter 10 Chapter 11 Chapter 12 Chapter 13 Chapter 14 Chapter 15 Part II
Part II covers diagnostic evaluations of historical facility data for checking key assumptions implicit in the recommended statistical tests and for making appropriate adjustments to the data (e.g., consideration
Bill Burton Albert Einstein College of Medicine [email protected] April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1
Bill Burton Albert Einstein College of Medicine [email protected] April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Calculate counts, means, and standard deviations Produce
Simple Linear Regression, Scatterplots, and Bivariate Correlation
1 Simple Linear Regression, Scatterplots, and Bivariate Correlation This section covers procedures for testing the association between two continuous variables using the SPSS Regression and Correlate analyses.
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs
Using Excel Jeffrey L. Rummel Emory University Goizueta Business School BBA Seminar Jeffrey L. Rummel BBA Seminar 1 / 54 Excel Calculations of Descriptive Statistics Single Variable Graphs Relationships
How To Write A Data Analysis
Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction
DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS
DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 [email protected] 1. Descriptive Statistics Statistics
determining relationships among the explanatory variables, and
Chapter 4 Exploratory Data Analysis A first look at the data. As mentioned in Chapter 1, exploratory data analysis or EDA is a critical first step in analyzing the data from an experiment. Here are the
Exploratory Data Analysis
Exploratory Data Analysis Learning Objectives: 1. After completion of this module, the student will be able to explore data graphically in Excel using histogram boxplot bar chart scatter plot 2. After
Module 2: Introduction to Quantitative Data Analysis
Module 2: Introduction to Quantitative Data Analysis Contents Antony Fielding 1 University of Birmingham & Centre for Multilevel Modelling Rebecca Pillinger Centre for Multilevel Modelling Introduction...
Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard
Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express
Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.
Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.
R with Rcmdr: BASIC INSTRUCTIONS
R with Rcmdr: BASIC INSTRUCTIONS Contents 1 RUNNING & INSTALLATION R UNDER WINDOWS 2 1.1 Running R and Rcmdr from CD........................................ 2 1.2 Installing from CD...............................................
Microsoft Excel. Qi Wei
Microsoft Excel Qi Wei Excel (Microsoft Office Excel) is a spreadsheet application written and distributed by Microsoft for Microsoft Windows and Mac OS X. It features calculation, graphing tools, pivot
Didacticiel - Études de cas
1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.
GRADES 7, 8, AND 9 BIG IDEAS
Table 1: Strand A: BIG IDEAS: MATH: NUMBER Introduce perfect squares, square roots, and all applications Introduce rational numbers (positive and negative) Introduce the meaning of negative exponents for
with functions, expressions and equations which follow in units 3 and 4.
Grade 8 Overview View unit yearlong overview here The unit design was created in line with the areas of focus for grade 8 Mathematics as identified by the Common Core State Standards and the PARCC Model
Directions for using SPSS
Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...
A short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package.
A short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package. Lab 2 - June, 2008 1 jointdata objects To analyse longitudinal data
Exploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA
VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA Csilla Csendes University of Miskolc, Hungary Department of Applied Mathematics ICAM 2010 Probability density functions A random variable X has density
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Gamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
Prentice Hall Mathematics: Algebra 2 2007 Correlated to: Utah Core Curriculum for Math, Intermediate Algebra (Secondary)
Core Standards of the Course Standard 1 Students will acquire number sense and perform operations with real and complex numbers. Objective 1.1 Compute fluently and make reasonable estimates. 1. Simplify
Package MDM. February 19, 2015
Type Package Title Multinomial Diversity Model Version 1.3 Date 2013-06-28 Package MDM February 19, 2015 Author Glenn De'ath ; Code for mdm was adapted from multinom in the nnet package
SPSS Manual for Introductory Applied Statistics: A Variable Approach
SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate
Description. Textbook. Grading. Objective
EC151.02 Statistics for Business and Economics (MWF 8:00-8:50) Instructor: Chiu Yu Ko Office: 462D, 21 Campenalla Way Phone: 2-6093 Email: [email protected] Office Hours: by appointment Description This course
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
There are six different windows that can be opened when using SPSS. The following will give a description of each of them.
SPSS Basics Tutorial 1: SPSS Windows There are six different windows that can be opened when using SPSS. The following will give a description of each of them. The Data Editor The Data Editor is a spreadsheet
COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3
COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping
Programming Exercise 3: Multi-class Classification and Neural Networks
Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks
Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode
Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data
