The ith principal component (PC) is the line that follows the eigenvector associated with the ith largest eigenvalue.


More Principal Components

Summary

Principal Components (PCs) are associated with the eigenvectors of either the covariance or correlation matrix of the data.

The ith principal component (PC) is the line that follows the eigenvector associated with the ith largest eigenvalue.
The ith eigenvalue measures the variance in the direction of the ith principal component.
The ratio of the ith eigenvalue to the sum of the eigenvalues is the proportion of variation explained by the ith PC.
Normally, we're more interested in the cumulative proportion of variation explained by the first q PCs.

In practice, the first few PCs do the lion's share of explaining the variance. Eventually, the law of diminishing returns kicks in, and adding an additional PC brings very little improvement in the amount of variation explained. For this reason, PCs are often used as a dimension reduction technique: keep the first few PCs, discard the rest. How many to keep is up to you. The scree plot (see below) is sometimes helpful. In practice, people often like to use only the first two or three because they are easily visualized. If the first two capture most of the variation, then you can use a two-dimensional scatterplot to visualize a p-dimensional data set.

Features

The first PC is chosen so that it is aligned with the direction of maximum variance. The second is chosen to be orthogonal to the first, in the direction of maximum remaining variance. And so on. The result is that the PCs are uncorrelated with each other (and statistically independent if the variables are jointly normal). If the variables are normally distributed, then their cloud is a hyper-ellipse, and the PCs run along the axes of the ellipse.

The predicted values, also called scores, are found by multiplying each point by the eigenvector. This is the same as projecting each point onto the corresponding PC. The scores are, therefore, a linear combination of the initial variables. The eigenvectors give the recipe for creating scores: take this much of the first variable plus this much of the second, etc.
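The summary and the recipe for scores can be sketched in a few lines of R. This is only an illustration on simulated data, not the data used in these notes, and the names X, e and scores are made up for the example. R's own routines center each variable before multiplying by the eigenvectors, so the sketch does the same.

```r
# Simulated data: three variables, two of them strongly related (illustration only)
set.seed(1)
n <- 100
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n, sd = 0.5),
           x2 = 2 * z + rnorm(n, sd = 0.5),
           x3 = rnorm(n))

# Eigen-decomposition of the covariance matrix
e <- eigen(cov(X))
e$values                      # variance along each PC, largest first

# Proportion and cumulative proportion of variation explained
prop <- e$values / sum(e$values)
cumsum(prop)

# Scores: center each column, then multiply by the eigenvectors
Xc     <- scale(X, center = TRUE, scale = FALSE)
scores <- Xc %*% e$vectors

# The variance of each column of scores equals the matching eigenvalue,
# and the columns of scores are uncorrelated with each other
apply(scores, 2, var)
round(cor(scores), 10)
```

Each column of e$vectors is one recipe: its entries say how much of x1, x2 and x3 goes into that PC.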
The components of the eigenvectors are sometimes called the loadings.

Correlation or Covariance?

If one variable has much larger variance than the others, then it will tend to dominate the first principal component. Sometimes this variation is just an artifact of the units chosen. For example, if we have measured the heights of four people in units of feet, we might see 5.8, 5.3, 5.9, 6.0, which has a variance of .0967 squared feet. Converting the same list to inches, however, gives a variance of 13.92 squared inches (144 times as large). A much bigger value in absolute terms.

When dealing with such situations (in which variables differ drastically in variance or in which variables are measured in different units), it is best to use the correlation matrix. This is equivalent to first standardizing each variable. That is, if x represents the humidity reading on day i, then replace it with (x - xbar)/s, where xbar is the average humidity over all days and s is the standard deviation.

Applied Principal Components

Principal Component Analysis is an exploratory technique useful for finding patterns or structure in high-dimensional data sets. Two immediate uses are dimension reduction and the elimination of collinearity. But it is also useful as a more general tool for understanding the structure of the data.

It is used in regression, sometimes, to solve the problem of collinearity. If a collection of p variables is collinear, then transforming them to p PCs produces a new set of variables that are uncorrelated with each other, so the collinearity disappears. Unfortunately, these new variables are linear combinations of the old, and might have lost their interpretability. Sometimes you get lucky, though, and the linear combination has physical meaning.

For example, I've seen an example in which a biologist wished to compare squirrels living in different habitats. She was wondering if there were measurable physical characteristics that differed. So she measured height, width, length, ear height, etc. from both groups. It turned out that the first principal component was a linear combination that gave strong weights to width, length and height, and small weights to the other variables. She interpreted this to be a size index. The second principal component strongly weighted variables which she could interpret to be a shape index. The remaining PCs were discarded, and she now had two indices with which to compare the populations.

A PC analysis might include these steps:

Examine the cumulative explained variation and decide how many PCs to keep.
Examine the loadings of the first few PCs to determine if they are interpretable. Are some variables much more important than others?
Examine the relationship of variables to each other. Which are most alike?
Examine data points with respect to their PCs. Do they cluster? Are new trends apparent?
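The effect of units, the link between the correlation matrix and standardized variables, and the removal of collinearity can all be checked directly. This is a sketch: the four heights are the ones quoted in the section on correlation versus covariance, while the second data set (columns a, b, c) is simulated purely for illustration.

```r
# Heights from the text, in feet and then in inches
feet   <- c(5.8, 5.3, 5.9, 6.0)
inches <- 12 * feet
var(feet)       # about 0.0967 squared feet
var(inches)     # 144 times larger: about 13.92 squared inches

# Simulated, strongly collinear variables on very different scales (made-up data)
set.seed(2)
n <- 200
u <- rnorm(n)
X <- cbind(a = u + rnorm(n, sd = 0.3),
           b = 100 * (u + rnorm(n, sd = 0.3)),
           c = rnorm(n))

# With the covariance matrix, the large-variance variable b dominates PC1
round(eigen(cov(X))$vectors[, 1], 3)

# The correlation matrix is the covariance matrix of the standardized variables
Z <- scale(X)                 # (x - xbar) / s, column by column
all.equal(cov(Z), cor(X), check.attributes = FALSE)

# a and b are highly correlated, but the PC scores are not
scores <- Z %*% eigen(cor(X))$vectors
round(cor(X), 2)
round(cor(scores), 10)
```

The scores could then replace a, b and c as regression predictors, at the cost of having to interpret each score as a mixture of the original variables.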

Graphical Tools

The screeplot is, quite simply, a plot of the variances (the eigenvalues) on the vertical axis against the integers 1, 2, ..., p on the horizontal axis. The purpose is to visualize how quickly the additional variation falls off. Often, the screeplot will descend quickly and then level out. Many people will drop PCs that occur after this kink at which the graph levels off. For some strange reason, R doesn't draw this as a line plot by default, but instead draws a bar graph. This is a little more awkward, but the essential features remain.

The biplot is a much more useful tool. It is a two-dimensional projection of the data onto the first two PCs. Often the points are labelled with a meaningful identifier to aid in picking up trends. On the same plot, we then include vectors that represent the variables. These are plotted using the loadings of the first two PCs. Easier demonstrated than explained.

Example

Suppose we have three variables, Height, Weight, Hatsize. And suppose that, for our data set consisting of measurements on 100 men and women, we get these eigenvectors:

[Table of eigenvectors: rows Height, Weight, Hatsize; columns PC1, PC2, PC3; values not preserved]

The first person in our data set, say, has these measurements: Height = 5.2 ft, Weight = 145 pounds, Hatsize = 6.5. (Assume we've somewhat incorrectly used the covariance matrix.) Then that person gets a point plotted at these coordinates:

PC1 = .33 * 5.2 + [Weight loading on PC1] * 145 + .10 * 6.5 (on the horizontal axis)
PC2 = -.20 * 5.2 + [Weight loading on PC2] * 145 + 0 * 6.5 (on the vertical axis)

And so on for each of the 100 observations. We might label these points M and F on the plot to see if there are differences between the men and the women.

The variable Height gets a vector drawn pointing in the direction from the origin (0,0) to the point (.33, -.20). Weight gets a vector drawn from (0,0) to the point given by its own loadings on PC1 and PC2, and Hatsize from (0,0) to (.10, 0). The lengths of these vectors are then scaled so that they are proportional to the variance.

The biplot has several useful features:

Points that are near each other are observations that had similar scores.
The cosine of the angle between two vectors approximates the correlation between those variables. Hence vectors pointing in the same direction are perfectly correlated, and those at right angles are uncorrelated.
The length of the difference vector between any two vectors is equal to the sampling variance of the difference of those two variables.

Using R

R has several ways of doing principal component analysis.

The eigen function returns eigenvalues and eigenvectors.
The prcomp function is a numerically stable routine that returns a prcomp object that contains the square root of the eigenvalues (sdev), the eigenvectors (rotation), and the scores (x).
The princomp function is slightly less stable, but has more features. It returns a princomp object that contains the square root of the eigenvalues (sdev), the eigenvectors (loadings), the means for each variable (center) and the scores (scores), as well as some other things.

Typing summary() on a princomp or prcomp object will return the percent of variation explained by each PC. Typing plot() on either object will return a screeplot. Typing biplot() on a princomp object returns a biplot. This did not work for prcomp when these notes were written; current versions of R provide biplot for prcomp objects as well.

Example 1: Violent crime in the US

The dataset USArrests that comes with R contains information on the number of arrests per 100,000 residents in each of the 50 US states in 1973 for 3 types of crimes. It also includes the percent of the population living in urban areas. For example:

[Table: Murder, Assault, UrbanPop and Rape for Alabama, Alaska and California; values not preserved]

From which we learn that Alabama had a higher arrest rate for Murder than California. So, if you don't want to be murdered, California is perhaps the safer state. But of course, less safe for Assault and Rape. What is the best state overall for safety with respect to these measures? Is there a relationship between the size of the urban population and crime?
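The three routines can be put side by side on the USArrests data just introduced. This is a sketch that assumes a current R installation and uses only the components named above. The sign of an eigenvector is arbitrary and can differ between routines, so the comparison of eigenvectors is made on absolute values.

```r
# A few rows of the data, and the variance of each column
USArrests[c("Alabama", "Alaska", "California"), ]
sapply(USArrests, var)            # Assault varies far more than the others

# Three routes to the same decomposition, all based on the correlation matrix
e  <- eigen(cor(USArrests))
p1 <- prcomp(USArrests, scale. = TRUE)
p2 <- princomp(USArrests, cor = TRUE)

# Square roots of the eigenvalues agree
sqrt(e$values)
p1$sdev
unname(p2$sdev)

# Eigenvectors agree up to sign
max(abs(abs(e$vectors) - abs(p1$rotation)))
max(abs(abs(e$vectors) - abs(unclass(p2$loadings))))

# Other components
p2$center                         # variable means
head(p1$x, 3)                     # scores from prcomp
head(p2$scores, 3)                # scores from princomp

summary(p2)                       # proportion of variation explained
screeplot(p2, type = "lines")     # line version of the screeplot
biplot(p2)
```

The two sets of scores differ by a constant factor close to 1, because princomp divides by n when computing variances while prcomp divides by n - 1.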

Because the variables are very different in scale, we'll base our analysis on the correlation matrix. (We do this by including an option cor = TRUE in the call.)

> out <- princomp(USArrests, cor = TRUE)
> summary(out)

[Output: Importance of components, giving Standard deviation, Proportion of Variance and Cumulative Proportion for Comp.1 to Comp.4; values not preserved]

The first three PCs explain almost all of the variation. There's certainly little to be gained by adding the fourth. If we stick with only the first two, we get an adequate (maybe) amount of the structure preserved.

> plot(out)

This screeplot shows that there's not much to be gained as we move from Comp.3 to Comp.4. Let's examine the loadings:

> out$loadings

[Output: Loadings of Murder, Assault, UrbanPop and Rape on Comp.1 to Comp.4; values not preserved]

The first principal component is an average of the three types of crime, with a little bit more added in for the percent of the population living in cities. States with high crime rates in all three categories will get a large negative score here, and it will be even larger if a lot of the population lives in cities. Put differently, states with slightly lower crime rates can still score high here if they have large urban populations.

The second principal component places the highest weights on UrbanPop and Murder, and in fact on the difference between them. States with high urban populations and low murder will score big negative values here. States with low urban populations and high murder will get big positive values. This is mitigated, somewhat, by the Rape and Assault rates.

The third PC is very difficult to interpret.

Keep in mind that since the analysis was done on the correlation matrix, terms like low and high mean low or high with respect to average (and measured in standard units). The biplot helps summarize this:

> biplot(out)

First, note that Murder, Assault and Rape are highly correlated with each other, but there is low correlation with UrbanPop! This means that states with a higher than average percentage of residents in cities tend not to make arrests at a higher than average rate.

You can see, now, how the variables contribute to the PCs. The three crime variables are nearly parallel to PC1, and so contribute heavily. But UrbanPop is nearly orthogonal, and so scoring high on it has a negligible effect on your placement along PC1.

States in the center are average on all variables. But look at California: it is extreme on both PCs, and in fact there are few states like it. It has a high urban population and high crime rates. Its very negative score on PC1 means that its overall crime rate is high, and its very negative score on PC2 means that, for its size, it has a low murder arrest rate. (It's the highest in terms of urban population, but only moderately high in terms of murder.)

Generally, states on the right-hand side of the graph have low overall crime rates. States in the upper half have high murder with respect to their low urban populations. To choose states arbitrarily, South and North Dakota are very similar in crime and urban population, as are New Mexico and Michigan. Surprisingly, New Jersey and Hawaii are also similar.
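Some of what is read off the biplot can be checked numerically. This is a sketch using the same call as above. Because the sign of each PC is arbitrary and may differ between R versions, PC1 is re-oriented here so that larger values mean more crime, rather than assuming that negative means high crime.

```r
out <- princomp(USArrests, cor = TRUE)

# Correlations among the crime variables and with UrbanPop
round(cor(USArrests), 2)

# Cumulative proportion of variation explained
cumsum(out$sdev^2) / sum(out$sdev^2)

# Orient PC1 so that larger values mean more crime, whatever sign R chose
pc1 <- out$scores[, 1] * sign(out$loadings["Murder", 1])
sort(pc1, decreasing = TRUE)[1:5]     # states at the high-crime end
pc1["California"]

# Distances in the PC1-PC2 plane for pairs of states mentioned above
d <- as.matrix(dist(out$scores[, 1:2]))
d["South Dakota", "North Dakota"]
d["New Mexico", "Michigan"]
median(d[upper.tri(d)])               # typical distance, for comparison
```

A pair of states counts as similar in the sense of the biplot when its distance is small compared with the typical distance.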

Example 2: Ozone data

Returning to our Ozone data set, we again do our analysis on the correlation matrix.

> out <- princomp(o2, cor = TRUE)
> summary(out)

[Output: Importance of components, giving Standard deviation, Proportion of Variance and Cumulative Proportion for Comp.1 to Comp.9; values not preserved]

We have to go all the way up to the 6th component before explaining a high percent of the variation, although the law of diminishing returns kicks in around the fourth or fifth component.

It would be surprising if we could interpret the PCs in a meaningful way based on their loadings. You are welcome to try. I could find nothing meaningful.

The biplot, however, reveals an interesting structure that we have not yet seen. The data were collected daily, and the numbers thus represent the day on which the observation was made. Recall that points near each other had similar scores. Now, notice that there are many sequences clumped together. For example, in the upper right corner you can see 111, 112, 114, 115, 116 nearby, which means that those five days, all occurring in the same week, had similar weather. This means that our observations were probably not independent, as we assumed they were when we did the regression.

The vectors tell us that humidity, windspeed and pressure are correlated with each other. Visibility and ozone are anti-correlated. The group of humidity, windspeed and pressure is nearly uncorrelated with height and inversionht. (Some of the vectors do not have labels, but I'm looking at the loadings to figure out which is which.)

If you want to try to use these PCs to reduce the collinearity, you will find that your fit is not much improved. But examining the biplot quickly gave us insight into the relationship of the variables and also pointed out the dependence of subsequent days.
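The clumping of consecutive days can also be checked with the autocorrelation of the scores. The Ozone data (o2) are not reproduced here, so this sketch uses simulated daily data with a built-in carry-over from one day to the next. The data frame sim and its column names are stand-ins, not the variables in the author's data set.

```r
# Made-up daily data: two variables share a slowly changing "weather" signal
set.seed(1)
n       <- 200
weather <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
sim <- data.frame(humidity = weather + rnorm(n, sd = 0.5),
                  pressure = weather + rnorm(n, sd = 0.5),
                  wind     = rnorm(n))

out <- princomp(sim, cor = TRUE)

# Label the points by day number, as in the biplot described above
biplot(out, xlabs = as.character(seq_len(n)))

# Lag-1 autocorrelation of the PC1 scores; a value far from 0 suggests
# that neighbouring days are not independent
r1 <- acf(out$scores[, 1], plot = FALSE)$acf[2]
r1
```

With real data, the same two steps (day labels on the biplot, autocorrelation of the first few score columns) give a quick check on the independence assumption before fitting a regression.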


More information

PLOTTING DATA AND INTERPRETING GRAPHS

PLOTTING DATA AND INTERPRETING GRAPHS PLOTTING DATA AND INTERPRETING GRAPHS Fundamentals of Graphing One of the most important sets of skills in science and mathematics is the ability to construct graphs and to interpret the information they

More information

[1] Diagonal factorization

[1] Diagonal factorization 8.03 LA.6: Diagonalization and Orthogonal Matrices [ Diagonal factorization [2 Solving systems of first order differential equations [3 Symmetric and Orthonormal Matrices [ Diagonal factorization Recall:

More information

Linear Programming Notes VII Sensitivity Analysis

Linear Programming Notes VII Sensitivity Analysis Linear Programming Notes VII Sensitivity Analysis 1 Introduction When you use a mathematical model to describe reality you must make approximations. The world is more complicated than the kinds of optimization

More information

by the matrix A results in a vector which is a reflection of the given

by the matrix A results in a vector which is a reflection of the given Eigenvalues & Eigenvectors Example Suppose Then So, geometrically, multiplying a vector in by the matrix A results in a vector which is a reflection of the given vector about the y-axis We observe that

More information

MATH 110 Landscape Horticulture Worksheet #4

MATH 110 Landscape Horticulture Worksheet #4 MATH 110 Landscape Horticulture Worksheet #4 Ratios The math name for a fraction is ratio. It is just a comparison of one quantity with another quantity that is similar. As a Landscape Horticulturist,

More information

A Demonstration of Hierarchical Clustering

A Demonstration of Hierarchical Clustering Recitation Supplement: Hierarchical Clustering and Principal Component Analysis in SAS November 18, 2002 The Methods In addition to K-means clustering, SAS provides several other types of unsupervised

More information

Five Ways to Solve Proportion Problems

Five Ways to Solve Proportion Problems Five Ways to Solve Proportion Problems Understanding ratios and using proportional thinking is the most important set of math concepts we teach in middle school. Ratios grow out of fractions and lead into

More information

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS.

Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS. SPSS Regressions Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. SSRL@American.edu Course Objective This course is designed

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Linear Algebra Review. Vectors

Linear Algebra Review. Vectors Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

More information

Determine If An Equation Represents a Function

Determine If An Equation Represents a Function Question : What is a linear function? The term linear function consists of two parts: linear and function. To understand what these terms mean together, we must first understand what a function is. The

More information

Statistics Chapter 2

Statistics Chapter 2 Statistics Chapter 2 Frequency Tables A frequency table organizes quantitative data. partitions data into classes (intervals). shows how many data values are in each class. Test Score Number of Students

More information

Multivariate Analysis

Multivariate Analysis Table Of Contents Multivariate Analysis... 1 Overview... 1 Principal Components... 2 Factor Analysis... 5 Cluster Observations... 12 Cluster Variables... 17 Cluster K-Means... 20 Discriminant Analysis...

More information

Unit 9 Describing Relationships in Scatter Plots and Line Graphs

Unit 9 Describing Relationships in Scatter Plots and Line Graphs Unit 9 Describing Relationships in Scatter Plots and Line Graphs Objectives: To construct and interpret a scatter plot or line graph for two quantitative variables To recognize linear relationships, non-linear

More information

Freehand Sketching. Sections

Freehand Sketching. Sections 3 Freehand Sketching Sections 3.1 Why Freehand Sketches? 3.2 Freehand Sketching Fundamentals 3.3 Basic Freehand Sketching 3.4 Advanced Freehand Sketching Key Terms Objectives Explain why freehand sketching

More information

A Brief Introduction to SPSS Factor Analysis

A Brief Introduction to SPSS Factor Analysis A Brief Introduction to SPSS Factor Analysis SPSS has a procedure that conducts exploratory factor analysis. Before launching into a step by step example of how to use this procedure, it is recommended

More information

GRADE SIX-CONTENT STANDARD #4 EXTENDED LESSON A Permission Granted. Making a Scale Drawing A.25

GRADE SIX-CONTENT STANDARD #4 EXTENDED LESSON A Permission Granted. Making a Scale Drawing A.25 GRADE SIX-CONTENT STANDARD #4 EXTENDED LESSON A Permission Granted Making a Scale Drawing Introduction Objective Students will create a detailed scale drawing. Context Students have used tools to measure

More information

Describing, Exploring, and Comparing Data

Describing, Exploring, and Comparing Data 24 Chapter 2. Describing, Exploring, and Comparing Data Chapter 2. Describing, Exploring, and Comparing Data There are many tools used in Statistics to visualize, summarize, and describe data. This chapter

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

1 Shapes of Cubic Functions

1 Shapes of Cubic Functions MA 1165 - Lecture 05 1 1/26/09 1 Shapes of Cubic Functions A cubic function (a.k.a. a third-degree polynomial function) is one that can be written in the form f(x) = ax 3 + bx 2 + cx + d. (1) Quadratic

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median CONDENSED LESSON 2.1 Box Plots In this lesson you will create and interpret box plots for sets of data use the interquartile range (IQR) to identify potential outliers and graph them on a modified box

More information

Updates to Graphing with Excel

Updates to Graphing with Excel Updates to Graphing with Excel NCC has recently upgraded to a new version of the Microsoft Office suite of programs. As such, many of the directions in the Biology Student Handbook for how to graph with

More information

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Display and Summarize Correlation for Direction and Strength Properties of Correlation Regression Line Cengage

More information

MATH2210 Notebook 1 Fall Semester 2016/2017. 1 MATH2210 Notebook 1 3. 1.1 Solving Systems of Linear Equations... 3

MATH2210 Notebook 1 Fall Semester 2016/2017. 1 MATH2210 Notebook 1 3. 1.1 Solving Systems of Linear Equations... 3 MATH0 Notebook Fall Semester 06/07 prepared by Professor Jenny Baglivo c Copyright 009 07 by Jenny A. Baglivo. All Rights Reserved. Contents MATH0 Notebook 3. Solving Systems of Linear Equations........................

More information

How do you compare numbers? On a number line, larger numbers are to the right and smaller numbers are to the left.

How do you compare numbers? On a number line, larger numbers are to the right and smaller numbers are to the left. The verbal answers to all of the following questions should be memorized before completion of pre-algebra. Answers that are not memorized will hinder your ability to succeed in algebra 1. Number Basics

More information

MULTIPLE-OBJECTIVE DECISION MAKING TECHNIQUE Analytical Hierarchy Process

MULTIPLE-OBJECTIVE DECISION MAKING TECHNIQUE Analytical Hierarchy Process MULTIPLE-OBJECTIVE DECISION MAKING TECHNIQUE Analytical Hierarchy Process Business Intelligence and Decision Making Professor Jason Chen The analytical hierarchy process (AHP) is a systematic procedure

More information

Measures of Central Tendency and Variability: Summarizing your Data for Others

Measures of Central Tendency and Variability: Summarizing your Data for Others Measures of Central Tendency and Variability: Summarizing your Data for Others 1 I. Measures of Central Tendency: -Allow us to summarize an entire data set with a single value (the midpoint). 1. Mode :

More information

If you know exactly how you want your business forms to look and don t mind

If you know exactly how you want your business forms to look and don t mind appendix e Advanced Form Customization If you know exactly how you want your business forms to look and don t mind detail work, you can configure QuickBooks forms however you want. With QuickBooks Layout

More information

Structural Axial, Shear and Bending Moments

Structural Axial, Shear and Bending Moments Structural Axial, Shear and Bending Moments Positive Internal Forces Acting Recall from mechanics of materials that the internal forces P (generic axial), V (shear) and M (moment) represent resultants

More information

Unit 1 Number Sense. In this unit, students will study repeating decimals, percents, fractions, decimals, and proportions.

Unit 1 Number Sense. In this unit, students will study repeating decimals, percents, fractions, decimals, and proportions. Unit 1 Number Sense In this unit, students will study repeating decimals, percents, fractions, decimals, and proportions. BLM Three Types of Percent Problems (p L-34) is a summary BLM for the material

More information

6 3 The Standard Normal Distribution

6 3 The Standard Normal Distribution 290 Chapter 6 The Normal Distribution Figure 6 5 Areas Under a Normal Distribution Curve 34.13% 34.13% 2.28% 13.59% 13.59% 2.28% 3 2 1 + 1 + 2 + 3 About 68% About 95% About 99.7% 6 3 The Distribution Since

More information

Summarizing and Displaying Categorical Data

Summarizing and Displaying Categorical Data Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency

More information

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables Scatterplot; Roles of Variables 3 Features of Relationship Correlation Regression Definition Scatterplot displays relationship

More information

STA 4107/5107. Chapter 3

STA 4107/5107. Chapter 3 STA 4107/5107 Chapter 3 Factor Analysis 1 Key Terms Please review and learn these terms. 2 What is Factor Analysis? Factor analysis is an interdependence technique (see chapter 1) that primarily uses metric

More information