Descriptive Statistics



Descriptive statistics consist of methods for organizing and summarizing data. They include the construction of graphs, charts and tables, as well as various descriptive measures such as the mean, standard deviation and percentiles.

Begin by starting a new STATA session. Recall from the previous tutorial that the first thing to do before starting your analysis is to open a log file. If you don't open a log file, no results will be saved from your STATA session. After finishing your analysis, the log file can be edited and printed.

To illustrate the commands covered in this tutorial, we will use a data set that can be found on the course webpage. In the Command window type:

    use http://www.stat.columbia.edu/~martin/w1111/data/auto

This command reads in the data set auto, which is one of the many data sets available for download from the STATA web page. It contains information, from 1978, on a number of variables describing 74 different car models. To learn more about the data set, type the command

    describe

in the Command window. This provides a non-statistical description of the data, which appears in the Results window.
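The start-of-session steps can be collected into a few lines. This is only a sketch: it assumes that the copy of the auto data shipped with Stata (loaded with sysuse) matches the course copy, and session1.log is a hypothetical file name.

```stata
* Sketch: open a log, load the data, and inspect it.
* Assumption: sysuse auto loads the same data as the course URL.
* session1.log is a hypothetical log file name.
clear
log using session1.log, text replace
sysuse auto
describe
* _N is the number of observations, c(k) the number of variables
display "observations: " _N "   variables: " c(k)
log close
```

The replace option overwrites an existing log file of the same name, so it is worth dropping if earlier logs should be kept.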

According to this output, the data set consists of 74 observations on 12 variables (make, price, mpg, rep78, headroom, trunk, weight, length, turn, displacement, gear_ratio and foreign).

Now, suppose we are interested in looking at the complete data set. To do this, type

    list

in the Command window. This produces a fairly lengthy amount of output. Suppose we are only really interested in the variable mpg. To list only this variable, type

    list mpg

in the Command window. You should now see the values of the variable mpg for the 74 observations.

The command summarize provides a simple numerical summary of all the variables in the data set. For each variable it reports the number of observations, the mean, the standard deviation, and the minimum and maximum values. Type

    summarize

in the Command window to display this summary.
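Besides printing a table, summarize leaves its results in memory, where they can be reused in later commands. The sketch below lists only the first five cars and then displays the stored results for mpg. It assumes the auto data shipped with Stata (sysuse auto) stands in for the course copy.

```stata
* Sketch: list a few rows and reuse the results stored by summarize.
* Assumption: sysuse auto loads the same data as the course URL.
sysuse auto, clear
list make mpg in 1/5
quietly summarize mpg
display "n = " r(N) ", mean = " %6.3f r(mean) ", sd = " %6.3f r(sd)
display "range = " r(min) " to " r(max)
```

Typing return list after summarize shows every stored result that is available.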

If we are only interested in the summary statistics for the variables mpg and weight, type

    summarize mpg weight

in the Command window. Including the option detail produces additional statistics, including the median and various percentiles. The command

    summarize, detail

gives this information for all the variables. If we are only interested in this information for the variables mpg and weight, type

    summarize mpg weight, detail

You may have noticed that the variable foreign is a categorical variable. A car is given the value domestic if it is made in the USA and the value foreign if it is made abroad. The variable is stored numerically as 0 (domestic) and 1 (foreign), with the words attached as value labels. Suppose we want to study the mean of the variable mpg separately depending on whether or not the car is made in America. Including if in a command restricts the command to a subset of the data. The commands below give the summary statistics on mpg separately for domestic and foreign cars. The condition foreign == 0 selects the cars that are domestic (not foreign), while the condition foreign == 1 selects the cars that are foreign.

    summarize mpg if foreign == 0
    summarize mpg if foreign == 1
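The two if commands can also be written as a single by-group command. The sketch below shows two alternatives that are not part of the original tutorial: the bysort prefix and the tabstat command. It assumes the auto data shipped with Stata (sysuse auto) stands in for the course copy.

```stata
* Sketch: group-wise summaries without repeating the if condition.
* Assumption: sysuse auto loads the same data as the course URL.
sysuse auto, clear
* run summarize once for each value of foreign
bysort foreign: summarize mpg
* compact table of chosen statistics, one row per group
tabstat mpg, by(foreign) statistics(n mean sd median)
```

The if form remains useful when only one subset is of interest or when the condition involves more than one variable.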

Correlation

Suppose we are interested in calculating the correlation between two variables in the data set. To find the correlation between mpg and weight, type the command:

    correlate mpg weight

The output shows that the correlation between mpg and weight is -0.8072. It also shows that the correlation between mpg and mpg is equal to 1, as is the correlation between weight and weight. Neither of these results should be particularly surprising.

Graphics

Graphics can be made either by using the drop-down menu Graphics in the STATA GUI or by using the command line. If we decide to use the command line, we need to type a command of the form

    graphtype varname(s)

The type of graph is specified by graphtype. There are several types of graphs, including histograms, box plots and scatter plots. We discuss how to make each of these below.

Box plots

We can make a box plot of the variable mpg using the command

    graph box mpg

This creates the following graph in a separate graphics window:

[Figure: box plot of Mileage (mpg)]

Alternatively, we can use the drop-down menu Graphics and choose Easy graphs and thereafter Box plot; see Figure 1 for an illustration. A window will appear that asks you to state the name of the variable(s) you want to plot; see Figure 2. Type in the variable name mpg and click OK. A box plot equivalent to the one shown above will be created.

Figure 1 Figure 2
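Box plots are often most useful side by side. As a sketch that goes slightly beyond the tutorial, the over() option draws one box per category, here for domestic and foreign cars. It assumes the auto data shipped with Stata (sysuse auto) stands in for the course copy, and box_by_origin is a hypothetical graph name.

```stata
* Sketch: one box plot of mpg for each value of foreign.
* Assumption: sysuse auto loads the same data as the course URL.
* box_by_origin is a hypothetical name for the graph in memory.
sysuse auto, clear
graph box mpg, over(foreign) ytitle("Mileage (mpg)") name(box_by_origin, replace)
```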

Histograms

To make a histogram of the variable mpg, type

    histogram mpg

in the Command window. The STATA Graph window will appear with a histogram of the variable mpg, drawn on the density scale so that the bar areas sum to 1.

[Figure: histogram of Mileage (mpg), vertical axis Density]

You can specify the number of bins for your histogram. Try typing

    histogram mpg, bin(10)

This command produces a histogram with 10 bins. Experiment with several values and notice how the shape of the histogram is affected by the choice of the number of bins.

[Figure: histogram of Mileage (mpg) with 10 bins, vertical axis Density]

Alternatively, we can use the drop-down menu Graphics and choose Easy graphs and thereafter Histogram.
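The vertical scale and the bin layout can both be changed with options. The sketch below shows two variations: counts instead of densities, and bins defined by width rather than by number. The particular width and start values are arbitrary choices for illustration, and the example assumes the auto data shipped with Stata (sysuse auto) stands in for the course copy.

```stata
* Sketch: alternative scales and bin definitions for a histogram.
* Assumption: sysuse auto loads the same data as the course URL.
sysuse auto, clear
* 10 bins, with counts on the vertical axis instead of densities
histogram mpg, bin(10) frequency
* bins of width 5 starting at 10, with percentages on the vertical axis
histogram mpg, width(5) start(10) percent
```

The options bin() and width() are alternatives, so only one of them can be given in a single command.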

Scatter plots

We use scatter plots for assessing the relationship between two variables. To make a scatter plot in STATA, type

    scatter mpg weight

in the Command window. This produces a scatter plot with weight on the x-axis (horizontal) and mpg on the y-axis (vertical).

[Figure: scatter plot of Mileage (mpg) against Weight (lbs.)]

Alternatively, we can use the drop-down menu Graphics and choose Easy graphs and thereafter Scatter plot.

Saving a graph

You can save a graph by right-clicking on it and choosing the option Save graph. Another option is to go to the drop-down menu File in the upper left-hand corner of the STATA window and choose Save graph. It is important to note that your graphs are not automatically saved to your log file, so you will have to save them separately.
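Graphs can also be saved from the command line, which is convenient when the commands are kept in a log. The sketch below uses graph save for Stata's own format and graph export for an image file. The file names are hypothetical, the set of export formats available can depend on the platform and Stata version, and the auto data shipped with Stata (sysuse auto) is assumed to stand in for the course copy.

```stata
* Sketch: save the current graph from the command line.
* Assumption: sysuse auto loads the same data as the course URL.
* mpg_weight.gph and mpg_weight.png are hypothetical file names.
sysuse auto, clear
scatter mpg weight
* Stata's own graph format, which can be reopened and edited later
graph save mpg_weight.gph, replace
* image file for inclusion in a report
graph export mpg_weight.png, replace
```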

Linear Regression

To create a scatter plot of mpg and weight, type

    scatter mpg weight

Studying the plot, there appears to be a relatively strong negative association between the variables, so we decide to find the least-squares regression line. We can do this by typing the following two commands:

    regress mpg weight
    predict yhat, xb

These commands generate a variable yhat that contains the predicted values corresponding to the data. The first command gives rise to a fair amount of output, which we will discuss in more detail later in the course. For now it is enough to look at three values in the output: the coefficient on _cons (the intercept a), the coefficient on weight (the slope b), and R-squared ($r^2$). Here

    $a = 39.44$, $b = -0.006$, $r^2 = 0.6515$

Hence, the least-squares regression line is given by

    $\widehat{\text{mpg}} = 39.44 - 0.006 \cdot \text{weight}$

The fraction of the variability in mpg that is explained by the least-squares line of mpg on weight is equal to 0.6515.
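The three values can also be read from the results that regress stores, which avoids copying them from the output by hand. The sketch below displays them and checks a standard fact for a regression with one predictor: $r^2$ equals the square of the correlation between the two variables. It assumes the auto data shipped with Stata (sysuse auto) stands in for the course copy.

```stata
* Sketch: read a, b and r^2 from the stored regression results.
* Assumption: sysuse auto loads the same data as the course URL.
sysuse auto, clear
quietly regress mpg weight
display "a   = " %8.4f _b[_cons]
display "b   = " %8.4f _b[weight]
display "r^2 = " %6.4f e(r2)
* with a single predictor, r^2 is the squared correlation
quietly correlate mpg weight
display "squared correlation = " %6.4f r(rho)^2
```

The slope is negative, in agreement with the negative correlation of -0.8072 between mpg and weight.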

To plot the regression line together with the data, type:

    scatter mpg weight || line yhat weight

The || separator tells STATA to create a scatter plot of mpg against weight and superimpose the line given by yhat on the same graph.

[Figure: scatter plot of Mileage (mpg) against Weight (lbs.) with the fitted line; legend: Mileage (mpg), Linear prediction]

We will discuss regression in more depth later in the course.
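An alternative that is not used in the tutorial is the lfit plot type, which fits and draws the least-squares line in one step, so no yhat variable has to be created first. This is a sketch assuming the auto data shipped with Stata (sysuse auto) stands in for the course copy.

```stata
* Sketch: overlay the least-squares line without creating yhat.
* Assumption: sysuse auto loads the same data as the course URL.
sysuse auto, clear
twoway (scatter mpg weight) (lfit mpg weight), ytitle("Mileage (mpg)")
```

The regress and predict route is still needed when the predicted values themselves are wanted as a variable.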

A guided example

(a) Start STATA. If you are continuing a previous session, it is a good idea to clear all the variables. You can do this by writing

    clear

in the Command window.

(b) Create a log file named Assignment2.log. You can do this by writing the following in the Command window:

    log using a:/assignment2.log

(c) In this example we will use the same data set as described above. In the Command window write

    use http://www.stat.columbia.edu/~martin/w1111/data/auto

(d) Calculate the mean price of the automobiles in the data set. You can do this by writing

    summarize price

What was the mean price of automobiles in 1978?

(e) Calculate the median price of the automobiles in the data set. You can do this by writing

    summarize price, detail

What was the median price of automobiles in 1978? What does the difference between the mean and median price indicate about the shape of the distribution of price?

(f) Make a histogram of the price of cars. You can do this by using the following command:

    histogram price

What shape does the histogram take? Is it symmetric? Skewed? Does the shape of the histogram coincide with what you guessed in (e)? Save the histogram.

(g) Make a scatter plot of the variables weight and length. You can use the command

    scatter weight length

Study the plot. Does there appear to be any association between the variables? Save the scatter plot.

(h) What is the correlation between weight and length? Use the command

    correlate weight length

(i) Fit a regression line for predicting the weight of a car using the length. Use the commands

    regress weight length
    predict yhat, xb

What is the least-squares regression line? What fraction of the variability in weight is explained by the least-squares regression on length? Plot the regression line together with the scatter plot of weight and length. To do this, type:

    scatter weight length || line yhat length

Save the plot.

(j) Close the log file using the command

    log close

Edit and print your log file and graphs. Hand them in together with your answers to the questions above.