Exploratory Data Analysis with One and Two Variables



Similar documents
Descriptive Statistics

Diagrams and Graphs of Statistical Data

Relationships Between Two Variables: Scatterplots and Correlation

Scatter Plots with Error Bars

Data exploration with Microsoft Excel: analysing more than one variable

Summarizing and Displaying Categorical Data

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Correlation and Regression

Exercise 1.12 (Pg )

TIPS FOR DOING STATISTICS IN EXCEL

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

MetroBoston DataCommon Training

Using SPSS, Chapter 2: Descriptive Statistics

COMMON CORE STATE STANDARDS FOR

MATH 4470/5470 EXPLORATORY DATA ANALYSIS ONLINE COURSE SYLLABUS

Data Exploration Data Visualization

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

3. Data Analysis, Statistics, and Probability

Module 2 Using the Database

RUTHERFORD HIGH SCHOOL Rutherford, New Jersey COURSE OUTLINE STATISTICS AND PROBABILITY

Lecture 1: Review and Exploratory Data Analysis (EDA)

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Getting started in Excel

Data Analysis Tools. Tools for Summarizing Data

Lab 11. Simulations. The Concept

Module 2 Basic Data Management, Graphs, and Log-Files

5 Correlation and Data Exploration

Data Visualization Handbook

What is the purpose of this document? What is in the document? How do I send Feedback?

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Work Force Inventory System Dynamic Lab

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Lecture 2. Summarizing the Sample


Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

MTH 140 Statistics Videos

ECONOMICS 351* -- Stata 10 Tutorial 2. Stata 10 Tutorial 2

General instructions for the content of all StatTools assignments and the use of StatTools:

Introduction to Exploratory Data Analysis

Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs

Exploratory data analysis (Chapter 2) Fall 2011

Tutorial for proteome data analysis using the Perseus software platform

AMS 7L LAB #2 Spring, Exploratory Data Analysis

Organizing Your Approach to a Data Analysis

A full analysis example Multiple correlations Partial correlations

How To Run Statistical Tests in Excel

GeoGebra Statistics and Probability

Data exploration with Microsoft Excel: univariate analysis

Introduction to RStudio

TI-Inspire manual 1. Instructions. Ti-Inspire for statistics. General Introduction

STC: Descriptive Statistics in Excel Running Descriptive and Correlational Analysis in Excel 2013

IBM SPSS Direct Marketing 23

The Comparisons. Grade Levels Comparisons. Focal PSSM K-8. Points PSSM CCSS 9-12 PSSM CCSS. Color Coding Legend. Not Identified in the Grade Band

Intermediate PowerPoint

Exploring Relationships between Highest Level of Education and Income using Corel Quattro Pro

Introduction Course in SPSS - Evening 1

Projects Involving Statistics (& SPSS)

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Coins, Presidents, and Justices: Normal Distributions and z-scores

Data Mining and Visualization

Vertical Alignment Colorado Academic Standards 6 th - 7 th - 8 th

Data analysis and regression in Stata

CentralMass DataCommon

Problem of the Month Through the Grapevine

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Excel Tutorial. Bio 150B Excel Tutorial 1

Updates to Graphing with Excel

Describing, Exploring, and Comparing Data

Additional sources Compilation of sources:

Figure 1. An embedded chart on a worksheet.

Lecture 19: Chapter 8, Section 1 Sampling Distributions: Proportions

ZOINED RETAIL ANALYTICS. User Guide

Directions for using SPSS

Managing documents, files and folders

Fairfield Public Schools

IBM SPSS Direct Marketing 22

MULTIPLE REGRESSION EXAMPLE

Minitab Tutorials for Design and Analysis of Experiments. Table of Contents

MARS STUDENT IMAGING PROJECT

R Commander Tutorial

Lab 1: The metric system measurement of length and weight

STAT355 - Probability & Statistics

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

UCL Depthmap 7: Data Analysis

Lab 7. Exploratory Data Analysis

Data Preparation and Statistical Displays

Intro to Statistics 8 Curriculum

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Foundation of Quantitative Data Analysis

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Valor Christian High School Mrs. Bogar Biology Graphing Fun with a Paper Towel Lab

TI-Inspire manual 1. I n str uctions. Ti-Inspire for statistics. General Introduction

Transcription:

Exploratory Data Analysis with One and Two Variables Instructions for Lab #2 Statistics 111- Probability and Statistical Inference Lab Objective To explore data with histograms and scatter plots. Review A few highlights to review from the previous lab: 1. The Working Directory Itisalwaysimportanttoknowwherethefilesthatyouareusingaresavedonthecomputer. This is so both you and Stata can access the correct files. Let s see how this would work for this problem. In the next section is the link to the dataset for this problem set. Download the data and save/put it in the folder you are going to work in(e.g., C:\Users\John\Documents\Stats111\ lab2). NowyouneedtodirectStatatothisfolder,whichwillbethefolderyouwillbeworking from. Thereare twowaystoaccomplish this. (a) Use the menu bar and navigate to File -> Change WorkingDirectory. This will pull up awindowthat youcan thenusetofindthefolderwhereyousavedthedata. (b) Issue a command to change the working directory by typing cd followed by the filepath to your file,(e.g., cd C:\Users\John\Documents\Stats111\lab2). Typing pwd into the command line well tell you the current directory and can be used to verify thatyouare in therightplace. 2. Do-files and comments AllthecommandsyouenterintotheCommandLineforthelabcan(andshould)beputinto a do-file to allow replication and access at a later date. You should save a do-file in the folder where the data is. Let s go through another example. Download this sample do-file and save/put it in the same folder where you saved the data above. Now go back to Stata, make sure you are in the correct working directory, and run thisfile througheach of thefollowing threeways:

Stats 111 2 Exploratory Data Analysis (a) Type do example.do in the command line. (b) Navigate to File -> Do... and manually select the file. (c) Open up the do-file editor(ctrl+8; or Windows-> Do-file Edtior), open the do-file in theeditor,andexecutethejob (Tools-> Executeorshortcutkeyonthetoolbar). Notethewordsonlines7-8ofthis do-file. Theyaregreenandareprecededbyanasterisk (*). These are comments and are very useful in do-files. Stata ignores these lines when executing a do-file. Nowyoucan create acopyofthis do-file andbuild off it fortheproblem set. 3. Fonts Differentfontsgenerallymeanveryspecificthingsinthelabs. Thisistohelpyoumoreeasily distinguish Stata code from the rest of the text: Typewriter text refers to Stata commands. Italicized text, either in typewriter or in regular font, refer to variable names. Sometimesthesearespecificvariablesinadataset(Domestic)whileatotherstheyaresimply aplace holderfor any variable orname that youmay choose(cont_var). Redtextis for DataAnalysis Tips. Blue text are generally links to datasets, examples or places in the document. Lab Procedures Let s go tothe movies! What are the characteristics of U.S. movies that make the most money? Let s address this question with the data set movies2012.dta. It comprises data on the 250 top domestic grossing movies of all time as of November 2012. The variables are: Variable Description Ranking Ranking on Domestic gross sales Title Movie title Year Release year Domestic Domestic gross sales Foreign Foreign gross sales Worldwide Worldwide gross sales Budget Budget Rating MPAA rating Best_Picture Academy Awards Best Picture (nominated or won) Genre Main/first genre of the movie

Stats 111 3 Exploratory Data Analysis Variable All_Genres Director Description List of all genres the movie falls into Name of the director There are missing data in this file. We ll ignore them for simplicity. In general, when confronted with missing data, it is best to get the advice of a professional statistician before doing analyses. Data Analysis Tip: The unit of measurement for the monetary variables is not stated. That s bad practice. Always include a description of the units somewhere on the file. Based on knowledgeof movie revenues,itis clear thatthattheunit ofmeasurementis $1,000,000. Questions: 1. After reading in the data, describe the distributions of foreign and domestic grosses. That is, say where most values are, note any outliers, and say whether the distribution is tightly packed around its mean or is spread out. Also, report the mean and standard deviation. In addition to summarizeing the data, you can use histograms to get a visual representation of thedistributionofthedata. Thecommand is histogram varname, name(graph1, replace), where varname is the variable of interest. As with most commands, there are many options available with histogram. One of the most useful options is name(graph1, replace). This stores a version of the graph in temporary memory to be used later and it causes the graph to be displayed in a window titled graph1. This allow multiple graph windows to be open at the same time 1. You use the replace option within the perenthesesof the name option to overwrite the previous version of the graph. Further, if you want to compare multiple histograms in one window you can combine them by typing graph combine graph1 graph2, name(graph_all, replace), where again graph1, graph2, and graph_all are just examples for names of the graphs. Youcan also navigate tographics-histogram togetawizard tohelpwith thegraph. DataAnalysisTip: Thedefaulthistogramin Stataisatruehistogram,wheretheareasofthe binssumtoone. Oftenpeoplewantjusttheheightstosumtoone. Thisisaccomplishedwith the fraction option. Further, if you want the y-axis to simply count how many observations are in each bin, youcan usethe frequencyoption. 2. Which sentence best describes the distributions of domestic and foreign grosses? You can justwritetheletterofyourchoice onthelab report. 1 Thedefaultisforeachgraph, regardlessoftype,to overwritethe previousgraph

Stats 111 4 Exploratory Data Analysis (a) Domestic and foreign grosses are very similar. (b) Domestic and foreign grosses have similar distributional shapes, but foreign grosses tend to be larger than domestic grosses. (c) Domestic and foreign grosses have similar distributional shapes, but domestic grosses tend to be larger than foreign grosses. (d) The two distributions look nothing like each other, because one has a long left tail and theotherhas along righttail. 3. What are the names of the two movies that are the largest outliers on all three monetary variables? 4. We can examine the relationship between world-wide gross and movie genre using a box plot. Usethevariable Genrefor thisanalysis. Thecommand for aboxplot is graph box cont_var, name(graph1, replace) or graph box cont_var, name(graph1, replace) over(cat_var), where cont_var represents the continuous variable that you are trying to graph. The option over(cat_var) allows you to break the box plot down by different values of a categorical variable. Alternatively, you can navigate to Graphics-Box Plot, type the continuous variable in the Main tab and the categorical variable in the Categories tab. If you want to clean up the graph,youcan testoutsomeof theothertabsin thewizard. Answer the three questions below. (a) Out of Comedy and Animated movies, which one has a distribution of world-wide grosses that is most similar to the distribution of world-wide grosses for Action movies? Justify your choice in at most two sentences. (b) Compare the distributions for Drama movies and Adventure movies. Do they have reasonably similar medians? Is one more spread out than the other (if so, say which one)? (c) If you directed a movie and wanted to make lots of money worldwide, which type appearstogiveyouthebestchanceofdoingso? Baseyouranswerontheresultsofthe box plot. 5. Describe the relationship between domestic gross and foreign gross. To make a scatter plot, wewill tellstatathatwewant todoatwowaygraph asascatter: graph twoway scatter varname1 varname2. Alternatively, you can navigate to Graphics-Twoway graph(scatter, line, etc.), go to the Plots tab, and hitthecreate button. Choose Scatter andtheyandxvariables.

Stats 111 5 Exploratory Data Analysis Items to include in your description are the general trend of the relationship (e.g., positive and linear, negative and linear, some other pattern, no clear pattern) and whether there are any outliersorpointsthat donotfitthepattern. 6. Report the three pairwise correlations between Foreign, Domestic, and World-wide gross. Further, graph the raw distribution of the data for each pair. To find correlations in Stata, type correlate varname1 varname2... whichwillshowamatrixofthecorrelationsbetweenallthevariablesusedasinput(youcan use as many as you d like). Note that the diagonal is always 1. Make sure you know why that is. In addition, you can create a scatterplot matrix which will create scatter plots of all the different outcomes by typing graph matrix varname1 varname2... Alternatively, you can navigate to Graphics-Scatterbox Matrix. And again, you can use multiple variables. Do the correlations suggest strongly positive linear relationships, weakly positive linear relationships, no linear relationships, weakly negative linear relationships, or strongly negative linear relationships? 7. Why are the correlations between Domestic and Worldwide, and Foreign and Worldwide, stronger than than the correlation between Domestic and Foreign? The answer has to do with the definitions of the variables. 8. Outliers can have a strong effect on correlations. Let s check to see if excluding Avatar and Titanic changes the correlations substantially. To exclude Avatar and Titanic, let s again use the if functionality of Stata by typing if Title!= Avatar & Title!= Titanic at the end of the previous commands. Data AnalysisTip: Notethat!= isdefinedas notequalto and isthe OR operator. Now, re-calculate the correlations in (6). Did the correlations get stronger or weaker? Does the substance of your conclusions in (6) change very much when excluding Avatar and Titanic? Data Analysis Tip: It is not acceptable to exclude outliers from analyses unless you have a scientific reason to do so (e.g., a data entry error, or maybe the outlying unit is not part of your target population). Hiding outliers is fudging data to get results you want. That is dishonest and unethical. When you see outliers, do analyses with and without them. When the results do not change much, report the results based on the full data set, and tell your audience that the results were not sensitive to the outliers. When the results do change substantially, report both sets of analyses: one with and one without the outliers.

Stats 111 6 Exploratory Data Analysis This honestly informs people that your conclusions are not on very solid ground, because particular data points affect the results greatly. Feel free to explore relationships between sales and other characteristics of movies, like rating, best picture nominations/wins, and director.