2. DATA AND EXERCISES (Geos2911 students please read page 8)

2.1 Data set

The data set available to you is an Excel spreadsheet file called cyclones.xls. The file consists of 3 sheets. Only the third is relevant to this week's practical.

Sheet 3
Column 1: cyclone season.
Column 2: cyclone identification number.
Column 3: ocean basin in which the cyclone was generated.
Column 4: central pressure of the cyclone in hPa.

These data represent the total population of cyclones generated in the South Pacific Ocean (SPO) and South Indian Ocean (SIO).

Note also:

1. The important aspect of this analysis is the intensity of each cyclone generated in Australian waters, particularly the numbers of the most intense (Category 4 or greater) cyclones. While it would also be useful to know their tracks to determine whether they crossed the coastline, such data are only available for cyclones back to 1980 (i.e. the data in Sheets 1 and 2). This is too short a time period for the low-frequency, large-magnitude events that we are interested in today, so we will investigate a longer record of cyclone intensity that extends back to 1907, and accept the shortcoming that we don't know whether the cyclones crossed the coastline or not.

2. Cyclone categories by central pressure:
   Category 1: 986-995 hPa
   Category 2: 971-985 hPa
   Category 3: 956-970 hPa
   Category 4: 931-955 hPa
   Category 5: <931 hPa

3. The lower the central pressure, the more intense the cyclone.

4. Category 4 and 5 cyclones cause extensive damage and lead to major insured losses.

2.2 Exercises

1. Highlighting all of the columns with information in them (from Row 3 down), sort the data set according to ocean basin and then cut and paste the data so that you have a set of 4 columns for each basin next to each other.

2. Use the Tools > Data Analysis > Histogram facility to produce a frequency histogram of the population of central pressures for cyclones generated in the South Indian Ocean. If you cannot find the histogram facility, use the Help menu and look for the FREQUENCY function. Produce a separate frequency histogram for cyclones generated in the South Pacific Ocean. Use a bin range of 900 to 1000 hPa with bin intervals of 10. Annotate the charts with appropriate axis labels and titles. Look at your plotted distributions: do the data appear normally distributed?

3. Calculate the mean of the central pressures for the cyclones generated in each ocean. This can be achieved using the AVERAGE function. Which ocean basin on average generates the most intense cyclones?

4. This question is intended to assess whether your answer in Step 3 above is statistically significant. Insert a new worksheet into your Excel workbook (Sheet 4) and copy your data sets for each ocean basin from Sheet 3 into Sheet 4. Now you are going to take a random sample of cyclone pressures from each ocean basin. The sample size will be 30 each from the South Indian and South Pacific Oceans.

In a column next to the SIO data, create a column of 30 random numbers between 2 and 363, which is the range of row numbers in the SIO data set. Use the RANDBETWEEN function to do this. Once you have the random numbers, use the copy and paste special values facility to convert the cells from formulas to numbers, otherwise they will keep recalculating. Write down your list of random numbers for the South Indian Ocean on a sheet of paper. Then write next to each number on your sheet of paper the central pressure that corresponds to that row number. In the next column after your column of random numbers in Sheet 4, type in the corresponding central pressures. Repeat the exercise for the SPO data set, but collect 30 random numbers between 2 and 283. These are your random samples for each ocean basin.

We want to assess whether the average intensity of cyclones from the South Indian Ocean is statistically equal to that of cyclones from the South Pacific Ocean. In statistics, an observation is statistically significant if it is unlikely to have occurred by chance. This question can be answered via statistical tools such as the Student's t-test and the Mann-Whitney test.

Student's t-test for equivalence of means. Consider two samples x and y with sample sizes m and n, respectively. We are interested in the question: are the means of x and y the same or different (i.e. is $\bar{x} = \bar{y}$ or, alternatively, $\bar{x} > \bar{y}$)? In other words:

Ho (null hypothesis): mean of population x = mean of population y
H1 (alternate hypothesis): mean of population x > mean of population y

The test statistic is

$$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{m} + \frac{1}{n}}}$$

in which S is the pooled standard deviation of both samples (the square root of their pooled variance), with

$$S = \sqrt{\frac{(m-1)\,\sigma_x^2 + (n-1)\,\sigma_y^2}{m + n - 2}}$$

in which $\sigma_x^2$ and $\sigma_y^2$ are the sample variances of x and y respectively, with

$$\sigma_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{m} \quad \text{and} \quad \sigma_y^2 = \frac{\sum_i (y_i - \bar{y})^2}{n}$$

If the test statistic t is lower than the critical t given in the critical t distribution table (cf. appendix) for the degrees of freedom of the test (ν = m + n - 2), then the null hypothesis is retained (it cannot be rejected) at the given level of significance. The principal assumption of the Student's t-test is that the samples are drawn from populations that are normally distributed (i.e. characterized by data that cluster around the mean). The standard deviation σ expresses the dispersion of the $x_i$ about the mean.

Test the following hypothesis using a Student's t-test.
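Before running the test, the random sampling step of Exercise 4 can also be sketched outside Excel. The Python sketch below is only a minimal illustration, assuming the pressures are held in a dictionary keyed by spreadsheet row number. The function name `sample_rows`, the `seed` argument and the placeholder pressures are hypothetical and are not part of cyclones.xls. Like RANDBETWEEN, `random.randint` includes both end points and may return the same row more than once.

```python
import random


def sample_rows(pressure_by_row, first_row, last_row, k=30, seed=None):
    """Pick k random row numbers (repeats allowed) and look up their pressures."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    rows = [rng.randint(first_row, last_row) for _ in range(k)]
    pressures = [pressure_by_row[row] for row in rows]
    return rows, pressures


# Placeholder values only (NOT real cyclone data): rows 2..363, as for the SIO data.
placeholder_sio = {row: 900 + (row * 7) % 100 for row in range(2, 364)}

sio_rows, sio_sample = sample_rows(placeholder_sio, 2, 363, k=30, seed=1)
```

Fixing the seed plays the same role as pasting the random numbers as values in Excel: the sample no longer changes each time the calculation is repeated.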

Null hypothesis: The mean of the central pressures of cyclones in the South Pacific Ocean is equal to the mean for the South Indian Ocean.
Alternate hypothesis: The mean of the central pressures of cyclones in the South Pacific Ocean is greater than the mean for the South Indian Ocean.

You will first need to calculate the t-statistic, and then compare it to the critical t for the appropriate degrees of freedom and level of confidence. For both the South Indian and South Pacific oceans:

1- Calculate the pressure average.
2- Calculate for each cyclone the square of the difference between its pressure and the pressure average: $(P - \mathrm{Average}[P])^2$.
3- Average all $(P - \mathrm{Average}[P])^2$; this is the variance of the pressure.
4- Calculate the pooled standard deviation (S) of both the South Indian and South Pacific oceans, i.e. the square root of the pooled variance:

$$S = \sqrt{\frac{(m-1)\,\sigma_x^2 + (n-1)\,\sigma_y^2}{m + n - 2}}$$

in which $\sigma_x^2$ and $\sigma_y^2$ are the averaged $(P - \mathrm{Average}[P])^2$ for the South Indian and South Pacific oceans respectively.
5- Calculate the test statistic

$$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\frac{1}{m} + \frac{1}{n}}}$$

in which m is the number of cyclones in the South Indian Ocean sample and n the number of cyclones in the South Pacific Ocean sample; $\bar{x}$ and $\bar{y}$ are the pressure averages for the South Indian and South Pacific oceans respectively.
6- Calculate the degrees of freedom (ν) of the test: ν = m + n - 2.

The mean of the central pressures of cyclones in the South Pacific Ocean is statistically equal to the mean for the South Indian Ocean when the calculated test statistic t is less than the critical t value given in the critical t distribution table. If this is not the case, then the alternate hypothesis cannot be ruled out. Note that, with $\bar{x}$ the South Indian mean and $\bar{y}$ the South Pacific mean, a larger South Pacific mean gives a negative t, so compare the magnitude of t with the critical value and check that the sign of the difference agrees with the alternate hypothesis.

Use the critical t distribution table and the degrees of freedom (ν) to determine the probability that the calculated test statistic t is less than the critical t value in the t distribution table. The level of confidence (in %) is given by (100 - α).
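Steps 1 to 6 can be checked against a short script. The following Python sketch is one possible reading of the procedure, not the prescribed method: it divides the summed squared deviations by the sample size, as step 3 describes (many references divide by the sample size minus one instead), and it treats S as the square root of the pooled variance, which is an assumption. The function names and the example numbers are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Average squared deviation from the mean (divides by the count, as in step 3)."""
    avg = mean(values)
    return sum((v - avg) ** 2 for v in values) / len(values)


def pooled_t(x, y):
    """Return (t, degrees_of_freedom) for samples x and y, following steps 1-6."""
    m, n = len(x), len(y)
    pooled_var = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    s = math.sqrt(pooled_var)  # assumption: S is the pooled standard deviation
    t = (mean(x) - mean(y)) / (s * math.sqrt(1 / m + 1 / n))
    return t, m + n - 2


# Illustrative numbers only, not taken from cyclones.xls.
t_value, dof = pooled_t([1, 2, 3], [2, 3, 4])
```

In the practical, x would be the 30 sampled SIO pressures and y the 30 sampled SPO pressures, giving 58 degrees of freedom.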
Based on your statistical test, complete the following sentence: We can be ___% confident that the mean of the central pressures of cyclones generated in the South Pacific Ocean (is or is not) significantly greater than the mean for the South Indian Ocean.

Are the assumptions of the Student's t-test satisfied (recall your answer to Exercise 2)? How reliable is your test?

5. Insert a new worksheet in your Excel workbook (Sheet 5) and copy your sample of cyclone central pressures for the South Indian Ocean. Place a column of labels, SIO, next to them. Do the same for the South Pacific Ocean central pressures, but place them directly beneath the SIO sample. Use the RANK function to rank the central pressures in ascending order. Perform a Mann-Whitney test to determine at 95% confidence (α = 5%) whether the central pressures in the South Pacific and South Indian Oceans are significantly different.

Mann-Whitney statistic for equivalence of medians. In statistics, the Mann-Whitney test assesses whether two samples of observations come from the same distribution. The Mann-Whitney test is useful in the same situations as the Student's t-test, and the question arises of which should be preferred. Consider two random samples x and y with sample sizes m (SIO) and n (SPO) respectively. We are interested in the question: are the medians of x and y the same or different? In other words:

Null hypothesis Ho: median of population x = median of population y
Alternate hypothesis H1: median of population x > median of population y

The test statistic t is calculated using:

$$t = mn + \frac{m(m+1)}{2} - \sum_{i=1}^{m} R(x_i)$$

where $R(x_i)$ are the ranks of sample x and m is the sample size of x. The sample size of y is n. The test statistic t can be understood as the number of times observations in one sample precede observations in the other sample in the ranking. Critical values of t for the Mann-Whitney test are listed in the appendix. For the hypothesis stated above the appropriate test is a one-tail test (a statistical test in which the critical region consists of all values that are less than a given value or greater than a given value, but not both). If the calculated test statistic t is less than the critical t, we reject the null hypothesis. If it is greater, we cannot reject the null hypothesis. Note that there are no assumptions concerning the distribution of the samples or populations for the Mann-Whitney test.

To perform the Mann-Whitney test on your samples, calculate the test statistic t with the formula above, in which $R(x_i)$ are the ranks of the individual SIO cyclones (sample x) and m is the number of SIO cyclones.
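The ranking and the Mann-Whitney statistic can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: both samples are ranked together in ascending order, and tied values receive the average of the ranks they span, whereas Excel's RANK function gives tied values the same rank without averaging. The function names and the example numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mann_whitney_t(x, y):
    """t = m*n + m*(m+1)/2 - sum of the ranks of sample x in the combined ranking."""
    m, n = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:m])  # the first m entries belong to sample x
    return m * n + m * (m + 1) / 2 - rank_sum_x


# Illustrative numbers only, not taken from cyclones.xls.
t_low_x = mann_whitney_t([1, 2, 3], [4, 5, 6])   # every x precedes every y
t_high_x = mann_whitney_t([4, 5, 6], [1, 2, 3])  # every x follows every y
```

With the SIO sample as x, low pressures in x produce small ranks and therefore a large t, so the two extreme cases above bracket the values the statistic can take for samples of this size.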
Based on your statistical test, complete the following sentence: We can be ___% confident that the mean of the central pressures of cyclones generated in the South Pacific Ocean (is or is not) significantly greater than the mean for the South Indian Ocean.

Does the result differ from your t-test? Which test is more reliable in this case and why? Have you changed your mind regarding your answer to Exercise 3?

6. Insert a new worksheet in your Excel workbook (Sheet 6) and copy your data sets for each ocean basin from Sheet 3 into Sheet 6. In Sheet 6, highlighting all of the columns with information in them, sort the data set for the South Indian Ocean in ascending order according to central cyclone pressure. In the next column, enter a tag from 5 through to 1 that indicates the cyclone category based on the central pressure (see note 2 in Section 2.1). Do the same for the South Pacific Ocean.
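The tagging step follows directly from the pressure ranges in note 2 of Section 2.1. As a hedged illustration, the Python sketch below maps a central pressure in hPa to its category. The function name is hypothetical, and returning `None` for pressures above 995 hPa, as well as assigning non-integer pressures that fall between two listed ranges to the weaker category, are assumptions made here.

```python
def cyclone_category(pressure_hpa):
    """Return the category (1-5) for a central pressure in hPa, or None above 995 hPa."""
    if pressure_hpa < 931:
        return 5
    if pressure_hpa <= 955:
        return 4
    if pressure_hpa <= 970:
        return 3
    if pressure_hpa <= 985:
        return 2
    if pressure_hpa <= 995:
        return 1
    return None  # weaker than Category 1 in the table of note 2


# Illustrative pressures only, not taken from cyclones.xls.
tags = [cyclone_category(p) for p in (925, 940, 960, 980, 990, 1000)]
```

Because the data are sorted in ascending order of pressure, the tags run from 5 down to 1, which matches the order requested in the exercise.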

Copy that part of the list of years that includes Category 5 and 4 cyclones in the South Indian Ocean to a new location in Sheet 6. Sort this sub-list of years into ascending order. Next to this list, create a new list, which contains the number of Category 4 or greater cyclones that occurred in each decade: 1907-16; 1917-26; 1927-36; ... 1997-06. Do the same for the South Pacific Ocean.

Determine the average rate at which Category 4 or greater cyclones occur in a decade for both the South Indian and South Pacific Oceans. Find the probability that the time between two successive Category 4 or greater cyclones is less than 1 year for the South Indian Ocean. Do the same for the South Pacific Ocean. Use the inferences from the exponential distribution, which assumes that the number of Category 4 or greater cyclones occurring in successive decades has a Poisson distribution.

Inferences from exponential distribution: If discrete events occur randomly and independently at the mean rate λ per time interval y (so that the number occurring in a time interval has a Poisson distribution with parameter λ), the intervals between events give rise to a relative frequency histogram conforming to an exponential distribution. The probability that the time X between two successive events is less than a given time period x can be evaluated by using the following result:

$$\Pr(X \le x) = 1 - \exp\left(-\frac{\lambda x}{y}\right)$$

where λ is the mean rate of occurrence per interval y. This result is based on several assumptions for a Poisson process:

1. The process is independent.
2. The probability of one occurrence in any time interval is approximately proportional to the size of the interval.
3. The process is stationary; i.e. the number of occurrences in a time interval has the same probability distribution for all time intervals. In other words, the value of λ should not have an increasing or decreasing trend with time.
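The decade counts and the exponential result can be combined in a few lines. The Python sketch below is a minimal illustration, assuming each cyclone season is recorded as a single integer year. The function names and the example years are hypothetical, and the years are not taken from cyclones.xls.

```python
import math


def decade_counts(years, start=1907, n_decades=10):
    """Count events per decade (1907-16, 1917-26, ...), keyed by each decade's first year."""
    counts = {start + 10 * d: 0 for d in range(n_decades)}
    for year in years:
        index = (year - start) // 10
        if 0 <= index < n_decades:  # ignore years outside the record
            counts[start + 10 * index] += 1
    return counts


def prob_interval_less_than(x, rate, interval=10.0):
    """Pr(X <= x) = 1 - exp(-rate * x / interval), with rate given per interval."""
    return 1.0 - math.exp(-rate * x / interval)


# Hypothetical years of intense cyclones, for illustration only.
example_years = [1907, 1916, 1917, 1999, 2006]
counts = decade_counts(example_years)
mean_rate = sum(counts.values()) / len(counts)         # events per decade
p_one_year = prob_interval_less_than(1.0, mean_rate)   # x = 1 year, y = 10 years
```

Restricting the calculation to recent decades only requires changing `start` and `n_decades`, so that the mean rate is averaged over the shorter record.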
Is the probability of two Category 4 or greater cyclones (which cause major insured losses, see note 4 in Section 2.1) occurring in one year relatively low (ca. <50%) or relatively high (ca. >50%) for the South Indian Ocean? For the South Pacific Ocean? Does the last assumption listed for a Poisson process (see Section 1) appear to be satisfied here?

Repeat the calculations to find the probability that the time between two successive Category 4 or greater cyclones is less than 1 year for the South Indian Ocean, based only on the past 3 decades of data. Do the same for the South Pacific Ocean, but based on the last 4 decades of data. How does this change your answer to the previous question? What might be making the record of cyclone activity unsteady (i.e. an increasing number of intense cyclones in recent years)? See the Science and Nature articles on WebCT.

REPORT (Geos-2911 only)

In addition to the indicated material from Prac 2, the graphs from Exercise 2 and results from Exercises 3 to 6 in this Prac 3 provide the basis for the following report, so make sure that you understand the concepts clearly and have produced the graphs correctly.

You are working as a geoscientist for an insurance company and you have been asked to prepare a report addressing whether households and businesses in Port Hedland and Cairns should be charged the same premium for insurance against losses due to cyclones. Use your knowledge of the components involved in assessing risk (recall the Introduction lecture), as well as the exercises you have completed in Pracs 2 and 3, to write this report.

Your report should have the following sections: Introduction, Data and Methods, Results, and Conclusion. The text should be no longer than 4 double-spaced pages (excluding figures and tables). The results section of your report should incorporate all of the indicated graphs and answers to questions in Pracs 2 and 3. Your conclusion must make an explicit recommendation one way or the other regarding whether premiums should differ between the two towns and, if so, which should be higher. Note that there is no absolute right or wrong answer here; it depends on how you view risk. Make sure you justify your conclusion.

NB: When you are writing your report, note that the occurrence of two Category 4 or greater cyclones crossing the coast in a year causes serious cash flow problems for insurance companies because of large successive payouts in a short period of time. Don't forget, however, that the analysis in this prac has been for all cyclones generated in the South Indian and South Pacific Oceans, and not all of these necessarily cross the coast.