2. DATA AND EXERCISES (Geos2911 students please read page 8)

2.1 Data set

The data set available to you is an Excel spreadsheet file called cyclones.xls. The file consists of 3 sheets. Only the third is relevant to this week's practical.

Sheet 3
Column 1: cyclone season.
Column 2: cyclone identification number.
Column 3: ocean basin in which the cyclone was generated.
Column 4: central pressure of the cyclone in hPa.

These data represent the total population of cyclones generated in the South Pacific Ocean (SPO) and South Indian Ocean (SIO). Note also:

1. The important aspect of this analysis is the intensity of each cyclone generated in Australian waters, particularly the numbers of the most intense (Category 4 or greater) cyclones. While it would also be useful to know their tracks to determine whether they crossed the coastline, such data are only available for cyclones back to 1980 (i.e. the data in Sheets 1 and 2). This is too short a time period for the low-frequency, large-magnitude events that we are interested in today. We will therefore investigate a longer record of cyclone intensity that extends back to 1907, and accept the shortcoming that we don't know whether the cyclones crossed the coastline or not.

2. Cyclone categories are defined by central pressure:
Category 1: central pressures of 986-995 hPa
Category 2: central pressures of 971-985 hPa
Category 3: central pressures of 956-970 hPa
Category 4: central pressures of 931-955 hPa
Category 5: central pressures of <931 hPa

3. The lower the central pressure, the more intense the cyclone.

4. Category 4 and 5 cyclones cause extensive damage and lead to major insured losses.

2.2 Exercises

1. Highlighting all of the columns with information in them (from Row 3 down), sort the data set according to ocean basin, and then cut and paste the data so that you have a set of 4 columns for each basin next to each other.

2.
Use the Tools > Data Analysis > Histogram facility to produce a frequency histogram of the population of central pressures for cyclones generated in the South Indian Ocean. If you cannot find the histogram facility, use the Help menu and look for the FREQUENCY function. Produce a separate frequency histogram for cyclones generated in the South Pacific Ocean. Use a bin range of 900 to 1000 hPa with bin intervals of 10 hPa. Annotate the charts with appropriate axis labels and titles. Look at your plotted distributions: do the data appear normally distributed?

3. Calculate the mean of the central pressures for the cyclones generated in each ocean. This can be achieved using the AVERAGE function. Which ocean basin, on average, generates the most intense cyclones?

4. This question is intended to assess whether your answer in Step 3 above is
statistically significant. Insert a new worksheet into your Excel workbook (Sheet 4) and copy your data sets for each ocean basin from Sheet 3 into Sheet 4. Now you are going to take a random sample of cyclone pressures from each ocean basin. The sample size will be 30 each from the South Indian and South Pacific Oceans.

In a column next to the SIO data, create a column of 30 random numbers between 2 and 363, which is the range of row numbers in the SIO data set. Use the RANDBETWEEN function to do this. Once you have the random numbers, use the copy and paste special (values) facility to convert the cells from formulas to numbers, otherwise they will keep recalculating. Write down your list of random numbers for the South Indian Ocean on a sheet of paper. Then write next to each number on your sheet of paper the central pressure that corresponds to that row number. In the next column after your column of random numbers in Sheet 4, type in the corresponding central pressures. Repeat the exercise for the SPO data set, but collect 30 random numbers between 2 and 283. These are your random samples for each ocean basin.

We want to assess whether the average intensity of cyclones from the South Indian Ocean is statistically equal to that of cyclones from the South Pacific Ocean. In statistics, an observation is statistically significant if it is unlikely to have occurred by chance. This question can be answered via statistical tools such as the Student's t-test and the Mann-Whitney test.

Student's t-test for equivalence of means.

Consider two samples x and y with sample sizes m and n, respectively. We are interested in the question: are the means of x and y the same or different (i.e. is $\bar{x} = \bar{y}$ or alternatively $\bar{x} > \bar{y}$)? In other words:

Ho (null hypothesis): mean of population x = mean of population y
H1 (alternate hypothesis): mean of population x > mean of population y

The test statistic is

$$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

in which S is the pooled standard deviation of both samples (the square root of the pooled variance), with

$$S = \sqrt{\frac{(m-1)\,\sigma_x^2 + (n-1)\,\sigma_y^2}{m + n - 2}}$$

in which $\sigma_x^2$ and $\sigma_y^2$ are the sample variances of x and y respectively, with

$$\sigma_x^2 = \frac{\sum (x_i - \bar{x})^2}{m} \qquad \text{and} \qquad \sigma_y^2 = \frac{\sum (y_i - \bar{y})^2}{n}$$

If the test statistic t is lower than the critical t given in the critical t distribution table (cf. appendix) for the degrees of freedom of the test (ν = m + n - 2), then the null hypothesis cannot be rejected at the given level of significance of the test. The principal assumption of the Student's t-test is that the samples are drawn from populations that are normally distributed (i.e. characterized by data that cluster around the mean). The standard deviation σ expresses the dispersion of $x_i$ about the mean.

Test the following hypothesis using a Student's t-test.
Null hypothesis: The mean of the central pressures of cyclones in the South Pacific Ocean is equal to the mean for the South Indian Ocean.
Alternate hypothesis: The mean of the central pressures of cyclones in the South Pacific Ocean is greater than the mean for the South Indian Ocean.

You will first need to calculate the t-statistic, and then compare it to the critical t for the appropriate degrees of freedom and level of confidence. For both the South Indian and South Pacific Oceans:

1- Calculate the pressure average.
2- Calculate for each cyclone the square of the difference between its pressure and the pressure average: $(P - \text{Average}[P])^2$.
3- Average all $(P - \text{Average}[P])^2$; this is the variance of the pressure.
4- Calculate the pooled standard deviation (S) of both the South Indian and South Pacific Oceans, i.e. the square root of the pooled variance:
$$S = \sqrt{\frac{(m-1)\,\sigma_x^2 + (n-1)\,\sigma_y^2}{m + n - 2}}$$
in which $\sigma_x^2$ and $\sigma_y^2$ are the averaged $(P - \text{Average}[P])^2$ for the South Indian and South Pacific Oceans respectively.
5- Calculate the test statistic
$$t = \frac{\bar{x} - \bar{y}}{S \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$
in which m is the number of cyclones in the South Indian Ocean sample and n the number of cyclones in the South Pacific Ocean sample; $\bar{x}$ and $\bar{y}$ are the pressure averages for the South Indian and South Pacific Oceans respectively.
6- Calculate the degrees of freedom (ν) of the test: m + n - 2.

The mean of the central pressures of cyclones in the South Pacific Ocean is statistically equal to the mean for the South Indian Ocean when the calculated test statistic t is less than the critical t value given in the critical t distribution table. If this is not the case, then the alternate hypothesis cannot be ruled out. Use the critical t distribution table and the degrees of freedom (ν) to determine the probability that the calculated test statistic t is less than the critical t value in the t distribution table. The level of confidence (in %) is given by (100-α).
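If you would like to check your spreadsheet arithmetic outside Excel, steps 1 to 6 can be sketched in Python as below. This is only a minimal sketch and not part of the prac: the function names (`mean`, `variance`, `pooled_t`) and the two five-value samples are hypothetical, made up for illustration, and are not taken from cyclones.xls. The sketch assumes, as in step 3, that the variance is the plain average of the squared deviations, and that S is the square root of the pooled variance.

```python
import math


def mean(values):
    """Step 1: arithmetic average of a list of pressures."""
    return sum(values) / len(values)


def variance(values):
    """Steps 2-3: average of the squared deviations from the mean.

    Assumption: the divisor is the sample size, as in step 3 of the handout.
    """
    avg = mean(values)
    return sum((v - avg) ** 2 for v in values) / len(values)


def pooled_t(x, y):
    """Steps 4-6: return (t, degrees of freedom) for samples x and y."""
    m, n = len(x), len(y)
    # Step 4: pooled standard deviation S (square root of the pooled variance).
    s = math.sqrt(((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2))
    # Step 5: test statistic.
    t = (mean(x) - mean(y)) / (s * math.sqrt(1 / m + 1 / n))
    # Step 6: degrees of freedom.
    return t, m + n - 2


# Hypothetical samples in hPa (illustration only, not real cyclone data).
sio_sample = [960, 975, 990, 985, 970]
spo_sample = [980, 985, 995, 990, 975]

t_stat, dof = pooled_t(sio_sample, spo_sample)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The value of t printed here still has to be compared with the critical t from the table for the same degrees of freedom; the sketch does not look up critical values.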
Based on your statistical test, complete the following sentence:

We can be ___% confident that the mean of the central pressures of cyclones generated in the South Pacific Ocean (is or is not) significantly greater than the mean for the South Indian Ocean.

Are the assumptions of the Student's t-test satisfied (recall your answer to Exercise 2)? How reliable is your test?

5. Insert a new worksheet in your Excel workbook (Sheet 5) and copy your sample of cyclone central pressures for the South Indian Ocean. Place a column of labels, SIO, next to them. Do the same for the South Pacific Ocean central pressures, but place them directly beneath the SIO sample. Use the RANK function to rank the central pressures in ascending order. Perform a Mann-Whitney test to determine at 95% confidence (α=5%) if the central pressures in the South Pacific and South Indian Oceans are significantly different. For this consider two random samples x and y with sample size m (SIO)
and n (SPO) respectively. We are interested in the question: are the medians of x and y the same or different? In other words:

Null hypothesis Ho: median of population x = median of population y
Alternate hypothesis H1: median of population x > median of population y

Mann-Whitney statistic for equivalence of medians.

In statistics, the Mann-Whitney test assesses whether two samples of observations come from the same distribution. The Mann-Whitney test is useful in the same situations as the Student's t-test, and the question arises of which should be preferred.

The test statistic t is calculated using:

$$t = mn + \frac{m(m+1)}{2} - \sum_{i=1}^{m} R(x_i)$$

where $R(x_i)$ are the ranks of sample x and m is the sample size of x. The sample size of y is n. The test statistic t can be understood as the number of times observations in one sample precede observations in the other sample in the ranking.

Critical values of t for the Mann-Whitney test are listed in the appendix. For the hypothesis stated above, the appropriate test is a one-tail test (a statistical test in which the critical region consists of all values that are less than a given value or greater than a given value, but not both). If the calculated test statistic t is less than the critical t, we reject the null hypothesis. If it is greater, we cannot reject the null hypothesis. Note that there are no assumptions concerning the distribution of the samples or populations for the Mann-Whitney test.

To perform the Mann-Whitney test on your samples, calculate the test statistic t from the formula above, in which $R(x_i)$ are the ranks of sample x ($x_i$: individual SIO cyclones), m is the number of SIO cyclones and n is the number of SPO cyclones.
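The ranking and the rank sum can also be sketched in Python, as a way of checking the spreadsheet result. This is a minimal sketch under stated assumptions, not the prescribed method: the function names (`average_ranks`, `mann_whitney_t`) and the sample values are hypothetical. The sketch assumes that tied pressures share the average of the ranks they occupy; note that Excel's RANK function instead gives tied values the same (lowest) rank, so results can differ slightly when there are ties.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = lowest pressure).

    Assumption: tied values all receive the mean of the positions they occupy.
    """
    positions = {}
    for pos, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def mann_whitney_t(x, y):
    """Return t = m*n + m*(m+1)/2 - (sum of the ranks of sample x).

    Ranks are computed on the combined sample, with x placed first.
    """
    m, n = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:m])  # the first m ranks belong to sample x
    return m * n + m * (m + 1) / 2 - rank_sum_x


# Hypothetical pressures in hPa (illustration only, not real cyclone data).
sio_sample = [960, 975, 990]
spo_sample = [980, 985, 995, 1000]

t_stat = mann_whitney_t(sio_sample, spo_sample)
print("Mann-Whitney t =", t_stat)
```

In this small example the statistic equals the number of (SIO, SPO) pairs in which the SIO pressure is the lower of the two, which matches the interpretation of t as a count of how often one sample precedes the other in the ranking.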
Based on your statistical test, complete the following sentence:

We can be ___% confident that the mean of the central pressures of cyclones generated in the South Pacific Ocean (is or is not) significantly greater than the mean for the South Indian Ocean.

Does the result differ from your t-test? Which test is more reliable in this case and why? Have you changed your mind regarding your answer to Exercise 3?

6. Insert a new worksheet in your Excel workbook (Sheet 6) and copy your data sets for each ocean basin from Sheet 3 into Sheet 6. In Sheet 6, highlighting all of the columns with information in them, sort the data set for the South Indian Ocean in ascending order according to central cyclone pressure. In the next column, enter a tag from 5 through to 1 that indicates the cyclone category based on the central pressures (see note 2 in Section 2.1). Do the same for the South Pacific Ocean.
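The category tags follow directly from the pressure ranges in note 2 of Section 2.1. As a cross-check on the tags you type in, the rule can be sketched in Python as below. The function name `cyclone_category` is hypothetical, and returning `None` for pressures above 995 hPa is an assumption of this sketch, since the note does not define a category for such values.

```python
def cyclone_category(pressure_hpa):
    """Return the cyclone category (5 to 1) for a central pressure in hPa.

    Ranges follow note 2 of Section 2.1. Pressures above 995 hPa have no
    category in that note, so this sketch returns None for them (assumption).
    """
    if pressure_hpa < 931:
        return 5
    if pressure_hpa <= 955:
        return 4
    if pressure_hpa <= 970:
        return 3
    if pressure_hpa <= 985:
        return 2
    if pressure_hpa <= 995:
        return 1
    return None


# Hypothetical pressures in hPa, sorted in ascending order as in Sheet 6.
pressures = [925, 931, 955, 956, 985, 995]
tags = [cyclone_category(p) for p in pressures]
print(list(zip(pressures, tags)))
```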
Copy that part of the list of years that includes Category 5 and 4 cyclones in the South Indian Ocean to a new location in Sheet 6. Sort this sub-list of years into ascending order. Next to this list, create a new list, which contains the number of Category 4 or greater cyclones that occurred in each decade: 1907-16; 1917-26; 1927-36; ... 1997-06. Do the same for the South Pacific Ocean.

Determine the average rate at which Category 4 or greater cyclones occur in a decade for both the South Indian and South Pacific Oceans. Find the probability that the time between two successive Category 4 or greater cyclones is less than 1 year for the South Indian Ocean. Do the same for the South Pacific Ocean. Use the inferences from the exponential distribution, which assumes that the number of Category 4 or greater cyclones occurring in successive decades has a Poisson distribution.

Inferences from exponential distribution:

If discrete events occur randomly and independently at the mean rate λ per time interval y (so that the number occurring in a time interval has a Poisson distribution with parameter λ), the intervals between events give rise to a relative frequency histogram conforming to an exponential distribution. The probability that the time X between two successive events is less than a given time period x can be evaluated by using the following result:

$$\Pr(X \le x) = 1 - \exp\left(-\frac{\lambda x}{y}\right)$$

where λ is the mean rate of occurrence per interval y. This result is based on several assumptions for a Poisson process:

1. The process is independent.
2. The probability of one occurrence in any time interval is approximately proportional to the size of the interval.
3. The process is stationary; i.e. the number of occurrences in a time interval has the same probability distribution for all time intervals. In other words, the value of λ should not have an increasing or decreasing trend with time.
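As an illustration of how the exponential result is applied, the calculation can be sketched in Python as below. This is a minimal sketch, not part of the prac: the function names (`mean_rate`, `prob_gap_less_than`) are hypothetical and the decade counts are made up, so do not use them in place of your own counts from Sheet 6. Here λ is the average count per decade, y is 10 years and x is 1 year.

```python
import math


def mean_rate(counts_per_interval):
    """Mean number of events per interval (lambda), e.g. per decade."""
    return sum(counts_per_interval) / len(counts_per_interval)


def prob_gap_less_than(x, rate, interval_length):
    """Pr(X <= x) = 1 - exp(-rate * x / interval_length).

    x and interval_length must be in the same time unit (years here).
    """
    return 1.0 - math.exp(-rate * x / interval_length)


# Hypothetical counts of Category 4+ cyclones in ten successive decades
# (illustration only, not taken from cyclones.xls).
decade_counts = [0, 1, 0, 2, 1, 3, 2, 4, 5, 7]

lam = mean_rate(decade_counts)            # events per decade
p = prob_gap_less_than(1.0, lam, 10.0)    # gap shorter than 1 year
print(f"lambda = {lam} per decade, Pr(X <= 1 year) = {p:.3f}")
```

To base the estimate on only the most recent decades, pass a shorter list (for example `decade_counts[-3:]`) to `mean_rate`; comparing the two values of λ is one simple way to see whether the stationarity assumption looks reasonable.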
Is the probability of two Category 4 or greater cyclones (which cause major insured losses, see note 4 in Section 2.1) occurring in one year relatively low (ca. <50%) or relatively high (ca. >50%) for the South Indian Ocean? For the South Pacific Ocean? Does the last assumption listed for a Poisson process (see Section 1) appear to be satisfied here?

Repeat the calculations to find the probability that the time between two successive Category 4 or greater cyclones is less than 1 year for the South Indian Ocean, based only on the past 3 decades of data. Do the same for the South Pacific Ocean, but based on the last 4 decades of data. How does this change your answer to the previous question? What might be making the record of cyclone activity unsteady (i.e. an increasing number of intense cyclones in recent years)? See the Science and Nature articles on WebCT.
REPORT (Geos-2911 only)

In addition to the indicated material from Prac 2, the graphs from Exercise 2 and results from Exercises 3 to 6 in this Prac 3 provide the basis for the following report, so make sure that you understand the concepts clearly and have produced the graphs correctly.

You are working as a geoscientist for an insurance company and you have been asked to prepare a report addressing whether households and businesses in Port Hedland and Cairns should be charged the same premium for insurance against losses due to cyclones. Use your knowledge of the components involved in assessing risk (recall the Introduction lecture), as well as the exercises you have completed in Pracs 2 and 3, to write this report.

Your report should have the following sections: Introduction, Data and Methods, Results, and Conclusion. The text should be no longer than 4 double-spaced pages (excluding figures and tables). The results section of your report should incorporate all of the indicated graphs and answers to questions in Pracs 2 and 3. Your conclusion must make an explicit recommendation one way or the other regarding whether premiums should differ between the two towns and, if so, which should be higher. Note that there is no absolute right or wrong answer here; it depends on how you view risk. Make sure you justify your conclusion.

NB: When you are writing your report, note that the occurrence of two Category 4 or greater cyclones crossing the coast in a year causes serious cash flow problems for insurance companies because of large successive payouts in a short period of time. Don't forget, however, that the analysis in this prac has been for all cyclones generated in the South Indian and South Pacific Oceans, and not all of these necessarily cross the coast.