Statistics Notes Revision in Maths Week 1
Section 1 Producing Data

1.1 Introduction

Statistics is the science that studies the collection and interpretation of numerical data. Statistics divides the study of data into 3 parts:

1. Producing data
2. Organising and describing data (Descriptive statistics)
3. Drawing conclusions from data (Statistical Inference)

[Diagram: Data, Analysis, Conclusions]

The type of data determines which methods of analysis are appropriate and valid. The major distinction is between quantitative (numeric) and qualitative (categorical) data.
1.2 Types of Data

Qualitative data: labels or names that are usually nonnumeric, e.g. gender, social class.

Quantitative data: numeric data, e.g. age, height. Quantitative data is either discrete or continuous.

Discrete: values change by whole numbers or steps, e.g. family size.

Continuous: can take all values in a given range, including decimal places, e.g. height.

1.3 Scales of Data Measurement

As well as categorising data into certain types, we can also categorise data by levels or scales of measurement. Why do we need to know the scale of measurement of our data? Not only do we use different methods of analysis for different types of data, but some methods of analysis also require the data to be measured at a certain level.

Qualitative data is usually measured at a nominal or ordinal scale of measurement.

Nominal: the data gives the person a label or tells us what category a person or object falls into, e.g. the colour of a car.

Ordinal: the data has all the properties of nominal data, but we also get more information since the order of the data is meaningful, e.g. do you think this module is: very easy / easy / challenging / very challenging? Whatever you answer is still just a word, i.e. nonnumeric, but it gives more information than the colour of a car. All students can be ranked depending on what they think of the module.

Quantitative data is usually measured at an interval or ratio scale of measurement.

Interval: the data is numeric and has order in the same way as ordinal data has order. We can also measure the difference between two observations, e.g. the difference between two temperatures of 10 degrees and 20 degrees is 10 degrees. We cannot measure the difference between two students where one thinks the module is easy and the other finds it challenging, i.e. we will not get a meaningful number.
Ratio: the data is numeric and has all the properties of interval data. The ratio of the data is also meaningful, e.g. we can say things like "I'm half her age" or "twice as tall". The data has a meaningful and unique zero point, which allows us to say that if you have zero age then you don't exist. Temperature measured in degrees Celsius or Fahrenheit does not have a unique and meaningful zero point. For example, if the temperature is 0 °C it does not mean that temperature does not exist. The zero point also depends on which temperature scale you are using.

Classify the data generated by the questions in the following questionnaire by data type.

Please answer all questions by ticking the relevant box or entering a number.
Q1. Are you: male / female?
Q2. Do you live in an: urban / rural area?
Q3. How many siblings do you have?
Q4. In a typical week, how many text messages do you send?
Q5. How much did you spend on your last haircut?
Q6. How long (in weeks) since your last haircut?
Q7. In a typical week, how often would you exercise? Never / Once or twice a week / Several times a week / Every day
Q8. Rate your interest in Irish politics on a scale of 0 to 10, where 0 represents no interest and 10 represents extremely interested.
Q9. Rate your interest in environmental issues on a scale of 0 to 10, where 0 represents no interest and 10 represents extremely interested.
Thank you for taking the time to fill in this questionnaire.
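The distinction between interval and ratio scales described above can be checked numerically. The sketch below is only an illustration: the function names and the example ages are assumptions, not part of the notes. It shows that the ratio of two Celsius temperatures changes when the same temperatures are expressed on another scale, while the ratio of two ages does not depend on the unit.

```python
# Illustrative sketch: ratios are only meaningful on a ratio scale.

def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # the two temperatures used in the notes

# Differences are meaningful on an interval scale.
diff_c = t2_c - t1_c                                                # 10 Celsius degrees
diff_f = celsius_to_fahrenheit(t2_c) - celsius_to_fahrenheit(t1_c)  # 18 Fahrenheit degrees,
                                                                    # the same physical gap

# Ratios are not: "twice as hot" depends entirely on the scale chosen.
ratio_c = t2_c / t1_c                                               # 2.0
ratio_f = celsius_to_fahrenheit(t2_c) / celsius_to_fahrenheit(t1_c) # about 1.36
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)         # about 1.035

# Age has a true zero, so the ratio survives a change of unit.
age_a_years, age_b_years = 20, 40                                   # hypothetical ages
ratio_years = age_b_years / age_a_years                             # 2.0
ratio_months = (age_b_years * 12) / (age_a_years * 12)              # 2.0

print(ratio_c, ratio_f, ratio_k)
print(ratio_years, ratio_months)
```

The Celsius ratio of 2 has no physical meaning because it disappears as soon as the scale changes, whereas "twice her age" holds in years, months or days.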
1.4 Sampling

We often read headlines in newspapers saying things like "40% of the population is satisfied with the government's performance". How can the newspaper make such a statement when they haven't asked everyone in the country their opinion of the government? The newspaper hasn't the time or money to ask everyone in the population, so it has taken a representative subset of the population and assumed that what happens for that subset is what happens for the whole population. Is this assumption valid? How do you select a representative subset? What mistakes can you make selecting this subset and what can be done to correct these mistakes? Without understanding the concepts behind selecting a subset of the population, i.e. sampling, we can make serious errors in our conclusions about the population.

Definitions

First, we need to define some terms. These terms will be illustrated using the example of a pre-election poll on which political party is going to win the election.

Population: the entire group of objects/subjects about which information is wanted. For our example, the population is all adults on the electoral register.

Sample: any subset of a population, e.g. a representative subset of individuals from the electoral register.

Unit: any individual member of the population, e.g. an individual on the electoral register.

Sampling frame: a list of the individuals in the population, e.g. the electoral register.

Variable: a characteristic whose value can be measured for each unit and which changes from unit to unit, e.g. the political party the individual will vote for.

Parameter: some value (e.g. an average value or a percentage) that we are interested in calculating for the population, for example the percentage of adults on the electoral register who will vote for a particular political party, or the average age of the voters. We will never find out these values unless we ask everyone in the population. The value of a parameter is fixed but usually unknown. We estimate it by using information from a sample.

Statistic: some value (e.g. an average value or a percentage) that we are interested in calculating for the sample, for example the percentage of a representative sample of adults who will vote for a particular political party, or the average age of this representative sample. We can find out these values since we can ask everyone in the sample, but if we took a different sample of people we might get a different answer. The value of a statistic is known but is not fixed: it varies from sample to sample. We estimate the value of the parameter by using the value of the statistic.
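The relationship between a parameter and a statistic can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the population of 100,000 voters, the 40% support level and the sample size of 1,000 are all invented for illustration. In a real poll the parameter would be unknown; it is only known here because the population is simulated.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = will vote for the party, 0 = will not.
population = [1] * 40_000 + [0] * 60_000

# Parameter: fixed, and in practice unknown.
parameter = sum(population) / len(population)

def sample_statistic(n):
    """Draw a simple random sample of n units and return the sample proportion."""
    sample = rng.sample(population, n)
    return sum(sample) / n

# Statistic: known for each sample, but it changes from sample to sample.
statistics_from_samples = [sample_statistic(1_000) for _ in range(5)]

print("parameter:", parameter)
print("statistics from five different samples:", statistics_from_samples)
```

Each run of `sample_statistic` plays the role of one poll: the five values differ from each other, yet each is an estimate of the same fixed parameter.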
A T.D. is interested in the percentage of her constituents who favour the introduction of an environmental tax on carbon dioxide emissions. Her staff report that letters on the issue have been received from 300 constituents and that 200 of these are in favour of introducing the tax. Identify the population, the variable measured, the sample, the parameter of interest and the value of the statistic.

Bias in Sampling

If there is a tendency for a certain group of the population to be omitted from the sample, or if people who refuse to cooperate form a group which is, in some way, different from those in the sample, we have what is called a biased sample. The definition of bias in the context of sampling is a systematic tendency to overestimate or underestimate the population parameter of interest. For the election example, if my sample consists solely of people from a disadvantaged area with high unemployment, I may underestimate the level of support for a particular political party.

How can we eliminate bias? Bias in selecting the sample can be eliminated by taking a random sample. This is a sample where everyone in the population has the same chance of getting into the sample, and the fact that one individual has got into the sample does not affect the chances of another individual getting into the sample, i.e. everyone has an independent and equal chance of being included in the sample. This method of sampling is called simple random sampling.

Lack of precision in sampling

Precision is a measure of how close a statistic is expected to be to the true value of a parameter. Lack of precision occurs when every time we take a sample and ask the question of interest, we get a very different answer, i.e. the result of the sampling is not repeatable. How can we make conclusions for the population if we get different answers from each sample?

How can we correct this problem? If we increase the sample size, we also increase the repeatability or precision of our results. A large random sample will include many people with lots of different characteristics, whereas a small sample will not have the same range of people or characteristics. Thus, the results from one small sample may differ considerably from the results of another small sample, depending on the range of people and characteristics it includes.
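The difference between bias and lack of precision can be seen in a short simulation. The sketch below is illustrative only: the two areas, their support levels and the sample sizes are assumptions made up for the example. It compares small and large simple random samples from the whole population with a large sample drawn from one area only.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical population split into two areas with different support levels.
area_x = [1] * 10_000 + [0] * 40_000   # 20% support
area_y = [1] * 30_000 + [0] * 20_000   # 60% support
population = area_x + area_y           # 40% support overall

def repeated_estimates(frame, n, repeats=200):
    """Take `repeats` simple random samples of size n from `frame`.

    Returns the average of the sample proportions and their spread
    (standard deviation), which measures how repeatable the result is.
    """
    estimates = [sum(rng.sample(frame, n)) / n for _ in range(repeats)]
    return statistics.mean(estimates), statistics.pstdev(estimates)

true_value = sum(population) / len(population)

small_mean, small_spread = repeated_estimates(population, 50)
large_mean, large_spread = repeated_estimates(population, 2_000)
biased_mean, biased_spread = repeated_estimates(area_x, 2_000)  # area X only

print("true value:", true_value)
print("random, n=50:   mean", small_mean, "spread", small_spread)
print("random, n=2000: mean", large_mean, "spread", large_spread)
print("area X only, n=2000: mean", biased_mean, "spread", biased_spread)
```

Increasing the sample size shrinks the spread of the random-sample estimates, which is the gain in precision. The sample restricted to area X is just as repeatable at the same size, but it settles on the wrong value, so a larger sample does not remove bias.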
1.5 Types of studies

We often design studies to collect specific information. There are different types of studies. These include:

Observational studies: the researcher collects information on attributes or measurements of interest but does not influence events, e.g. surveys, epidemiological studies.

Experimental studies: the researcher deliberately influences events and investigates the effects of the intervention, e.g. laboratory studies and clinical trials.

Prospective studies: data are collected forward in time from the start of the study. Experiments are prospective.

Retrospective studies: data refer to past events and may be acquired from existing sources.

Longitudinal studies: investigate changes over time, possibly in relation to an intervention. Observations are taken at more than one time point. Experiments are usually longitudinal.

Cross-sectional studies: individuals are observed only once.

Experimental studies

In experimental studies, the researcher deliberately influences events and investigates the effects of the intervention. The aim of experimental studies is to study the effect of variables the experimenter manipulates (explanatory variables or factors) on variables that describe the outcome (response variables).

A unit is the object on which the experiment is carried out. When the units are human beings, they are called subjects. A treatment is any specific experimental condition applied to units. The design of the experiment is the pattern or outline according to which treatments are applied to units.

Example: An experimental study to compare pain levels for a group of subjects given different analgesics to combat migraine. What is the response variable? What is the explanatory variable?

Structure of an Experiment

The experimenter wants to see if an explanatory variable or treatment has any effect on the response variable. A control condition is often introduced against which the effects of an explanatory variable can be compared, e.g. a comparison or control group to which the experimental procedure is not applied. For example, in a study to investigate if users of visual display terminals (VDTs) get eye strain or backache, the same questions should be asked of a control group of comparable employees who do not use VDTs.

Before an experiment is carried out, several questions must be answered:
1. How many treatments are to be studied?
2. How many times does each treatment need to be observed?
3. What are the experimental units?
4. How does the experimenter apply the treatments to the available experimental units and then observe the responses?
5. Can the resulting design be analysed or can the desired comparisons be made?

Consider an experiment involving t treatments in which each treatment is applied to r different experimental units. The researcher must select r × t experimental units and randomly assign each treatment to r of the experimental units.

Random does not mean the same as haphazard! By random allocation we mean that each patient has a known chance, usually an equal chance, of being given each treatment, but the treatment to be given cannot be predicted beforehand. The simplest method of random allocation for two groups is tossing a coin: heads is treatment A, tails is treatment B. Other methods include the use of random number tables.

1.6 Summary

There are two types of data: qualitative (nonnumeric) and quantitative (numeric). Data is usually collected from a sample (a representative group from the population). Ideally the sample should be large and randomly selected. Data can be collected from different types of studies, including experiments, where the person carrying out the study makes an intervention and investigates the effect of that intervention.
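The random allocation described in section 1.5, where each of t treatments is assigned to r of the available units, can be sketched as follows. This is one possible approach, not a method prescribed by the notes: it replaces the coin or random number table with a computer-generated shuffle, and the function name `randomly_allocate`, the subject labels and the treatment names are all hypothetical.

```python
import random

def randomly_allocate(units, treatments, r, seed=None):
    """Randomly assign each treatment to exactly r of the given units.

    Requires len(units) == r * len(treatments). The units are shuffled and
    then dealt out in blocks of r, so every unit has the same chance of
    receiving each treatment and the allocation cannot be predicted.
    """
    if len(units) != r * len(treatments):
        raise ValueError("need exactly r units per treatment")
    rng = random.Random(seed)
    shuffled = list(units)      # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    return {
        treatment: shuffled[i * r:(i + 1) * r]
        for i, treatment in enumerate(treatments)
    }

# Hypothetical experiment: t = 3 treatments, r = 4 subjects per treatment.
subjects = [f"subject_{i}" for i in range(1, 13)]
allocation = randomly_allocate(subjects, ["A", "B", "C"], r=4, seed=0)

for treatment, group in allocation.items():
    print(treatment, group)
```

Passing a seed makes the allocation reproducible for checking; in a real study the seed would be left unset or kept concealed so that the treatment to be given cannot be predicted beforehand.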
Problem Session: Producing Data

Q1. The Gardai Siochana wants to know how Dublin inner city residents feel about the police service. A questionnaire with several questions about the police is prepared. A sample of 300 mailing addresses in inner city areas is chosen, and a Garda is sent to each address to administer the questionnaire to an adult living there. Identify the population, variables measured and the sample. In addition, describe the potential bias.

Q2. The ESRI announces that it interviewed all members of the labour force in a sample of 55,800 households, of whom 11.9% were unemployed. Is 11.9% a statistic or a parameter?

Q3. Just before a presidential election, a market research company increases the size of its weekly randomly selected sample from the usual 1,500 people to 4,000 people. Does the larger sample lessen the bias of the poll result?

Q4. A researcher wants to find out the percentage of small and medium sized Irish companies that have and Internet access. A list of small and medium sized Irish companies was obtained from the subscription list to a business magazine. 100 companies were randomly selected from this list using systematic sampling. Of these 100 companies, 50 responded to the researcher's survey. 35 of the 50 companies who responded had and Internet access.
(i) For this example, identify the population, the sampling frame, the sample and the variable measured.
(ii) What is the parameter of interest in this example?
(iii) What is the estimate of this parameter?
(iv) Describe the potential bias in this example.

Q5. A department store collected information from a random sample of people in their shop on a Saturday morning. The information was as follows:
Marital status of customer (Single, Married, Widowed, Divorced)
How frequently the customers shopped in the store (Rarely, Occasionally, Weekly, Daily)
Amount spent on a typical visit to the store
Age of customer (in years)
What type of data is generated by each of these variables?
Q6. A mobile phone company wanted to find out how its customers rated the customer service offered by its call centre operators. A list of all customers was obtained and 500 customers were randomly selected from the list. These 500 customers were posted a questionnaire asking them to rate customer service on a scale of 0 to 10. Customers who responded were given 10 credit on their account. 250 customers responded and gave a mean rating of 5 for customer service.
(i) For this example, identify the population, the sampling frame, the sample and the variable measured.
(ii) What is the main parameter of interest in this example?
(iii) What is the best estimate of this parameter?
(iv) Describe the potential bias in this example.

Q7. Some people believe that exercise raises the body's metabolic rate for as long as 12 to 24 hours and thus enables us to continue to burn off fat after we end our workout. In a study of this effect, subjects were asked to walk briskly on a treadmill for several hours. Their metabolic rate was measured before, immediately after and 12 hours after the exercise. Was this study an experiment? Why or why not? What are the explanatory and response variables?

Q8. In order to determine if living next to high-voltage power lines increases the risk of getting cancer, researchers selected several homes at random, determined if they were within 50 yards of a high-voltage power line and recorded whether anyone in the home had cancer. They compared the proportion of cancer cases in homes within 50 yards of a high-voltage power line to the proportion in homes more than 50 yards from a high-voltage power line.
(i) What is the response variable in this study? What is the explanatory variable in this study?
(ii) Is this an experiment? Why or why not?
(iii) What other important variables may be present in this study?
Solutions to Problem Session 1

Q1.
Population: all Dublin inner city residents.
Variable measured: opinion on the police service, e.g. on a rating scale.
Sample: the adults living at the 300 addresses chosen who are willing to respond (no information is given on the response rate).
Potential bias: the survey may overestimate positive feedback on the police service because a Garda is asking the questions (respondents may be slow to give negative feedback to a Garda). It would be better to have someone neutral or trusted by the community carry out the survey. We would also need information on the response rate.

Q2. 11.9% is a statistic since it is calculated from a sample.

Q3. A larger sample will not affect bias. Increasing the sample size improves precision, i.e. the repeatability of the sampling.

Q4.
(i) Population: all small and medium sized Irish companies.
Sampling frame: the small and medium sized Irish companies on a subscription list to a business magazine.
Sample: the target sample is the 100 companies selected, but the actual sample is the 50 companies who responded.
Variable measured: Have you and Internet access? Yes/No
(ii) Parameter of interest: percentage of all small and medium sized Irish companies who have and Internet access.
(iii) Best estimate of this parameter is the statistic, which is 35/50 = 70%.
(iv) Potential bias: low response rate. We may have over- or underestimated the percentage because half the companies did not respond and these could be different from those who did respond. Also, are all Irish small and medium sized companies on the subscription list? If not, are those on the list different from those not on the list?

Q5.
Marital status: Qualitative (nominal)
Frequency of shopping: Qualitative (ordinal)
Amount spent: Quantitative (ratio)
Age: Quantitative (ratio)
Q6.
(i) Population: all customers of the mobile phone company.
Sampling frame: the list of all customers.
Sample: the target sample is the 500 customers selected, but the actual sample is the 250 customers who responded.
Variable measured: rating of customer service on a scale of 0 to 10.
(ii) Parameter of interest: mean rating of customer service by all customers.
(iii) Best estimate of this parameter is the statistic, which is 5.
(iv) Potential bias: low response rate. We may have over- or underestimated the rating because half the customers did not respond and these may have different ratings from those who did respond. Also, there was a monetary incentive to respond: will customers who want 10 in credit be different from other customers, and will this affect their rating?

Q7. The study is an experiment since the researcher intervened, i.e. the subjects had to walk briskly on a treadmill.
Explanatory variable: time since exercise (before, immediately after and 12 hours after).
Response variable: metabolic rate.

Q8.
Response variable: whether anyone in the home had cancer.
Explanatory variable: is the house within 50 yards of a high-voltage power line? Yes/No.
This is not an experiment since the behaviour of the subjects has not been affected: no intervention took place.
Other variables of interest (called confounding variables): family history, occupation etc. We need to define the response variable carefully. Are all types of cancer of interest? Are only those currently living in the house of interest? Are cancer cases of previous residents in the house included? Are cancer deaths in the household included?
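The arithmetic behind the estimates and response rates in the solutions to Q4 and Q6 can be written out as a short check. The helper name `proportion` is an assumption introduced for this sketch; the counts are those given in the problems.

```python
def proportion(count, total):
    """Return count / total as a fraction between 0 and 1."""
    return count / total

# Q4: 100 companies selected, 50 responded, 35 of those answered "yes".
q4_estimate = proportion(35, 50)         # the statistic used to estimate the parameter
q4_response_rate = proportion(50, 100)   # half of the target sample did not respond

# Q6: 500 customers selected, 250 responded.
q6_response_rate = proportion(250, 500)

print(f"Q4 estimate: {q4_estimate:.0%}, response rate: {q4_response_rate:.0%}")
print(f"Q6 response rate: {q6_response_rate:.0%}")
```

In both questions the estimate is based only on the half of the target sample that responded, which is why the solutions flag the response rate as a potential source of bias.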
