MEI Structured Mathematics. Module Summary Sheets. Statistics 2 (Version B: reference to new book)

MEI Mathematics in Education and Industry

Topic 1: The Poisson Distribution
Topic 2: The Normal Distribution
Topic 3: Samples and Hypothesis Testing
  1. Test for population mean of a Normal distribution
  2. Contingency Tables and the Chi-squared Test
Topic 4: Bivariate Data

Purchasers have the licence to make multiple copies for use within a single establishment.

November, 5
MEI, Oak House, 9 Epsom Centre, White Horse Business Park, Trowbridge, Wiltshire. BA4 XG.
Company No. 36549 England and Wales. Registered with the Charity Commission, number 589.
Tel: 5 776776. Fax: 5 775755.

Summary S2 Topic 1: The Poisson Distribution

Chapter Pages -4 Example. Page 4 Exercise A Q. 7 Chapter Pages 5-6 Example. Page 6 Exercise A Q. 4, 5, Chapter Pages -5 Exercise B Q. 4, 8 Chapter Pages 8- Exercise C Q. (i), 3, 7 Exercise D Q. 4, 5

The Poisson distribution describes a discrete random variable X where

$P(X = r) = e^{-\lambda} \dfrac{\lambda^r}{r!}$

The parameter, λ, is the mean of the distribution. We write X ~ Poisson(λ).

The distribution may be used to model the number of occurrences of an event in a given interval provided the occurrences are:
(i) random,
(ii) independent,
(iii) occurring at a fixed average rate.

Mean, E(X) = λ. Variance, Var(X) = λ.
Mean ≈ Variance is a quick way of seeing if a Poisson model might be appropriate for some data.

It is possible to calculate terms of the Poisson distribution by a recurrence relationship.
E.g. $P(X = r) = e^{-\lambda} \dfrac{\lambda^r}{r!}$; $P(X = r + 1) = e^{-\lambda} \dfrac{\lambda^{r+1}}{(r + 1)!} = \dfrac{\lambda}{r + 1} P(X = r)$
Care needs to be taken over the accumulation of errors.

Use of cumulative probability tables
Cumulative Poisson probability tables are on pages 4-4 of the Students' Handbook and are available in the examinations. They give cumulative probabilities, i.e. P(X ≤ r).
So P(X = r) = P(X ≤ r) − P(X ≤ r − 1).
For λ = 1.8 (Page 4), the second and third entries of the tables give P(X ≤ 1) = 0.4628 and P(X ≤ 2) = 0.7306,
i.e. P(X = 2) = P(X ≤ 2) − P(X ≤ 1) = 0.7306 − 0.4628 = 0.2678

Sum of Poisson distributions
Two or more Poisson distributions can be combined by addition providing they are independent of each other.
If X ~ Poisson(λ) and Y ~ Poisson(μ), then X + Y ~ Poisson(λ + μ).
N.B. You may only add two Poisson distributions in this way if they are independent of each other. There is no corresponding result for subtraction.

Approximation to the binomial distribution
The Poisson distribution may be used as an approximation to the binomial distribution B(n, p) when
(i) n is large,
(ii) p is small (so the event is rare).
Then λ = np.
Note that the Normal distribution is likely to be a good approximation if np is large.

E.g. The mean number of telephone calls to an office in a given interval of minutes is 2.
The probability distribution (Poisson(2)) is as follows:
$P(X = 0) = e^{-2} = 0.1353$
$P(X = 1) = 2e^{-2} = 0.2707$
$P(X = 2) = \dfrac{2^2}{2!} e^{-2} = 0.2707$
$P(X = 3) = \dfrac{2^3}{3!} e^{-2} = 0.1804$
$P(X = 4) = \dfrac{2^4}{4!} e^{-2} = 0.0902$
$P(X > 4) = 1 - \text{sum of above} = 0.0527$

E.g. For λ = 2, find P(3) from tables.
Note that $P(3) = e^{-2} \dfrac{2^3}{3!} = 0.1353 \times \dfrac{8}{6} = 0.1804$
From tables, P(3) = P(X ≤ 3) − P(X ≤ 2) = 0.8571 − 0.6767 = 0.1804

E.g. Vehicles passing along a road were counted and categorised as either private or commercial; on average there were 3 private and 2 commercial vehicles passing a point every minute. It is assumed that the distributions are independent of each other. Find the probability that in a given minute there will be
(i) no private vehicles,
(ii) no vehicles,
(iii) exactly one vehicle.
For the private vehicles, $\lambda_p = 3$ and for commercial vehicles $\lambda_c = 2$. Each count may be modelled by the Poisson distribution so that for private vehicles, X ~ Poisson(3) and for commercial vehicles, Y ~ Poisson(2).
For all vehicles, Z = X + Y ~ Poisson(3 + 2) = Poisson(5).
(i) $P(X = 0) = e^{-3} = 0.0498$
(ii) $P(Z = 0) = e^{-5} = 0.0067$
(iii) $P(Z = 1) = 5e^{-5} = 0.0337$
Note that P(1 veh) = P(1 priv)·P(0 comm) + P(0 priv)·P(1 comm)
$= 3e^{-3} \cdot e^{-2} + e^{-3} \cdot 2e^{-2} = (2 + 3)e^{-5} = 5e^{-5} = 0.0337$

E.g. Some equivalent values:
For X ~ B(40, 0.03), np = 1.2; $P(0) = (0.97)^{40} = 0.296 \approx 0.3$
For X ~ Poisson(1.2), $P(0) = e^{-1.2} = 0.301 \approx 0.3$
For X ~ B(80, 0.02), np = 1.6; $P(0) = (0.98)^{80} = 0.199 \approx 0.2$
For X ~ Poisson(1.6), $P(0) = e^{-1.6} = 0.202 \approx 0.2$

Competence statements P1, P2, P3, P4, P5
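The Poisson calculations on this sheet can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`poisson_pmf`, `poisson_table`, `poisson_cdf`) are illustrative choices and not part of the MEI material. It builds the probabilities with the recurrence relationship and then reproduces the Poisson(2) telephone example, the "from tables" subtraction, and the combined vehicles example with Z ~ Poisson(5).

```python
import math


def poisson_pmf(r, lam):
    """P(X = r) for X ~ Poisson(lam), straight from the formula."""
    return math.exp(-lam) * lam ** r / math.factorial(r)


def poisson_table(lam, r_max):
    """[P(X = 0), ..., P(X = r_max)] built with the recurrence
    P(X = r + 1) = lam / (r + 1) * P(X = r)."""
    probs = [math.exp(-lam)]
    for r in range(r_max):
        probs.append(probs[-1] * lam / (r + 1))
    return probs


def poisson_cdf(r, lam):
    """Cumulative probability P(X <= r), as given in the tables."""
    return sum(poisson_table(lam, r))


# Telephone calls example: X ~ Poisson(2)
calls = poisson_table(2, 4)
print([round(p, 4) for p in calls])   # [0.1353, 0.2707, 0.2707, 0.1804, 0.0902]
print(round(1 - sum(calls), 4))       # P(X > 4) = 0.0527

# P(X = 3) as a difference of cumulative probabilities
print(round(poisson_cdf(3, 2) - poisson_cdf(2, 2), 4))   # 0.1804

# Vehicles example: independent Poisson(3) and Poisson(2) add to Poisson(5)
print(round(poisson_pmf(1, 3 + 2), 4))                   # 0.0337
```

Because each term in `poisson_table` is computed from the previous one, any rounding made by hand at an early step carries forward; that is the accumulation of errors the sheet warns about.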

Summary S2 Topic 2: The Normal Distribution

Chapter Pages 3-44 Example. Page 35 Exercise A Q. 4 Chapter Pages 49, 5 Example.3 Page 49 Exercise B Q. Chapter Pages 5-5 Exercise B Q. 4 Chapter Pages 5-54 Exercise B Q. Exercise C Q., 6, 7

The Normal distribution, N(μ, σ²), is a continuous, symmetric distribution with mean μ and standard deviation σ.
The standard Normal distribution N(0, 1) has mean 0 and standard deviation 1.
P(X < x) is represented by the area under the curve below x.
(It is a special case of a continuous probability density function, which is a topic in Statistics 3.)

The area under the standard Normal distribution curve can be found from tables. To find the area under any other Normal distribution curve, the values need to be standardised by the formula

$z = \dfrac{x - \mu}{\sigma}$

Modelling
Many distributions in the real world, such as adult heights or intelligence quotients, can be modelled well by a Normal distribution with appropriate mean and variance. Given the mean μ and standard deviation σ, the Normal distribution N(μ, σ²) may often be used.
When the underlying distribution is discrete then the Normal distribution may often be used, but in this case a continuity correction must be applied. This requires us to take the mid-point between successive possible values when working with continuous distribution tables.
E.g. P(X > 3) means P(X > 3.5) if X can take only integer values.

The Normal approximation to the binomial distribution
This is a valid process provided
(i) n is large,
(ii) p is not too close to 0 or 1.
Mean = np, Variance = npq.
The approximation will be N(np, npq).
A continuity correction must be applied because we are approximating a discrete distribution by a continuous distribution.

The Normal approximation to the Poisson distribution
This is a valid process provided λ is sufficiently large for the distribution to be reasonably symmetric. A good guideline is if λ is at least 10.
For a Poisson distribution, mean = variance = λ.
The approximation will be N(λ, λ).
As with the binomial distribution, a continuity correction must be applied because we are approximating a discrete distribution by a continuous distribution.

For N(0, 1), P(Z < z) can be found from tables (Students' Handbook, page 44).
E.g. P(Z < 0.7) = 0.7580
P(Z > 0.7) = 1 − 0.7580 = 0.2420

For N(μ, 9) [σ = 3] and a value x that is 3 above the mean,
P(X < x) = P(Z < z) where z = (x − μ)/3 = 1, so P(X < x) = 0.8413

E.g. The distribution of masses of adult males may be modelled by a Normal distribution with mean 75 kg and standard deviation 8 kg. Find the probability that a man chosen at random will have mass between 70 kg and 90 kg.
We require $P(70 < X < 90) = P(z_1 < Z < z_2)$
where $z_1 = \dfrac{70 - 75}{8} = -0.625$ and $z_2 = \dfrac{90 - 75}{8} = 1.875$
P(70 < X < 90) = P(Z < 1.875) − P(Z < −0.625)
= P(Z < 1.875) − (1 − P(Z < 0.625))
= 0.9696 − (1 − 0.7340) = 0.7036

E.g. Find the probability that when a die is thrown 30 times there are at least 10 sixes.
Using the binomial distribution requires P(30 sixes) + P(29 sixes) + ... + P(10 sixes).
However, using N(np, npq) where n = 30 and p = 1/6 gives N(5, 4.167).
P(X > 9.5) = P(Z > z) = 1 − 0.9863 = 0.0137
where z = (9.5 − 5)/2.041 = 2.205
N.B. A continuity correction is applied because the original distribution (binomial) is being approximated by a continuous distribution (Normal).

E.g. A large firm has 50 telephone lines. On average, 40 lines are in use at once and the distribution may be modelled by Poisson(40). Find the probability of there not being enough lines.
The distribution is Poisson(40). Approximate by N(40, 40).
Then we require P(X > 50) = P(Z > z)
where $z = \dfrac{50.5 - 40}{\sqrt{40}} = 1.66$
P(X > 50) = 1 − 0.9515 = 0.0485

Competence statements N1, N2, N3, N5
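The standardising step and the two continuity-corrected approximations can be sketched in code as follows. This is an illustration only, using `statistics.NormalDist` from the Python standard library in place of printed tables; the helper names are hypothetical. The figures used are those of the worked examples: masses N(75, 8²) between 70 kg and 90 kg, at least 10 sixes in 30 throws, and more than 50 lines in use when the count is Poisson(40). Results may differ from table-based answers in the fourth decimal place because tables round z to two decimal places.

```python
from math import sqrt
from statistics import NormalDist


def normal_prob_between(lo, hi, mu, sigma):
    """P(lo < X < hi) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(hi) - dist.cdf(lo)


def binomial_at_least(k, n, p):
    """Normal approximation to P(X >= k) for X ~ B(n, p).
    Continuity correction: X >= k becomes X > k - 0.5."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return 1 - NormalDist(mu, sigma).cdf(k - 0.5)


def poisson_more_than(k, lam):
    """Normal approximation to P(X > k) for X ~ Poisson(lam).
    Continuity correction: X > k becomes X > k + 0.5."""
    return 1 - NormalDist(lam, sqrt(lam)).cdf(k + 0.5)


# Masses of adult males: N(75, 8^2), between 70 kg and 90 kg
print(round(normal_prob_between(70, 90, 75, 8), 4))

# At least 10 sixes in 30 throws of a die
print(round(binomial_at_least(10, 30, 1 / 6), 4))

# 50 telephone lines, usage modelled by Poisson(40)
print(round(poisson_more_than(50, 40), 4))
```

Keeping the continuity correction inside the helper functions means the caller states the event in terms of the discrete variable (at least 10, more than 50), which is where mistakes with the half-unit shift usually arise.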

Summary S2 Topic 3: Samples and Hypothesis Testing 1: Estimating the population mean of a Normal distribution

Pages 68-7 Example 3. Page 7 Pages 7-73 Exercise 3A Q. (i),(iii) Pages 73-74 Example 3.3 Page 74 Exercise 3A Q. 6

The distribution of sample means
If a population may be modelled by a Normal distribution and samples of size n are taken from the population, then the distribution of means of these samples is also Normal.
If the parent population is $N(\mu, \sigma^2)$ then the sampling distribution of means is $N\left(\mu, \dfrac{\sigma^2}{n}\right)$.

Hypothesis test for the mean using the Normal distribution
Tests on the mean using a single sample.
H₀ is μ = μ₀ where μ₀ is some specified value.
H₁ may be one-tailed: μ < μ₀ or μ > μ₀
or two-tailed: μ ≠ μ₀.
In other words, given the mean of the sample taken we ask the question, "Could the mean of the parent population be what we think it is?"
Suppose the parent population is $N(\mu_0, \sigma^2)$; then the sampling distribution of means is $N\left(\mu_0, \dfrac{\sigma^2}{n}\right)$. The critical values are therefore $\mu_0 \pm k \dfrac{\sigma}{\sqrt{n}}$, where the value of k depends on the level of significance and whether it is a one- or two-tailed test.
We therefore calculate the value $z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ and compare it to the value found in tables.
Alternatively, if the value of the mean of the sample lies inside the acceptance region then we would accept H₀, but if it lies in the critical region then we would reject H₀ in favour of H₁.
Alternatively, calculate the probability of obtaining a value at least as extreme as the one found and see if it is less than the significance level being used.

Known and estimated standard deviation
The hypothesis test described above requires the value of the standard deviation of the parent population. In reality the standard deviation of the parent population will usually not be known and will have to be estimated from the sample data. If the sample size is sufficiently large, the s.d. of the sample may be used as the s.d. of the parent population. A good guideline is to require n ≥ 50.

E.g. If the parent population is N(20, 16) and a sample of size 25 has mean 18.6, then this value comes from the sampling distribution of means which is N(20, 0.64).

E.g. It is thought that the parent population is Normally distributed with a specified mean μ₀. A random sample of 50 data items has a sample mean 4.2 greater than μ₀ and s.d. 8.3. Is there any evidence at the 0.1% significance level that the mean of the population is not μ₀?
H₀: μ = μ₀
H₁: μ ≠ μ₀
(Note that although the mean of the sample is greater than the proposed mean, we do not have μ > μ₀ because of the wording of the question.)
$z = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}} = \dfrac{4.2}{8.3 / \sqrt{50}} = 3.578$
Critical value from tables for a two-tailed test at the 0.1% significance level is 3.27.
Since 3.578 > 3.27 we reject H₀ in favour of H₁. There is evidence that the mean of the population is not μ₀.

E.g. A population has variance 16. It is required to test at the 0.5% level of significance whether the mean of the population could be 20 or whether it is less than this. A random sample of size 25 has a mean of 18.6.
H₀: μ = 20
H₁: μ < 20
k = 2.58 (for 0.5% level, 1-tailed test)
Critical value is $20 - 2.58 \times \dfrac{4}{\sqrt{25}} = 20 - 2.064 = 17.936$
Since 18.6 > 17.936 we accept H₀; there is no evidence at the 0.5% level of significance that the mean is less than 20.
Alternatively, if the mean is 20 then the sampling distribution of means is N(20, 0.64).
Then $P(\bar{X} \le 18.6) = \Phi\left(\dfrac{18.6 - 20}{0.8}\right) = \Phi(-1.75) = 1 - 0.9599 = 0.0401$
Since 0.0401 > 0.005 we accept H₀.

Competence statements N6
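The two equivalent routes through a one-tailed test (compare the sample mean with a critical value, or compare a tail probability with the significance level) can be sketched as below. This is a hedged illustration using `statistics.NormalDist`; the function names are hypothetical, and the numbers are those of the example with population variance 16, hypothesised mean 20, sample size 25, sample mean 18.6 and a 0.5% significance level. The code uses the exact percentage point (about 2.576) rather than the tabulated 2.58, so the critical value differs slightly from 17.936.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # N(0, 1)


def z_statistic(sample_mean, mu0, sigma, n):
    """Standardised distance of the sample mean from the hypothesised mean."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def lower_critical_mean(mu0, sigma, n, alpha):
    """Smallest sample mean that does not fall in the critical region
    for H1: mu < mu0 at significance level alpha."""
    k = STANDARD_NORMAL.inv_cdf(1 - alpha)
    return mu0 - k * sigma / sqrt(n)


mu0, sigma, n, x_bar, alpha = 20, 4, 25, 18.6, 0.005

z = z_statistic(x_bar, mu0, sigma, n)
critical = lower_critical_mean(mu0, sigma, n, alpha)
p_value = STANDARD_NORMAL.cdf(z)  # P(sample mean <= 18.6) if H0 is true

print(round(z, 2))          # -1.75
print(round(critical, 3))   # close to the sheet's 17.936
print(round(p_value, 4))    # 0.0401

# Both routes give the same decision.
reject_by_critical_value = x_bar < critical
reject_by_probability = p_value < alpha
print(reject_by_critical_value, reject_by_probability)   # False False
```

For a two-tailed test the same `z_statistic` applies; the percentage point would instead be taken at `1 - alpha / 2` and compared with the absolute value of z.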

Summary S2 Topic 3: Samples and Hypothesis Testing 2: Contingency Tables and the Chi-squared Test

Pages 8-85
Chapter Pages 87-9 Page 85 Exercise 3B Q. 4, 5 Exercise 3C Q., 8

Contingency Tables
Suppose the elements of a population have 2 sets of distinct characteristics {X, Y}, each set containing a finite number of discrete characteristics $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$; then each element of the population will have a pair of characteristics $(x_i, y_j)$.
The frequencies of these m × n pairs $(x_i, y_j)$ can be tabulated into an m × n contingency table.

         y_1     y_2     ...   y_n
  x_1    f_1,1   f_1,2   ...   f_1,n
  x_2    f_2,1   f_2,2   ...   f_2,n
  ...
  x_m    f_m,1   f_m,2   ...   f_m,n

The marginal totals are the sum of the rows and the sum of the columns and it is usual to add a row and a column for these.
The requirement is to determine the extent to which the variables are related.
If they are not related but independent, the theoretical probabilities can be estimated from the sample data.
You now have two tables, one containing the actual (observed) frequencies and the other containing the estimated expected frequencies based on the assumption that the variables are independent.

E.g. A group of 50 students was selected at random from the whole population of students at a College. Each was asked whether they drove to College or not and whether they lived nearer to or further from the College than a stated distance (in km). The results are shown in this table.

                   Nearer   Further   Total
  Drives             11       17       28
  Does not drive     15        7       22
  Total              26       24       50

The hypothesis test
H₀: The variables are not associated.
H₁: The variables are associated.

The χ² statistic (Chi-squared statistic)
This statistic measures how far apart the sets of observed and expected frequencies are.

$X^2 = \sum \dfrac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}} = \sum \dfrac{(f_o - f_e)^2}{f_e}$

Degrees of freedom
The distribution depends on the number of free variables there are, called the degrees of freedom, ν. This is the number of cells less the number of restrictions placed on the data.
For a 2 × 2 table such as the example given, the number of cells to be filled is 4, but the overall total is 50, which is a restriction, and the proportions for each variable were also estimated from the data, giving two further restrictions. So the number of degrees of freedom in the example is 1.
In general the number of degrees of freedom for an m × n table is (m − 1)(n − 1).

E.g. If driving to College and the distance lived are not associated events then if one student is chosen at random the estimated probabilities are
$P(\text{drives}) = \dfrac{28}{50}$, $P(\text{lives further}) = \dfrac{24}{50}$
and $P(\text{drives and lives further}) = \dfrac{28}{50} \times \dfrac{24}{50} = 0.2688$
So out of 50 people we would expect 50 × 0.2688 = 13.44.
In a similar way the entries in the other three boxes are calculated to give the following:

                   Nearer   Further   Total
  Drives           14.56    13.44      28
  Does not drive   11.44    10.56      22
  Total              26       24       50

We test the hypotheses:
H₀: The two events are not associated.
H₁: The two events are associated.

$X^2 = \sum \dfrac{(f_o - f_e)^2}{f_e} = \dfrac{(11 - 14.56)^2}{14.56} + \dfrac{(17 - 13.44)^2}{13.44} + \dfrac{(15 - 11.44)^2}{11.44} + \dfrac{(7 - 10.56)^2}{10.56}$
$= 0.8704 + 0.9430 + 1.1078 + 1.2002 = 4.1214$

If the test is at the 5% level then the tables on page 45 of the MEI Students' Handbook give the critical value of 3.84 (ν = 1).
Since 4.1214 > 3.84 we reject the null hypothesis, H₀, and conclude that there is evidence that the two events are associated.
If the test were at the 1% significance level then we would conclude that there was not enough evidence to reject the null hypothesis.

Competence statements H1, H2
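The steps of the test (marginal totals, expected frequencies under independence, the X² statistic, degrees of freedom) can be sketched as follows. This is a minimal illustration in plain Python with hypothetical function names. The observed counts used (11, 17, 15, 7) are the values consistent with the totals and expected frequencies of the students example, and should be treated as an assumption about the original table.

```python
def expected_frequencies(observed):
    """Expected cell frequencies assuming the two variables are independent:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Return (X^2 statistic, degrees of freedom) for an m x n table."""
    expected = expected_frequencies(observed)
    statistic = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Rows: drives / does not drive. Columns: nearer / further.
observed = [[11, 17],
            [15, 7]]

print(expected_frequencies(observed))   # about [[14.56, 13.44], [11.44, 10.56]]
statistic, dof = chi_squared(observed)
print(round(statistic, 4), dof)         # 4.1214 1

# Critical value for 1 degree of freedom at the 5% level, taken from tables.
CRITICAL_5_PERCENT_DOF_1 = 3.84
print(statistic > CRITICAL_5_PERCENT_DOF_1)   # True: reject H0 at the 5% level
```

The critical value is entered by hand because the standard library has no chi-squared distribution; it would be read from the same tables used in the examination.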

Summary S2 Topic 4: Bivariate Data

Pages 4-9 Pages -
Pages -4 Example 4. Page Exercise 4A Q.

Bivariate data are pairs of values (x, y) associated with a single item, e.g. lengths and widths of leaves.
The individual variables x and y may be discrete or continuous.
A scatter diagram is obtained by plotting the points $(x_1, y_1)$, $(x_2, y_2)$ etc.
Correlation is a measure of the linear association between the variables.
A line of best fit is a line drawn to fit the set of data points as closely as possible.
This line will pass through the mean point $(\bar{x}, \bar{y})$ where $\bar{x}$ is the mean of the x values and $\bar{y}$ is the mean of the y values.
There is said to be perfect correlation if all the points lie on a line.

Correlation and Regression
If the x and y values are both regarded as values of random variables, then the analysis is correlation. Choose a sample from a population and measure two attributes.
If the x value is non-random (e.g. time at fixed intervals) then the analysis is regression. Choose the value of one variable and measure the corresponding value of another.

E.g. 5 students are selected at random and their heights and weights are measured. This will require correlation analysis.
A ball is bounced 5 times from each of a number of different heights and the height of the bounce is recorded. This will require regression analysis.

Notation for n pairs of observations (x, y)
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum (x_i - \bar{x})(x_i - \bar{x})$
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum (y_i - \bar{y})(y_i - \bar{y})$
The alternative form for $S_{xy}$ is
$S_{xy} = \sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n} = \sum x_i y_i - n\bar{x}\bar{y}$
and similarly $S_{xx} = \sum x_i^2 - n\bar{x}^2$, $S_{yy} = \sum y_i^2 - n\bar{y}^2$.

Example 1
The length (x cm) and width (y cm) of leaves of a tree were measured and recorded as follows:

  x   2.1   2.3   2.7   3.0   3.4   3.9
  y   1.1   1.3   1.4   1.6   1.9   1.7

[Scatter graph of y against x for the six leaves]

The mean point is $(\bar{x}, \bar{y})$, which is (2.9, 1.5).
The line of best fit is drawn through the point (2.9, 1.5).
For the data above:
$\sum x = 17.4$; $\sum y = 9.0$; $\sum xy = 26.97$
$S_{xy} = 26.97 - \dfrac{17.4 \times 9.0}{6} = 0.87$

Pearson's Product Moment Correlation Coefficient provides a standardised measure of covariance.

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

The pmcc lies between −1 and +1.
For the data above:
$\sum x^2 = 52.76$, so $S_{xx} = 2.30$
$\sum y^2 = 13.92$, so $S_{yy} = 0.42$
$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{0.87}{\sqrt{2.30 \times 0.42}} = 0.885$
r can be found directly with an appropriate calculator.

Competence statements b1, b2, b3
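The sums $S_{xx}$, $S_{yy}$, $S_{xy}$ and the product moment correlation coefficient can be computed directly from the alternative (sum-based) forms. The sketch below is illustrative only and the function names are hypothetical. The leaf lengths and widths listed are the values consistent with the quoted totals (Σx = 17.4, Σy = 9.0, Σxy = 26.97, Σx² = 52.76, Σy² = 13.92); treat them as an assumption about the original data table.

```python
from math import sqrt


def s_xy(xs, ys):
    """S_xy = sum(x*y) - sum(x)*sum(y)/n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def s_xx(xs):
    """S_xx = sum(x^2) - (sum(x))^2 / n; use s_xx(ys) for S_yy."""
    return s_xy(xs, xs)


def pmcc(xs, ys):
    """Pearson's product moment correlation coefficient r."""
    return s_xy(xs, ys) / sqrt(s_xx(xs) * s_xx(ys))


lengths = [2.1, 2.3, 2.7, 3.0, 3.4, 3.9]   # x, cm (assumed values)
widths = [1.1, 1.3, 1.4, 1.6, 1.9, 1.7]    # y, cm (assumed values)

print(round(s_xy(lengths, widths), 2))   # 0.87
print(round(s_xx(lengths), 2))           # 2.3
print(round(s_xx(widths), 2))            # 0.42
print(round(pmcc(lengths, widths), 3))   # 0.885
```

Writing $S_{xx}$ as `s_xy(xs, xs)` reflects the notation on the sheet, where $S_{xx}$ is $S_{xy}$ with y replaced by x.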

Summary S2 Topic 4: Bivariate Data

Pages 8-4 Exercise 4C Q. 3 Pages 3-34 Example 4.3 Page 34 Exercise 4C Q. 5
Pages 4-44 Exercise 4D Q. 4

Testing a parent population correlation by means of a sample where r has been found
The value of r found for a sample can be used to test hypotheses about the value of ρ, the correlation in the parent population.
Conditions:
(i) the values of x and y must be taken from a bivariate Normal distribution,
(ii) the data must be a random sample.
An indication that a bivariate Normal distribution is a valid model is shown by a scatter plot which is roughly elliptical with the points denser near the middle.
H₀: ρ = 0. There is no correlation between the two variables.
H₁: ρ ≠ 0. There is correlation between the two variables (2-tailed test).
Or: H₁: ρ > 0. There is positive correlation between the two variables (1-tailed test).
Or: H₁: ρ < 0. There is negative correlation between the two variables (1-tailed test).

For the data of Example 1: r = 0.885
We wish to test the hypothesis that there is positive correlation between lengths and widths of the leaves of the tree.
H₀: ρ = 0. There is no correlation between the two variables.
H₁: ρ > 0. There is positive correlation between the two variables (1-tailed test).
From the Students' Handbook, the critical value for n = 6 at the 5% level (one-tailed test) is 0.7293.
Since 0.885 > 0.7293 there is evidence that H₀ can be rejected and that there is positive correlation between the two variables.

Spearman's coefficient of rank correlation
If the data do not look linear when plotted on a scatter graph (but appear to be reasonably monotonic), or if the rank order instead of the values is given, then the Pearson correlation coefficient is not appropriate. Instead, Spearman's rank correlation coefficient should be used. It is usually calculated using the formula

$r_s = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)}$

where d is the difference in ranks for each data pair.
This coefficient is used:
(i) when only ranked data are available,
(ii) when the data cannot be assumed to be linear.
In the latter case, the data should be ranked.
Where $r_s$ has been found the hypothesis test is set up in the same way. The condition here is that the sample is random.
Make sure that you use the right tables!

Tied Ranks
If two ranks are tied in, say, the 3rd place then each should be given the rank 3½.

Example 2
Two judges ranked 5 competitors, A to E. The differences d between the two judges' ranks for the five competitors give $\sum d^2 = 14$.
$r_s = 1 - \dfrac{6 \times 14}{5 \times 24} = 0.3$

For the data of Example 2: $r_s = 0.3$
H₀: ρ = 0. There is no correlation between the two variables.
H₁: ρ > 0. There is positive correlation between the two variables (1-tailed test).
For n = 5 at the 5% level (1-tailed test), the critical value is 0.9.
Since 0.3 < 0.9 we are unable to reject H₀ and conclude that there is no evidence to suggest correlation.

The least squares regression line
For each value of x, the value of y given and the value on the line may differ by an amount called the residual.
If the data pair is $(x_i, y_i)$ and the line of best fit is y = a + bx, then $y_i - (a + bx_i) = e_i$, giving the residual $e_i$.
The least squares regression line is the line that minimises the sum of the squares of the residuals.
The equation of the line is

$y - \bar{y} = \dfrac{S_{xy}}{S_{xx}} (x - \bar{x})$

E.g. For five data pairs (x, y) with
$\sum x = 15$, $\sum y = 17.9$, $\sum x^2 = 55$, $\sum xy = 65$:
$\bar{x} = \dfrac{15}{5} = 3$, $\bar{y} = \dfrac{17.9}{5} = 3.58$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 55 - 5 \times 3^2 = 10$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 65 - 5 \times 3 \times 3.58 = 11.3$
$y - 3.58 = \dfrac{11.3}{10} (x - 3)$
$y = 1.13x + 0.19$

Competence statements b4, b5, b6, b7, b8
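Spearman's formula, the treatment of tied ranks, and the least squares line can each be sketched in a few lines. The code below is an illustration with hypothetical function names, not a prescribed method. It works from the summary figures of the two worked examples (Σd² = 14 with n = 5, and n = 5, Σx = 15, Σy = 17.9, Σx² = 55, Σxy = 65), since those are all the formulas require.

```python
def average_ranks(values):
    """Rank values from 1 (smallest). Tied values share the mean of the
    ranks they occupy, e.g. a tie for 3rd and 4th place gives 3.5 each."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_from_sum_d2(sum_d2, n):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def regression_line(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the least squares line y = a + b*x,
    using b = S_xy / S_xx and the fact that the line passes
    through the mean point."""
    x_bar, y_bar = sum_x / n, sum_y / n
    sxx = sum_x2 - n * x_bar ** 2
    sxy = sum_xy - n * x_bar * y_bar
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


print(average_ranks([10, 20, 30, 30, 50]))        # [1.0, 2.0, 3.5, 3.5, 5.0]
print(round(spearman_from_sum_d2(14, 5), 3))      # 0.3

a, b = regression_line(5, 15, 17.9, 55, 65)
print(round(a, 2), round(b, 2))                   # 0.19 1.13
```

`regression_line` takes the sums rather than the raw data so that it mirrors the hand calculation; with raw data available, the sums would be computed first and passed in.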