CEB - Basic Data Analysis Principles

Basic Data Analysis Principles
What to do once you get the data

"When we reason about quantitative evidence, certain methods for displaying and analyzing data are better than others. Superior methods are more likely to produce truthful, credible, and precise findings. The difference between an excellent analysis and a faulty one can sometimes have momentous consequences."
- Edward R. Tufte, "Visual and Statistical Thinking: Displays of Evidence for Making Decisions," in Visual Explanations, Graphics Press, 1997.

Unit III - Module 6

Acknowledgments
- ICEAA is indebted to TASC, Inc., for the development and maintenance of the Cost Estimating Body of Knowledge (CEBoK).
- ICEAA is also indebted to Technomics, Inc., for the independent review and maintenance of CEBoK.
- ICEAA is also indebted to the following individuals who have made significant contributions to the development, review, and maintenance of CostPROF and CEBoK.

Module 6 Basic Data Analysis Principles
- Lead authors: Megan E. Dameron, Bethia L. Cullis, Maureen L. Tedford
- Senior reviewers: Richard L. Coleman, Jessica R. Summerville, John S. Smuck, Fred K. Blackburn
- Reviewers: Samuel B. Toas, Kevin Cincotta, Matthew J. Pitlyk, Brian A. Welsh
- Managing editor: Peter J. Braxton

Unit Index
- Unit I Cost Estimating
- Unit II Cost Analysis Techniques
- Unit III Analytical Methods
  6. Basic Data Analysis Principles
  7. Learning Curve Analysis
  8. Regression Analysis
  9. Cost and Schedule Risk Analysis
  10. Probability and Statistics
- Unit IV Specialized Costing
- Unit V Management Applications

Data Analysis Overview
- Key Ideas
  - Visual Display of Information
  - Central Tendency of Data
  - Dispersion (Spread) of Data
  - Data accumulation
  - Outliers
- Analytical Constructs
  - Descriptive statistics: mean, median, mode; variance, standard deviation, CV
  - Functional forms
- Practical Applications
  - Making sense of your data
- Related Topics
  - Parametrics
  - Distributions: Normal, Chi-square, t, F
  - Probability and Statistics

Data Analysis Within the Cost Estimating Framework
- Past: Understanding your historical data
- Present: Developing estimating tools
- Future: Estimating the new system
[Figure: three "Monthly Gas Bill" histograms (Frequency vs. $). Panel 1: Historical data. Panel 2: Average cost, Mean = $34.19. Panel 3: Confidence Intervals, Confidence Interval = +/-$.76]

Data Analysis Outline
- Core Knowledge
  - Types of Data
  - Univariate Data Analysis
  - Scatter Plots: Variables, Axes and Function Types
  - Data Validation: Descriptive Statistics, Outliers, Rules of Thumb
  - Two Cautionary Tales
- Summary
- Resources
- Related and Advanced Topics

Types of Data
- Univariate
- Bivariate
- Multivariate
- Time Series

Types of Data
- Univariate
  - Single variable
  - Use descriptive and inferential statistics
- Bivariate
  - One independent variable and one dependent variable (i.e., y is a function of x)
  - Use descriptive and inferential statistics
- Multivariate
  - Several independent variables and one dependent variable (i.e., y is a function of \(x_1\), \(x_2\), and \(x_3\))
  - Use descriptive and inferential statistics
[Figure: "Monthly Gas Bill" histogram (Frequency vs. $) illustrating univariate data; "Linear Trend" scatter plot (Cost vs. Weight) illustrating bivariate data]
Tip: Univariate data plus a nominal variable is really bivariate.

Types of Data - Time Series
- Time is the independent variable.
- Interval matters! In Excel, use an XY (Scatter) chart and not a Line Chart unless the intervals are equally spaced.
- Smooth trends are rarely found in time series.
  - Possible rare exceptions (e.g., corrosion over time)
  - Standard trends such as investment and inflation
- Look for paradigm shifts, cycles, and autocorrelation.
  - Use moving averages, or divide the data into groups and compare descriptive statistics.
  - Regression is often not useful because it only picks up smooth trends, unless an AR(1)/ARIMA model is used.
  - ANOVA and mean comparisons are more useful.
[Figure: time series plot]

Univariate Data Analysis
- Visual Display of Information: What does it look like?
  - Histogram, stem-and-leaf, box plot
- Measures of Central Tendency: What's your best guess?
  - Mean (or median or mode)
- Measures of Variability: How much remains unexplained?
  - Standard deviation (or variance), coefficient of variation (CV)
- Measures of Uncertainty: How precise are you?
  - Confidence Interval (CI)
- Statistical Tests: How can you be sure?
  - t test, chi square test, Kolmogorov-Smirnov (K-S) test
Tip: This analysis framework is mirrored in bivariate and multivariate analysis.
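
The moving-average advice for time series can be illustrated with a short sketch. This is a minimal example, not part of the original module: the function name `moving_average` and the sample series are hypothetical, and a simple (unweighted) trailing window is assumed.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points.

    Each output value is the plain average of `window` consecutive
    observations, so the result has len(values) - window + 1 entries.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical equally spaced monthly series, invented for illustration.
series = [10, 12, 11, 15, 14, 18, 17]
smoothed = moving_average(series, 3)
print(smoothed)
```

A wider window smooths more but hides short cycles, so it may help to try several widths before comparing groups.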

Visual Display - Histograms
- Histograms should be used to give an idea of the distribution of the data.
- In Excel: Data Analysis Add-In, Histogram.
[Figure: "Monthly Gas Bill" histogram (Frequency vs. $). Skew-right distribution, possibly Exponential, Triangular, or Lognormal]
Warning: Results of macros do not update if your data change!
Tip: Create the histogram manually using Chart type Column so that results do update when data change!

Histograms - Bins
- It is important to carefully consider the number of bins used in a histogram.
- Experiment with intervals to be sure you understand the data.
[Figure: two "Monthly Gas Bill" histograms (Frequency vs. $). One allows Excel to choose the bins; the other specifies the bins]
- Which is clearer? Which sets a trap?
Warning: Default bins in Excel histograms may not be optimal!
Warning: Histograms can be manipulated!
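
To experiment with bin choices outside Excel, the counting step behind a histogram can be sketched as follows. This is an illustrative sketch only: `histogram_counts` and the sample bills are hypothetical, and the bins are assumed to follow the Excel-style convention of upper bounds plus a final "More" bin.

```python
from bisect import bisect_left


def histogram_counts(data, upper_bounds):
    """Count values per bin, where each bin is defined by its upper bound.

    A value x falls in the first bin whose upper bound is >= x.
    Values above the last bound are counted in a final "More" bin.
    `upper_bounds` must be sorted in increasing order.
    """
    counts = [0] * (len(upper_bounds) + 1)  # last slot is the "More" bin
    for x in data:
        counts[bisect_left(upper_bounds, x)] += 1
    return counts


# Hypothetical monthly bills in dollars, invented for illustration.
bills = [12.5, 18.0, 22.3, 25.0, 31.7, 47.2, 58.9, 95.4]
counts = histogram_counts(bills, [20, 40, 60, 80])
print(counts)
```

Re-running the function with different `upper_bounds` is a quick way to see how much the apparent shape depends on the bins.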

Central Tendency - Mean
- The sample mean of the data set \(\{x_1, x_2, \ldots, x_n\}\) is calculated as:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
- In Excel, use the AVERAGE( ) function.
- Means of example data sets:
  - Gas bill (74 months), $6.
  - Therms used (74 months), 14.
- The mean is the Expected Value of a random variable.

Central Tendency - Median
- The sample median is the middle data point, with 50% of the remaining observations falling under that point and 50% above.
- If a data set has an odd number of points, the middle value is the median.
  - The median of the data set {,,7,9,} is 7.
  - AKA the 50th percentile.
- If a data set has an even number of points, the two middle values are averaged.
  - The median of the set {3, 6, 8, 11, 13, 3} is 9.5 (the average of 8 and 11).
- In general, the k-th percentile is the point with k% of the data below and (100-k)% of the data above.
  - Quartiles (25, 50, 75), deciles (10, 20, ..., 90), icosatiles (5, 10, 15, ..., 95)
- When there are extreme data points, the median may be more representative than the mean, because the median is robust: outliers impact the mean more than the median.
  - "Representative" is a descriptive term, not a mathematical term.
  - There are many mathematical reasons to prefer the mean over the median.

Mean, Median, and Skew
- The mean and the median are equal if the distribution is symmetric.
- Unequal means and medians are an indication of skewness.
  - Lognormal Distribution: Median < Mean, Skew(ed) Right
  - Normal Distribution: Median = Mean, Symmetric
  - Beta Distribution: Median > Mean, Skew(ed) Left
[Figure: density plots of the Lognormal, Normal, and Beta distributions]
http://en.wikipedia.org/wiki/Skewness

Central Tendency - Mode
- The sample mode is the most frequent point to occur in a data set.
- The mode of a distribution is its peak: the value with the greatest probability mass (or density).
  - The mode of the set {,4,4,7,9,9,9} is 9.
- The mode is a descriptive metric answering the question "what happens most frequently?"
  - It can help give a visual idea of what the distribution looks like.
  - Most useful in discrete data.
[Figure: histogram (Frequency vs. X Value). The value 9 occurs most often; this is the mode]
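
The three measures of central tendency above can be computed outside Excel with Python's standard `statistics` module. The data set below is hypothetical (it is not one of the module's example sets) and was chosen so that one extreme point pulls the mean well above the median, as in a skew-right distribution.

```python
import statistics

# Hypothetical data with one extreme point (45), invented for illustration.
data = [4, 7, 7, 9, 12, 45]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # average of the two middle values (n is even)
mode = statistics.mode(data)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
# A median below the mean is consistent with a skew-right data set.
print("median < mean:", median < mean)
```

Dropping the extreme point and recomputing shows how much more the mean moves than the median.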

Variability - Variance / Standard Deviation
- The sample variance measures the deviation of the data points from their mean.
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \quad \text{(easy to remember)} \]
\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1} \quad \text{(easy to calculate)} \]
  - In Excel, use the VAR( ) function.
- The sample standard deviation is simply
\[ s = \sqrt{s^2} \]
  - The standard deviation is expressed in the same units as the original data.
  - In Excel, use the STDEV( ) function.
Tip: Low variance indicates less dispersion, i.e., tighter data.
Tip: s is the estimator for the population parameter σ.

Variability - Coefficient of Variation
- The Coefficient of Variation (CV) expresses the standard deviation as a percent of the mean.
\[ CV = \frac{s}{\bar{X}} \]
- Large CVs indicate that the mean is a poor estimator.
  - Consider regression on cost drivers.
  - Examine data for multiple populations (outliers).
- CVs of example data sets:
  - Gas bill, 74.4% (69.%)
  - Therms used, 4.% (.%)
Tip: Low CV indicates less dispersion, i.e., tighter data. 15% or less is desired.
Note that sums and averages tend to have smaller variances.
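
As a check on the formulas above, the sketch below computes the sample variance both ways (the definitional form and the computational shortcut), then the standard deviation and CV. The data values are hypothetical, and the variable names are illustrative only.

```python
import math
import statistics

# Hypothetical data, invented for illustration.
data = [10, 12, 14, 16, 18]
n = len(data)
mean = statistics.mean(data)

# "Easy to remember": average squared deviation, with n - 1 in the denominator.
var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# "Easy to calculate": sum of squares minus n times the squared mean.
var_shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

s = math.sqrt(var_definition)  # sample standard deviation
cv = s / mean                  # coefficient of variation (as a fraction)

print(f"variance={var_definition}, s={s:.4f}, CV={cv:.1%}")
```

Both variance forms agree with `statistics.variance(data)`, which also uses the n - 1 denominator.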

Dispersion and CV
- These two data sets have the same mean, but different standard deviations.
  - Higher CV: This data has a higher CV (3%) and has more dispersion.
  - Lower CV: This data has a lower CV (17%) and is more tightly distributed.
[Figure: two histograms (Frequency vs. Bin), labeled Higher CV and Lower CV]

Confidence Interval Illustration
- A confidence interval (CI) suggests to us that we are (1-α)*100% confident that the true parameter value is contained within the calculated range*:
\[ \left( \bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \right) \]
[Figure: t distribution with central area 1 - α, tail areas α/2 on each side, and the critical values marked]
* Note: this statement provides a general sense of what a confidence interval does for us in concise language, for ease of understanding. The specific statistical interpretation is that if many independent samples are taken where the levels of the predictor variable are the same as in the data set, and a (1-α)*100% confidence interval is constructed for each sample, then (1-α)*100% of the intervals will contain the true value of the parameter.

Sample Sizes - Sufficiently Large
- In general, we prefer n to be large; how large is a function of our tolerance for error.
- The 68.3% CI for the mean is roughly \(\pm CV/\sqrt{n}\).
- So, for CVs ranging around 30%, we get the following 68.3% Confidence Interval with n:

  n     +/-
  4     15%
  9     10%
  16    8%
  25    6%
  36    5%

- If we would like to be able to make judgments within about 5% points with a CV of 30%, we need n ≥ 36.
- We may have no choice but to deal with small n.
- In any case, we can calculate the range of the estimated mean.
Tip: 30 is not a magic number of data points.

Prediction Intervals
- The previous confidence interval illustration gives the true average cost within a certain range.
- If we want to know the predicted cost of a new item within a certain range, we need a prediction interval (PI).
- The PI suggests to us that we are (1-α)*100% confident that the next observation will be contained within the calculated range:
\[ \left( \bar{x} - t_{\alpha/2,\,n-1}\, s\sqrt{1 + \frac{1}{n}},\ \ \bar{x} + t_{\alpha/2,\,n-1}\, s\sqrt{1 + \frac{1}{n}} \right) \]
- The larger standard error in the PI accounts for both the uncertainty in the mean (captured by the CI) and the uncertainty in individual observations.
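
The difference between the confidence interval and the prediction interval can be seen by computing both from the same sample. In this sketch the function name `mean_intervals` and the data are hypothetical, and the t critical value is assumed to be supplied by the caller (for example from a t table or Excel), since the Python standard library has no t distribution.

```python
import math
import statistics


def mean_intervals(data, t_crit):
    """Return (CI, PI) as (low, high) tuples for the given t critical value.

    CI half-width: t * s / sqrt(n)       (uncertainty in the mean)
    PI half-width: t * s * sqrt(1 + 1/n) (uncertainty in the next observation)
    """
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    ci_half = t_crit * s / math.sqrt(n)
    pi_half = t_crit * s * math.sqrt(1 + 1 / n)
    return (xbar - ci_half, xbar + ci_half), (xbar - pi_half, xbar + pi_half)


# Hypothetical sample; 2.776 is the two-tailed 95% t value for 4 degrees of freedom.
data = [10, 12, 14, 16, 18]
ci, pi = mean_intervals(data, t_crit=2.776)
print("CI:", ci)
print("PI:", pi)
```

The PI is always wider than the CI for the same data and confidence level, because it adds the spread of individual observations to the uncertainty in the mean.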

Statistical Tests
- t test for mean
  - Is the Cost Growth Factor (CGF) for NAVAIR programs different than 1.0?
- Chi square test for variance
  - Is 30% a reasonable CV to use for this variable?
  - Should the t test for equal means assume equal variances?
- Chi square test for distribution
  - Are Line-Replaceable Unit (LRU) failures uniform across all deployed units?
- Kolmogorov-Smirnov test for distribution
  - Is the normal distribution appropriate for modeling uncertainty in design weight?

Scatter Plots
- Variables
- Axes
- Function Types

Scatter Plots
- A picture is worth a thousand words!
- A scatter plot can reveal a wealth of information about relationships present in the data.
- Create scatter plots in Excel by using the Chart Wizard: XY (Scatter).
- Add a trend line in Excel by right-clicking the plotted data and choosing Add Trendline.
  - Helps link graph and equation.
  - Look at inferential statistics later.
[Figure: scatter plot of Light Ship Displacement vs. Year with a linear trend line, its equation, and R²]
Tip: Scatter plots are the single most useful tool in all of analysis; they are the gift of sight to the analyst.

Scatter Plots - Variables
- Plot cost (or another variable of interest, e.g., hours) as the dependent variable.
- Look at a variety of different independent variables:
  - Technical parameters such as weight, lines of code, etc.
  - Performance parameters such as speed, accuracy, etc.
  - Operational parameters such as crew size, flying hours, etc.
  - Cost of another element
- Think about which variables you believe should drive cost, and collect that data!

Scatter Plots - Cost Drivers
- Scatter plots can help identify cost drivers.
- R² interpretation: % of variation in y explained (linearly) by variation in x.
[Figure: three scatter plots of Cost with linear trend lines and R² values. Variable 1: significant correlation, potential cost driver. Variable 2: weak correlation. Variable 3: uncorrelated]
Warning: R² is just an indicator; consult t and F statistics!

Scatter Plots - Unit Space
- Data should first be plotted in unit space*.
  - x is plotted on the horizontal axis (x-axis) and y is plotted on the vertical axis (y-axis).
- If the data have a non-linear relationship when plotted in unit space, investigate how the data can be made linear.
  - Non-linear relationships can often be transformed to appear linear through the use of natural logs.
  - Transformed data can then be regressed linearly.
  - Before the widespread use of computers, nonlinear data was graphed on semi-log or log-log paper.
* Unit space refers to the original, untransformed data.
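
To make the R² interpretation concrete, here is a minimal ordinary-least-squares sketch in unit space. The function `linear_fit` and the data points are hypothetical; it is an illustration of what a trend line and its R² represent, not the module's own tool.

```python
def linear_fit(x, y):
    """Fit y = slope * x + intercept by least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R^2 is the squared correlation of x and y.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical candidate cost driver (x) and cost (y), invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = linear_fit(x, y)
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.4f}")
```

A high R² here only says the points lie close to a line; as the warning above notes, significance still needs the t and F statistics.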

Scatter Plots - Linear Function
- The most common relationships are linear.
  - Of the form y = mx + b [m = slope, b = y-intercept]
  - Plotted in unit space
[Figure: "Linear Trend" scatter plot of Cost vs. Weight with a linear trend line, its equation, and R²]
Tip: Linear models are also the best approximations to non-linear models, by which we mean they take you least far afield if you guessed wrong.

Scatter Plots - Power Function
- Power functions are of the form \(y = ax^b\).
- They can be transformed into linear functions.
  - Taking the natural log of both sides gives \(\ln(y) = \ln(a) + b\ln(x)\).
  - Plot \(\ln(x)\) on the horizontal axis and \(\ln(y)\) on the vertical axis and look for a linear trend.
- This transformation is shown graphically on the next slide.
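
The log-log transformation for a power function can be sketched in a few lines. The helper `fit_power` and the weight/cost values are hypothetical; the data were generated from \(y = 3x^{0.5}\) so that the recovered coefficients are easy to verify. Note this fits the line in log space, which is an assumption of the sketch and is not the same as minimizing error in unit space.

```python
import math


def fit_power(x, y):
    """Fit y = a * x**b by least squares on ln(y) = ln(a) + b * ln(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_lx = sum(lx) / n
    mean_ly = sum(ly) / n
    sxx = sum((u - mean_lx) ** 2 for u in lx)
    sxy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly))
    b = sxy / sxx                           # slope in log-log space = exponent
    a = math.exp(mean_ly - b * mean_lx)     # intercept in log-log space = ln(a)
    return a, b


# Hypothetical data generated from y = 3 * x**0.5, for illustration only.
weights = [1, 2, 4, 8]
costs = [3 * w ** 0.5 for w in weights]
a, b = fit_power(weights, costs)
print(f"Cost = {a:.3f} * Weight^{b:.3f}")
```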

Scatter Plots - Power Function
- This function is most commonly used for learning curves, but can also be used for CERs.
[Figure: "Power Trend" scatter plot of Cost vs. Weight in unit space, with fitted power equation and R²]
[Figure: "Power Trend on Log-Log Axes" scatter plot of ln(Cost) vs. ln(Weight), with fitted linear equation and R²]
- The slope on the log-log graph is the exponent of the power equation.
- An alternative is to use Format Axis: Logarithmic scale.
Tip: Another virtue of trend lines is that they can act as a Rosetta Stone for the values of a curve fit on transformed variables.

Scatter Plots - Exponential Function
- Exponential functions are of the form \(y = ae^{bx} = a(e^b)^x = ak^x\).
- Models of this form can be transformed and made to be linear.
  - Taking the natural log (ln) of both sides gives \(\ln(y) = \ln(a) + bx\).
  - The natural log (ln) is the inverse function of the exponential: \(y = e^x \iff x = \ln(y)\).
Tip: Exponential functions are seldom encountered in cost estimation outside of inflation.
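
The semi-log transformation for an exponential function follows the same pattern as the power case, except that only y is logged. In this sketch `fit_exponential` and the data are hypothetical; the points were generated from \(y = 2e^{0.3x}\) so the fit can be checked by eye.

```python
import math


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by least squares on ln(y) = ln(a) + b * x."""
    ly = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_ly) for u, v in zip(x, ly))
    b = sxy / sxx                          # slope on semi-log axes = coefficient of x
    a = math.exp(mean_ly - b * mean_x)     # intercept on semi-log axes = ln(a)
    return a, b


# Hypothetical data generated from y = 2 * exp(0.3 * x), for illustration only.
xs = [0, 1, 2, 3]
ys = [2 * math.exp(0.3 * v) for v in xs]
a, b = fit_exponential(xs, ys)
print(f"y = {a:.3f} * exp({b:.3f} * x)")
```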

Scatter Plots - Exponential Function
- Then, x is plotted on the horizontal axis and ln(y) is plotted on the vertical axis.
- This transformation is shown graphically below.
[Figure: "Exponential Trend" scatter plot of Cost vs. Weight in unit space, with fitted exponential equation and R²]
[Figure: "Exponential Trend on Semi-Log Axes" scatter plot of ln(Cost) vs. Weight, with fitted linear equation and R²]
- The slope on the semi-log graph is the coefficient of x in the exponential equation.

Scatter Plots - Constant Terms
- Generalized power and exponential equations are of the form \(y = ax^b + c\) and \(y = ae^{bx} + c\).
- Power and exponential models usually assume a constant term of c = 0.
  - However, c = 0 is more common in theory than in practice.
- If c = 0 does not fit the data, consider using a model with c ≠ 0.
  - Use the Excel Add-in Solver (or another, more robust optimization tool) to fit a curve to the data, where a, b, and c are chosen simultaneously (GERM).
  - Minimize SSE or maximize unit-space R².
Warning: Excel forces power and exponential trendlines to have c = 0!
Reference: "To b or Not to b: The y-intercept in Cost Estimation," R. L. Coleman, J. R. Summerville, P. J. Braxton, B. L. Cullis, E. R. Druker, SCEA, 2007.

Data Validation
- Scatter plotting gives you an idea of the relationships present in the data.
- What's next?
  - Look at descriptive statistics.
  - Look for outliers.
  - Compare to historical studies, industry standards, or rules of thumb.

Descriptive Statistics
- Calculate descriptive statistics for each data group:
  - Sample size
  - Raw mean
  - Standard deviation
  - Coefficient of variation (CV)
  - Weighted averages (e.g., dollar-weighted)
  - Moving averages (for time series data)
- In Excel, Tools, Data Analysis, Descriptive Statistics will easily calculate the most important descriptive statistics.
Warning: Results of macros do not update if your data change!
Tip: Create formulae manually so that results do update when data change!

Descriptive Statistics - Bar Charts
- Bar charts can be used to compare the descriptive statistics for different groups.
- Y-error bars can be added to show the standard deviation.
[Figure: bar chart "RDT&E Programs by Company (SAR Programs with EMD only)", $ Wtd DE CGF for Co. 1 through Co. 5, with Y-error bars and sample sizes]
Tip: Standard deviations are useful, but prediction intervals would be better, capturing the interaction of quantity and dispersion more succinctly and in an inferentially better way. Be sure to label which they are.

Bar Charts in Excel
- Bar charts: Excel Chart Wizard, Column Chart
- Y-error bars:
  - Format Data Series, Y-error bars (Excel 2003)
  - Chart Tools, Layout, Analysis, Error Bars (Excel 2007)
- Histogram: Excel Data Analysis Add-In, Histogram
Tip: It is recommended that you create your own dynamic histograms with flexible bin spacing using COUNTIF() and Column Charts.

Outliers
- Outliers are data points that fall far away from the center of the data and are not representative of the population you are trying to model.
- For normally distributed data sets, about 95.45% of the data should fall within two standard deviations of the mean.
  - So, we'd expect 4.55% to be outside two standard deviations.
- 99.7% of the data should be within three standard deviations of the mean.
  - If a data point is more than three standard deviations from the mean, it is a potential outlier.
Tip: The normal distribution is a good first approximation, but if your data are significantly skewed, these rules of thumb should not be used to identify potential outliers.

Outliers and Trend Lines
- Outliers may bias the regression line.
[Figure: "Cost vs. Weight" scatter plot with two linear trend lines, one for all data and one with the potential outlier removed. The possible outlier is 4.4 standard deviations from the mean]
- Without the possible outlier, the slope of the regression line is steeper and the R² is higher.
Tip: If using two graphs, do not change the scale of the axes when comparing!
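
The normal-distribution percentages and the three-standard-deviation screen above can be reproduced with the standard library. The names `share_within` and `flag_potential_outliers` and the data are hypothetical, and the sketch assumes roughly normal data, as the tip above cautions.

```python
import statistics
from statistics import NormalDist


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


def flag_potential_outliers(data, k=3.0):
    """Return the points lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * s]


print(f"within 2 sd: {share_within(2):.2%}, within 3 sd: {share_within(3):.2%}")

# Hypothetical data with one extreme value, invented for illustration.
data = [50, 52, 49, 51, 48, 50, 53, 47, 51, 49, 50, 52, 48, 51, 50, 120]
flagged = flag_potential_outliers(data)
print("potential outliers:", flagged)
```

A flagged point is only a candidate for investigation. Also note that in very small samples no point can be three sample standard deviations from the mean, because the extreme point itself inflates s.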

Removing Outliers
- Do not remove an outlier from the data without a good reason!
  - Doing so removes some of the variation present in history.
  - Doing so can be a form of "cooking" the data.
- Good reasons for removing an outlier:
  - Program was restructured or divided.
  - "One of these is not like the others," e.g., a helo in a set of missile data.
- Bad reasons for removing an outlier:
  - Too high
  - Standard deviations away from the mean [!]
Tip: Outlier treatment separates the analysts from the spin meisters.

Rules of Thumb
- Compare your descriptive statistics to historical rules of thumb.
  - NCCA Standard Factors handbook, for example
- Sanity check!
Tip: Comparison to history and cross checks separates the thorough from the sloppy.

Two Cautionary Tales
- Expert's Eyeball: Descriptive Statistics and Visual Displays
- Technical Hunch: Outliers

Engineering Judgments
- Suppose we are given an estimate that has "engineering judgment" as its basis.
- Engineering judgments should never be accepted without validation!
- The analyst must find out if the guess is correct, or at least in the ballpark.
- Experts often possess insight or intuition regarding systems that bears on cost, but it is the analyst's job to make the estimate explicit and reproducible.

Example: Expert's Eyeball
Follow Ship Support (Percent of First Ship)

  Hull    FF       DDG 37    Average
  2       7.1%     4.%       .9%
  3       9.4%     .1%       7.%
  4       9.%      4.3%      6.7%
  5       -        -         4.%

- Is the average a good idea?
- Is the 5th ship guess right?

Example: Expert's Eyeball
- The average is a good number!
[Figure: "Decrease in Follow-Ship Support", Percent of First Ship vs. Hull, for FF, DDG 37, and Average]
- When the average line is extrapolated, it looks like the 5th ship should be about 6%.
- The 5th ship guess of 4% looks too low!
Tip: Graphic ensures cost estimate credibility!

Example: Technical Hunch
- In this real-life example, we will look at the importance of correctly investigating outliers.
- Scatter plots can be extremely useful in identifying potential outliers.

Example: Technical Hunch
Shakedown

  Hull      Hours/Ton
  DD 963    .9
  DD 9      .4
  DD 93     .3
  DD 96     .1

- "DD 963 is too low for a first ship."

Wrong Outlier Rejected!
- Instead of DD 963, look into DD 9. That's the potential outlier!
[Figure: scatter plot of Hours per Ton vs. Hull, comparing the expert's curve with an alternative trend line]
- This line produces a more reasonable th ship estimate.
- The expert's curve is unrealistic at the th ship!

Data Analysis Summary
Steps of basic data analysis:
1. Scatter plot: visual depiction of the relationships in the data.
2. Descriptive statistics: calculate the means and CVs.
   - If the CV is under 15%, the average may be a sufficient predictor; focus more attention on elements with higher CVs.
   - If the CV is over 15%, focus on this element using regression analysis to look for a better predictor than the average (CER development).
3. Look for outliers (data quality check).
4. Compare to history.

Resources
- An Introduction to Mathematical Statistics and Its Applications, 3rd ed., Richard J. Larsen and Morris L. Marx, Prentice Hall
- Probability and Statistics for Engineering and the Sciences, 5th ed., Jay L. Devore, Brooks/Cole Publishing, 1999
- Calculus: Single Variable, Deborah Hughes-Hallett and Andrew Gleason, John Wiley & Sons, 199
- How to Lie with Statistics, Darrell Huff, W.W. Norton & Company, 1954
- The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press, 1983
- Envisioning Information, Edward R. Tufte, Graphics Press, 1990
- Visual Explanations, Edward R. Tufte, Graphics Press, 1997
- Beautiful Evidence, Edward R. Tufte, Graphics Press, 2006

Related and Advanced Topics
- Visual Display of Information
- Additional Graph Types for Univariate
  - Stem-and-Leaf
  - Boxplots
- Bin Width and Number Rules
- Mean - Mental Math Trick
- Sample Sizes
  - Confidence Intervals
  - CI Simplified
  - Sufficiently Large
  - Rules of Thumb
- Outlier Identification Rules

Visual Display of Information
- Poor visual displays of information hinder understanding.
- Excel's default scatter plot is not a one-size-fits-all information display.
- Quick fixes ensure a graph can truly give the gift of sight:
  - Use evocative colors to your advantage.
  - Size matters: make sure the graph fills the space; the data is the main event!
  - Check the scale.
  - Choose a font size.
  - Check the placement of the legend.
- Two possible displays follow.

Visual Display of Information - Excel Default
[Figure: "Visual Display Example", Excel default scatter plot of Hours vs. Unit for Series1 through Series4]

Visual Display of Information - Another Display
[Figure: "Visual Display Example", reformatted scatter plot of Hours (Thousands) vs. Unit for Series1 through Series4]

Stem-and-Leaf Plots
- Similar to a histogram
  - Horizontal numbers instead of vertical bars
- Example: Therms of natural gas used
[Figure: stem-and-leaf plot of therms of natural gas used; Mode = 4 therms]

Box Plots
[Figure: annotated box plot with the following labels]
- Lower Fourth, Median, Upper Fourth
- \(f_s\) = Upper Fourth - Lower Fourth = Interquartile Distance
- Lower whisker: lowest data point within 1.5 \(f_s\) of the Lower Fourth
- Upper whisker: highest data point within 1.5 \(f_s\) of the Upper Fourth
- Data point between 1.5 \(f_s\) and 3 \(f_s\) from the Upper Fourth
- Potential Outlier: data point more than 3 \(f_s\) from the Upper Fourth

Box Plots - Application
- Box plots can be used to:
  - Show the center, spread, and symmetry of the data
  - Identify outliers
- A sample box plot is shown on the previous slide, and a real-world one below:
[Figure: box plot of Monthly Gas Bill ($); Median Bill = $16.]
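
The box-plot fences described above can be computed directly. In this sketch the function `box_plot_outliers` and the data are hypothetical, and the fourths are approximated by quartiles from `statistics.quantiles` with the "inclusive" method; other quartile conventions give slightly different fences.

```python
import statistics


def box_plot_outliers(data):
    """Classify points by distance beyond the fourths, in units of the fourth spread f_s.

    Points more than 1.5 f_s but at most 3 f_s beyond a fourth are "mild";
    points more than 3 f_s beyond a fourth are "extreme" (potential outliers).
    """
    lower, _median, upper = statistics.quantiles(data, n=4, method="inclusive")
    fs = upper - lower  # fourth spread (interquartile distance)
    mild, extreme = [], []
    for x in sorted(data):
        distance = max(lower - x, x - upper)  # positive only outside the box
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return {"lower_fourth": lower, "upper_fourth": upper, "fs": fs,
            "mild": mild, "extreme": extreme}


# Hypothetical data, invented for illustration.
result = box_plot_outliers([2, 3, 4, 5, 6, 7, 8, 9, 16, 30])
print(result)
```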
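
Bin Widths and Number Rules
Various rules for Bin Width (h) or Number of Bins (k), based on Number of Data Points (n), Sample Standard Deviation (s), and Interquartile Range (IQR):

- Square Root Rule
  - Bin width: \( h = \dfrac{\max x_i - \min x_i}{k} \)
  - Number of bins: \( k = \sqrt{n} \)
  - Comments: Used in the Excel Data Analysis Histogram tool
- Sturges' Rule
  - Bin width: \( h = \dfrac{\max x_i - \min x_i}{k} \)
  - Number of bins: \( k = \log_2 n + 1 \)
  - Assumptions: Normal distribution
  - Comments: Used by DAU
- Scott's Rule
  - Bin width: \( h = \dfrac{3.5\,s}{\sqrt[3]{n}} \)
  - Number of bins: \( k = \dfrac{\max x_i - \min x_i}{h} \)
  - Assumptions: Normal distribution
  - Comments: Reasonable default if data not too skewed
- Freedman-Diaconis Rule
  - Bin width: \( h = \dfrac{2\,IQR}{\sqrt[3]{n}} \)
  - Number of bins: \( k = \dfrac{\max x_i - \min x_i}{h} \)
  - Comments: Modifies Scott's Rule by focusing on IQR instead of s

Mean - Mental Math Trick
- The mean can also be computed as an arbitrary number \(X^*\) plus the average of the deviations from that number:
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\sum_{i=1}^{n} \left( X^* + X_i - X^* \right) = X^* + \frac{1}{n}\sum_{i=1}^{n} \left( X_i - X^* \right) \]
- Monthly average therms used data: {37, 26, 13, 3, 3, 3, 3, 3, 4, 7, 21, 40}
- Average = 10 + (27+16+3-7-7-7-7-7-6-3+11+30)/12 = 10 + 43/12 = 13.6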
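
The bin rules tabulated above can be compared side by side with a few helper functions. The function names are hypothetical, and rounding the bin count up to a whole number is an assumption of this sketch (the rules themselves give a real-valued k).

```python
import math


def square_root_bins(n):
    """Square Root Rule: k = sqrt(n), rounded up."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n):
    """Sturges' Rule: k = log2(n) + 1, rounded up."""
    return math.ceil(math.log2(n) + 1)


def scott_width(s, n):
    """Scott's Rule: h = 3.5 * s / n^(1/3)."""
    return 3.5 * s / n ** (1 / 3)


def freedman_diaconis_width(iqr, n):
    """Freedman-Diaconis Rule: h = 2 * IQR / n^(1/3)."""
    return 2 * iqr / n ** (1 / 3)


def bins_from_width(low, high, h):
    """Convert a bin width into a bin count over the data range, rounded up."""
    return math.ceil((high - low) / h)


n = 74  # e.g., 74 monthly observations
print("square root:", square_root_bins(n), "Sturges:", sturges_bins(n))
```

Because the rules can disagree noticeably, it is worth plotting the histogram under more than one of them before settling on a display.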

Sample Sizes - Confidence Interval
- How big a sample size do we need so that a 68.3% Confidence Interval (one standard deviation) about the estimate is +/-5% of the estimate?
  - i.e., there is 68.3% probability that the population mean is within 5% of our estimated mean.
- Consider the confidence interval for the mean of a normal distribution:
\[ \left( \bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \right) \]
- Note that the size of the range around the estimate of the mean is a function of:
  - the variability, captured by the standard deviation, s, or the coefficient of variation, CV
  - the sample size, n
Note: we are assuming a normal distribution for simplicity.

Sample Sizes - CI Simplified
- Instead of working with standard deviations, we would like to shift to CVs.
  - CVs are unit-less and more intuitive (expressed in percents).
- So, divide the range by \(\bar{x}\):
\[ \left( \frac{\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}}{\bar{x}},\ \ \frac{\bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}}{\bar{x}} \right) = \left( 1 - t_{\alpha/2,\,n-1}\frac{CV}{\sqrt{n}},\ \ 1 + t_{\alpha/2,\,n-1}\frac{CV}{\sqrt{n}} \right), \qquad CV = \frac{s}{\bar{x}} \]
- This shifts the range into percents. The range is relative to 100% of the estimate:
\[ \pm\, t_{\alpha/2,\,n-1}\frac{CV}{\sqrt{n}} \]

Sample Sizes - Sufficiently Large
- What sample size is needed for judgments within 5%?
- For a 68.3% two-tailed CI, we have α = 1 - .683 = 31.7% and thus α/2 = 15.85%.
[Figure: distribution with central area 68.3% and tail areas 15.85% on each side]
- Suppose we have a CV of 30%. The half-width is \( t_{\alpha/2,\,n-1}\, CV/\sqrt{n} \):

  n     CV     t           +/-
  4     30%    1.14167     17%
  9     30%    1.7         11%
  16    30%    1.34        8%
  25    30%    1.44        6%
  36    30%    1.143       5%

- We would like to be able to make judgments within about 5% points, so with a CV of 30%, we need n ≥ 36.
Note: for a 95% CI we would use α = .05. The t multipliers would vary from 2.78 to 2.03.

Sample Sizes - Rule of Thumb
- For an easy rule of thumb, we can just round the t value to t = 1.
- Then, we use simply \( CV/\sqrt{n} \):

  n     CV     t           Exact +/-    Thumb rule
  4     30%    1.14167     17%          15%
  9     30%    1.7         11%          10%
  16    30%    1.34        8%           8%
  25    30%    1.44        6%           6%
  36    30%    1.143       5%           5%

Tip: For a 68.3% CI, use \( CV/\sqrt{n} \). For a 95% CI, use \( 2\,CV/\sqrt{n} \).
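
The rule of thumb above is easy to tabulate and to invert for a required sample size. In this sketch the function names are hypothetical, CVs and targets are given in percent, and the multiplier `t` defaults to 1 (the 68.3% rule of thumb), with 2 as the rough 95% counterpart from the tip.

```python
import math


def ci_half_width(cv_pct, n, t=1.0):
    """Approximate CI half-width, in percent of the estimate: t * CV / sqrt(n)."""
    return t * cv_pct / math.sqrt(n)


def required_sample_size(cv_pct, target_pct, t=1.0):
    """Smallest n with t * CV / sqrt(n) <= target, i.e. n >= (t * CV / target)^2."""
    # round() guards against floating-point noise before rounding up.
    return math.ceil(round((t * cv_pct / target_pct) ** 2, 9))


for n in (4, 9, 16, 25, 36):
    print(n, f"{ci_half_width(30, n):.1f}%")

print("n for +/-5% at CV 30% (68.3% CI):", required_sample_size(30, 5))
print("n for +/-5% at CV 30% (95% CI):  ", required_sample_size(30, 5, t=2))
```

Because the half-width shrinks with the square root of n, halving the error takes four times the data.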
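
Outlier Identification Rules

- Chauvenet's Criterion
  - Outlier(s) iff: the expected number of points at least as far from the mean as x is less than one half, i.e., \( n \cdot P\!\left( |Z| \ge \dfrac{|x - \bar{x}|}{s} \right) < \dfrac{1}{2} \)
  - Rationale: Normal distribution properties
- Grubbs' Test
  - Outlier(s) iff: \( G > \dfrac{n-1}{\sqrt{n}} \sqrt{ \dfrac{ t^2_{\alpha/(2n),\,n-2} }{ n - 2 + t^2_{\alpha/(2n),\,n-2} } } \), where \( G = \dfrac{\max_i |x_i - \bar{x}|}{s} \)
  - Rationale: Normal distribution properties
- Dixon's Q Test
  - Outlier(s) iff: Gap/Range > (critical value from table), where Gap = distance between the outlier and its closest neighbor
  - Rationale: Unclear. Will not detect two approximately equal outliers.
- IQR-Based
  - Outlier(s) iff: x is not in the interval \( \left[\, Q_1 - k(Q_3 - Q_1),\ \ Q_3 + k(Q_3 - Q_1) \,\right] \)
  - Rationale: Can customize k based on choice of distribution, α, and n. For example, in a normal distribution, k = 3 implies that < % of points should fall outside the range.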
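
Two of the test statistics in the rules above are simple to compute, even though their critical values come from tables or a t distribution. This sketch computes only the statistics: the function names and the data are hypothetical, and comparing the results against a critical value is left to the analyst.

```python
import statistics


def grubbs_statistic(data):
    """Grubbs' G: the largest absolute deviation from the mean, in sample standard deviations."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return max(abs(x - mean) for x in data) / s


def dixon_q_ratio(data):
    """Dixon's Gap/Range for the more isolated end of the sorted data."""
    xs = sorted(data)
    data_range = xs[-1] - xs[0]
    gap_low = xs[1] - xs[0]      # gap between the smallest point and its neighbor
    gap_high = xs[-1] - xs[-2]   # gap between the largest point and its neighbor
    return max(gap_low, gap_high) / data_range


# Hypothetical data with one suspicious high value, invented for illustration.
data = [10, 11, 12, 13, 20]
g = grubbs_statistic(data)
q = dixon_q_ratio(data)
print(f"G = {g:.4f}, Gap/Range = {q:.2f}")
```

As the table notes for Dixon's Q, two nearly equal extreme points shrink the gap and can mask each other, so it helps to look at a scatter plot alongside any of these statistics.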