Kerala School of Mathematics
Course in Statistics for Scientists
Introduction to Data Analysis
T. Krishnan
Strand Life Sciences, Bangalore

What is Data Analysis
  - Statistics is a body of methods
      - how to use numbers to elucidate rather than to mislead
  - Statisticians work in many areas
      - probability, exploratory data analysis, modelling, social policy, decision making, and others
  - two fundamental tasks: description and inference
  - Description involves characterizing a batch of data in simple but informative ways, including graphically
  - Inference involves generalizing from a sample of data to a larger population of possible data
  - Descriptive statistics help us to observe more acutely
  - Inferential statistics help us to formulate and test hypotheses

Steps in a Statistical Study
  - plan the study
  - understand background and collect questions and issues
  - collect data
  - check the data for errors
  - explore the data
  - review the initial questions
  - generate hypotheses and build statistical models
  - analyze residuals and review hypotheses and models
  - interpret and make recommendations

Exploration of Data
  - clean and sanitize the data
      - check validity of the values
      - check for missing values and deal with them
      - check for outliers and deal with them
  - understand the data
      - make tables, charts, graphs
      - check if there are groups in the data
      - make transforms if needed
      - check for standard assumptions

Descriptive Measures
  - what measures to use depends on data and purpose
      - mean, median, mode, variance, standard deviation, range, inter-quartile range, etc.
  - depends on nature of distribution
      - symmetric, skewed, outliers, tails (light or heavy), etc.
  - shall discuss in the Descriptive Statistics presentation

Graphical Representation of Data
  - what graphs to use depends on nature of data and purpose
  - be careful not to mislead
  - graphs before, during, and after data analysis
  - Graphs before: visual representation of data and summaries, e.g., bar chart
  - Graphs during: check assumptions and model fit, e.g., normal probability plot
  - Graphs after:
      - check assumptions, e.g., normal probability plot of residuals
      - present results: parameter estimates, say in log-linear models

[Figures: Reasonable Graph; Overemphasized Graph]
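
As a rough illustration of the descriptive measures listed above, the sketch below computes them with Python's standard `statistics` module. The helper name `describe` and the small data set are made up for illustration; the quartiles use the module's default "exclusive" method, and other software may use a slightly different quartile rule.

```python
import statistics

def describe(values):
    """Return common descriptive measures for a list of numbers (illustrative helper)."""
    ordered = sorted(values)
    # Quartiles by the module's default 'exclusive' method; other rules exist.
    q1, _median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "variance": statistics.variance(ordered),   # sample variance (divisor n - 1)
        "std_dev": statistics.stdev(ordered),
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
    }

# Made-up batch with one extreme value in the right tail.
batch = [2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 40]
with_extreme = describe(batch)
without_extreme = describe(batch[:-1])

for name in ("mean", "median", "std_dev", "iqr"):
    print(f"{name:8s} with: {with_extreme[name]:7.2f}   without: {without_extreme[name]:7.2f}")
```

In this batch the single extreme value moves the mean and the standard deviation a good deal, while the median is unchanged, which is one reason the choice of measure depends on the nature of the distribution.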

Inferential Statistics
  - want to do more than describe the sample
      - generalize, formulate a policy, test a hypothesis, make an inference, classify, predict
  - inference implies that we think a model describes a more general population from which our data have been randomly sampled
  - when you make inferences, you should have a population in mind
  - finite and infinite populations

Population, Sample, Statistical Inference
  - to use inferential methods to estimate the mean age of India's population on 1 April 2001
  - could enter all N ages into a SYSTAT file and compute the mean age exactly
  - if practical, this is the preferred method: the census method
  - sometimes, a sampling estimate can be more accurate than an entire census
  - biases are introduced into large censuses from refusals to comply, keying or coding errors, and other sources
  - a carefully constructed random sample can yield less-biased information about the population
  - it is the analyst's responsibility to ensure that the sample is representative of the larger group (population) on all attributes that might affect the results
  - more on this in Survey Sampling

Computing Aids and Statistical Analysis
  - Data analysis with
      - hand computation
      - mechanical calculator
      - electronic calculator with functions
      - electronic computer
      - electronic computer with subroutine packages
      - menu-driven software packages

Types of Statistical Software
  - Level 1: Excel, MatLab, StatGraphics, Statview
      - limited statistical features; menu-driven; easy to learn and use
  - Level 2: JMP, SPSS, MINITAB, SYSTAT, STATISTICA
      - statistical software packages; more comprehensive features; menu-driven with command-line windows; moderate cost
  - Level 3: SAS, S-PLUS
      - statistical software packages; for expert users; command-line driven; very comprehensive; sophisticated features; very expensive
  - Level 4: R
      - freeware; command-driven; a somewhat steep learning curve
  - Level 5: BUGS, MRBAYES
      - statistical software packages for specialized uses
  - base module + optional add-on modules or toolboxes
  - Many software packages have simpler, less expensive, or free
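
Returning to the Population, Sample, Statistical Inference slide: the following is a minimal sketch, on a synthetic population, of how a simple random sample estimates a mean that a census would compute exactly. The function name `estimate_mean`, the population size, the seed, and the uniform age distribution are all assumptions chosen for illustration, not real census data. The sketch shows sampling variability only; it does not model the refusal or coding biases mentioned on the slide.

```python
import math
import random
import statistics

def estimate_mean(sample, population_size):
    """Estimate a population mean from a simple random sample.

    Returns (estimate, standard_error). The standard error includes the
    finite-population correction, so it is 0 when the sample is a census.
    """
    n = len(sample)
    estimate = statistics.mean(sample)
    fpc = 1 - n / population_size
    standard_error = statistics.stdev(sample) / math.sqrt(n) * math.sqrt(fpc)
    return estimate, standard_error

# Synthetic stand-in for a finite population of ages (not real data).
rng = random.Random(2001)
population = [rng.randint(0, 90) for _ in range(100_000)]

census_mean = statistics.mean(population)   # exact: uses all N values
sample = rng.sample(population, 1_000)      # simple random sample without replacement
sample_mean, se = estimate_mean(sample, len(population))

print(f"census mean: {census_mean:.2f}")
print(f"sample mean: {sample_mean:.2f} (standard error {se:.2f})")
```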

Using a Software
  - When using a software package for data analysis:
      - Don't be blind to the data set
      - Formulate the issues to be resolved
      - Examine assumptions
      - Analyse by alternative methods
      - Investigate suitable methods
      - Examine the software
      - Use computer-intensive methods
      - Exploratory Data Analysis

Data Files
  - be aware that almost every data set is likely to be polluted
      - errors, incompleteness, and other inadequacies
      - especially those data sets obtained or imported from different sources
  - some of the common sources of errors:
      - typing errors or data entry errors
      - coding errors
      - measurement errors
      - missing values

Data Cleaning
  - detection of errors
  - correction of errors
  - missing value imputation
  - detection of outliers (elimination?)
  - finding groups (lack of homogeneity)
  - need for transformations
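
The first three cleaning steps above (detect errors, handle them, impute missing values) might look like the sketch below for a single numeric column. Everything here is an assumption made for illustration: the helper name `clean_ages`, the use of `None` as the missing-value marker, the plausible range 0 to 120, and median imputation as the fill rule. Out-of-range entries are simply treated as missing, since the intended value usually cannot be recovered automatically.

```python
import statistics

MISSING = None  # assumed marker for a missing value

def clean_ages(raw, low=0, high=120):
    """Flag implausible values as missing, then impute with the median of valid values.

    Returns (cleaned_values, number_of_values_imputed).
    """
    # Validity check: anything outside [low, high] is treated as an error.
    checked = [
        v if (v is not MISSING and low <= v <= high) else MISSING
        for v in raw
    ]
    observed = [v for v in checked if v is not MISSING]
    fill = statistics.median(observed)          # one simple imputation rule among many
    cleaned = [fill if v is MISSING else v for v in checked]
    return cleaned, len(checked) - len(observed)

# Made-up column with a missing entry, a likely typing error (410), and a negative age.
raw_ages = [34, None, 29, 410, 51, -3, 45]
cleaned_ages, n_imputed = clean_ages(raw_ages)
print(cleaned_ages, n_imputed)
```

Median imputation is used only because it is simple and resistant to extreme values; whether to impute at all, and how, depends on the data and the purpose of the analysis.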

Data Cleaning Tools
  - cross tabulation
  - bar charts
  - descriptive statistics
  - graphical displays
      - box plots (outliers)
      - density plots (mixture of groups)

[Slides without recoverable text: Descriptive Statistics; Crosstabulation; Outliers]
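
Two of the tools named above can be sketched in a few lines: a cross tabulation, which tends to expose coding errors as unexpected categories, and the usual box-plot rule that flags points more than 1.5 times the inter-quartile range beyond the quartiles. The function names, the records, and the numbers are hypothetical; the 1.5 multiplier is the common box-plot convention, not something the slides specify.

```python
from collections import Counter
import statistics

def crosstab(rows, row_key, col_key):
    """Count records for each (row category, column category) pair."""
    return Counter((r[row_key], r[col_key]) for r in rows)

def boxplot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (conventional box-plot fences)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up records; the lowercase "m" plays the role of a coding error.
records = [
    {"sex": "M", "smoker": "yes"},
    {"sex": "F", "smoker": "no"},
    {"sex": "m", "smoker": "no"},
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "no"},
]
table = crosstab(records, "sex", "smoker")
for (sex, smoker), count in sorted(table.items()):
    print(f"sex={sex!r:4} smoker={smoker!r:5} count={count}")

measurements = [12, 14, 15, 15, 16, 17, 18, 19, 21, 22, 95]
flagged = boxplot_outliers(measurements)
print("flagged as outliers:", flagged)
```

A flagged value is a candidate for checking, not automatically an error; whether to correct, keep, or eliminate it is a separate decision.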

[Slides without recoverable text: Mixtures; Transformations]
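
Since the content of the Transformations slide is not available, the following is only a generic sketch of why a transformation may be needed: a logarithm can make a right-skewed batch roughly symmetric. The data are made up, and the moment-based skewness coefficient and the helper name `skewness` are choices made for this illustration.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness coefficient: m3 / m2**1.5 (0 for a symmetric batch)."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Made-up right-skewed batch (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(v) for v in raw]   # log transform; requires positive values

print(f"skewness before log transform: {skewness(raw):.3f}")
print(f"skewness after log transform:  {skewness(logged):.3f}")
```

The log transform applies only to positive data; other transformations (square root, reciprocal, and so on) may suit other shapes.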