Glossary Brase: Understandable Statistics, 10e A | B This is the notation used to represent the conditional probability of A given B. A and B This represents the probability that both events A and B occur. This can be calculated using the multiplication rules of probability. A or B This represents the probability that one or both of events A or B occur. It can be calculated by taking the probability of A plus the probability of B minus the probability of both A and B. Addition rules for probability For mutually exclusive events, the probabilities can simply be added together. If the events can be co-occurring, you should subtract the probability of a co-occurring event, which is otherwise counted twice (once for each event). alpha (type I error) α The probability of making a type I error is denoted by the Greek letter alpha, α. Alternate hypothesis H1 The alternate hypothesis is the statement you will adopt in the situation in which the evidence (data) is so strong that you reject the null hypothesis. A statistical test is designed to assess the strength of the evidence (data) against the null hypothesis. ANOVA An analysis of variance (ANOVA) allows us to compare several sample means. ANOVA requires that the groups are independent, randomly selected, and come from normally distributed populations with approximately the same standard deviation. Typically, the null hypothesis states that the groups all come from the same population and therefore have the same mean. Area under the standard normal curve There are extensive tables that show the area under the standard normal curve for almost any interval along the z axis. The areas are important because each area is equal to the probability that the measurement of an item selected at random falls in this interval. Areas under any normal curve To find areas and probabilities for a random variable x that follows a normal distribution, convert x values to z values. 
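The x-to-z conversion described above can be sketched in Python. This is an illustrative sketch, not code from the text; the function names are invented, and the error function stands in for a z table lookup.

```python
import math

def z_score(x, mu, sigma):
    """Convert a raw value x to a standard z value."""
    return (x - mu) / sigma

def area_left_of(z):
    """Area under the standard normal curve to the left of z.
    The error function plays the role of a z table lookup."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: P(x < 110) for a normal population with mu = 100, sigma = 15
z = z_score(110, 100, 15)
prob = area_left_of(z)  # roughly 0.75
```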
Arithmetic mean The arithmetic mean is often simply referred to as the mean. Average An average is a single number used to describe an entire sample or population. Back-to-back stem plot Back-to-back stem plots are used to compare two sets of data that share common stems. The stems are aligned vertically in a central column. The first set of leaves is displayed to the right, as in a regular stem-and-leaf display. The second set of leaves is displayed to the left, increasing outward. Bar graph In a bar graph, bars are of uniform width and uniformly spaced. The bars can be vertical or horizontal. The lengths of the bars represent values of the variable being displayed, the frequency of occurrence, or the percentage of occurrence. The same measurement scale is used for the length of each bar. Bar graphs should be well labeled. Bayes's theorem Bayes's theorem is an important relation for conditional probabilities that lets us calculate an unknown conditional probability based on other known probabilities. Bell-shaped curve The normal curve is also called a bell-shaped curve. beta (type II error) β The probability of making a type II error is denoted by the Greek letter beta, β. Bimodal distribution This term refers to a histogram in which the two classes with the largest frequencies are separated by at least one class. The top two frequencies of these classes may have slightly different values. This type of situation sometimes indicates that we are sampling from two different populations. Binomial coefficient The binomial coefficient represents the number of combinations of n distinct objects taken r at a time. Binomial experiment The central problem of a binomial experiment is to find the probability of r successes out of n trials. Each trial is independent and has one of two outcomes, success or failure.
Binomial probability distribution The binomial probability distribution can be used to compute the probability of r successes for any number of trials. To find the binomial distribution, take the probability of getting one outcome with r successes and n - r failures and multiply it by the number of outcomes that have r successes and n - r failures. Block A block is a group of individuals sharing some common features that might affect the treatment. Box-and-whisker plot A box-and-whisker plot is a visual representation of a five-number summary. To create a box-and-whisker plot, draw a vertical scale to include the lowest and highest data values. To the right of the scale, draw a box from Q1 to Q3. Include a solid line through the box at the median level. Draw vertical lines, called whiskers, from Q1 to the lowest value and from Q3 to the highest value. Confidence interval c The value c is the proportion of confidence intervals, based on random samples of size n, that actually contain the population mean. Categorical variable Sometimes qualitative variables are referred to as categorical variables. Census In a census, measurements or observations from the entire population are used. Central limit theorem The central limit theorem says that x can have any distribution whatsoever, but as the sample size gets larger and larger, the distribution of x-bar will approach a normal distribution. Chebyshev's theorem The data spread about the mean can be expressed generally for all distributions by Chebyshev's theorem. For any set of data (either population or sample) and for any constant k greater than 1, the proportion of the data that must lie within k standard deviations on either side of the mean is at least 1 minus the reciprocal of k squared. Chi-square distribution The chi-square distribution is non-symmetrical and varies depending on the degrees of freedom. 
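The binomial probability computation described above (the probability of one outcome with r successes, times the number of such outcomes) can be sketched in Python; the function name is illustrative, not from the text.

```python
from math import comb

def binomial_prob(n, r, p):
    """Probability of exactly r successes in n independent trials,
    each with success probability p: C(n, r) * p**r * (1-p)**(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Example: exactly 3 heads in 5 tosses of a fair coin
prob = binomial_prob(5, 3, 0.5)  # 10 * 0.5**5 = 0.3125
```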
Chi-square U, chi-square L Chi-square U is the value for the upper area of the curve (the right-tail), while chi-square L is the value for the lower area of the curve (the left-tail). Circle graph A circle graph is another name for a pie chart. Class boundaries There is a space between the upper limit of one class and the lower limit of the next class. The halfway points of these intervals are called class boundaries. Class frequency Examine each data value. Determine which class contains the data value and make a tally mark beside that class. The class frequency for a class is the number of tally marks corresponding to that class. Class lower limit The lower class limit is the lowest data value that can fit in a class. Class mark The class mark is another name for the class midpoint. Class midpoint The center of each class is called the midpoint (or class mark). The midpoint is often used as a representative value of the entire class. The midpoint is found by adding the lower and upper class limits of one class and dividing by 2. Class upper limit The upper class limit is the highest data value that can fit in a class. Class width The class width is the difference between the lower class limit of one class and the lower class limit of the next class. To find the class width, subtract the smallest data value from the largest data value. Divide the result by the desired number of classes, and increase the computed value to the next highest whole number. Cluster sample Divide the entire population into pre-existing segments or clusters. The clusters are often geographic. Make a random selection of clusters. Include every member of each selected cluster in the sample. Coefficient of determination r2 The coefficient of determination r2 is the square of the sample correlation coefficient r. It allows us to determine how good the least-squares line is as an instrument of regression. It is calculated as the ratio of explained variation over total variation.
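The class width and class midpoint recipes above can be sketched in Python. The helper names are illustrative, not from the text.

```python
import math

def class_width(data, num_classes):
    """Range divided by the desired number of classes,
    increased to the next highest whole number."""
    spread = max(data) - min(data)
    return math.floor(spread / num_classes) + 1

def class_midpoint(lower, upper):
    """Midpoint of a class: average of its lower and upper limits."""
    return (lower + upper) / 2

data = [12, 47, 23, 35, 8, 41, 29, 50, 17, 33]
width = class_width(data, 5)   # (50 - 8)/5 = 8.4, raised to 9
mid = class_midpoint(8, 16)    # 12.0
```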
Coefficient of multiple determination The coefficient of multiple determination allows us to determine how good a fit the least-squares regression is for a given set of data. The coefficient of multiple determination is a direct generalization of the concept of coefficient of determination between two variables. Coefficient of variation The coefficient of variation is used to express the standard deviation as a percentage of the sample or population mean. Calculate the coefficient of variation by dividing the standard deviation by the mean and multiplying the result by 100. Column total In a contingency table, the column total gives us the total number of data points that correspond with one of the variables. The column totals should sum to the number of total data points. Combinations rule The number of combinations of n objects taken r at a time is the permutations divided by r factorial, where n and r are whole numbers and n is greater than or equal to r. Another commonly used notation for combinations is nCr. Complement of event A The complement of event A is the event that A does not occur. A probability and its complement sum to 1. Completely randomized design For one-way ANOVA, we have one factor. Different levels for the factor form the treatment groups under study. In a completely randomized design, independent random samples of experimental subjects or objects are selected for each treatment group. Completely randomized experiment A completely randomized experiment is one in which a random process is used to assign each individual to one of the treatments. Conditional probability Conditional probability is the probability that a dependent event will occur given that another event has occurred. Confidence interval for mean difference (standard deviations known) To find the confidence interval for a population mean difference with known standard deviations, first obtain two independent random samples from both populations. 
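The combinations rule above (permutations divided by r factorial) can be sketched in Python; the function names are illustrative.

```python
from math import factorial

def permutations(n, r):
    """Number of ordered arrangements of n objects taken r at a time."""
    return factorial(n) // factorial(n - r)

def combinations(n, r):
    """Combinations rule: permutations divided by r! (requires n >= r)."""
    return permutations(n, r) // factorial(r)

# Example: choosing 3 committee members from 8 candidates
count = combinations(8, 3)  # 56
```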
If you can assume that both population distributions are normal, any sample sizes will work. If you cannot assume this, then use sample sizes greater than or equal to 30 for both populations. The confidence interval for the population mean is the difference in sample means, plus or minus the margin of error. To calculate the margin of error, take the first population variance divided by the first sample size, and add it to the second population variance divided by the second sample size. Take the square root of the calculated value and multiply by the critical value zc for the desired confidence level c. Confidence interval for mean difference (standard deviations unknown) When the population standard deviations are unknown, we turn to a Student's t distribution to find the difference in population means. The confidence interval for the population mean is the difference in sample means, plus or minus the margin of error. To calculate the margin of error, take the first sample variance divided by the first sample size, and add it to the second sample variance divided by the second sample size. Take the square root of the calculated value and multiply by the critical value tc for the desired confidence level c. When determining tc, use the degrees of freedom for the distribution with the smallest sample size. Confidence interval for p A c confidence interval for p is the interval from p-hat minus the margin of error to p-hat plus the margin of error; c is the probability that an interval generated in this way contains p. Confidence interval for p1 - p2 The confidence interval for the difference of two binomial probability distributions is centered around the difference in p-hat values, plus or minus the margin of error. To calculate the margin of error, first multiply together the point estimates for success and failure, divided by the number of trials. Do this for both distributions and add them together to obtain the variance. 
Take the square root and multiply by the critical value to obtain the margin of error. Confidence interval for the population mean A c confidence interval for the population mean is an interval computed from sample data in such a way that c is the probability of generating an interval containing the actual value of the population mean. Confidence interval for the variance There are situations where we are interested in estimating the variability of a distribution rather than the expected value. Confidence level c The reliability of an estimate is measured by the confidence level. Suppose we want a confidence level of c. Theoretically, you can choose c to be any value between 0 and 1, but usually c is equal to a number such as 0.90, 0.95, or 0.99. Confounding variable Two variables are confounded when the effects of one cannot be distinguished from the effects of the other. Confounding variables may be part of the study, or they may be outside lurking variables.
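The confidence interval for p described above can be sketched in Python, assuming a normal approximation is appropriate for the sample. The function name is illustrative, and zc = 1.96 is the usual critical value for 95% confidence.

```python
import math

def proportion_interval(r, n, zc=1.96):
    """Confidence interval for a population proportion p:
    p-hat plus or minus zc * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = r / n
    margin = zc * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Example: 120 successes in 400 trials, 95% confidence
low, high = proportion_interval(120, 400)  # roughly (0.255, 0.345)
```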
Contingency table with cells A contingency table is used to record the expected frequencies when comparing two factors. Each cell in the table corresponds to a specific combination of the two factors we are interested in measuring. Based on the null hypothesis, we should be able to pre-calculate the expected values for each cell in the table by making assumptions about the probabilities for each factor. Continuity correction Adjusting the values of discrete random variables to obtain a corresponding range for a continuous random variable is called making a continuity correction. If the discrete variable is a left point of an interval, subtract 0.5 to obtain the corresponding normal variable. If the discrete variable is a right point of an interval, add 0.5 to obtain the corresponding normal variable. Continuity correction for a p-hat distribution For a number of successes r and a total number of trials n, continuity correction can be used to convert a discrete p-hat distribution to a continuous x distribution. If r/n is the right endpoint of a p-hat interval, add 0.5/n to get the corresponding right endpoint of the x interval. If r/n is the left endpoint of a p-hat interval, subtract 0.5/n to get the corresponding left endpoint of the x interval. Continuous random variable A continuous random variable can take on any of the countless number of values in a line interval. Control chart If we are examining data over a period of equally spaced time intervals or in some sequential order, then control charts are especially useful. Control charts combine graphic and numerical descriptions of data with probability distributions. A control chart for a variable x is a plot of the observed x values in time sequence order. Control group In general, a control group is used to account for the influence of other known or unknown variables that might be an underlying cause of a change in response in the experimental group. 
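The continuity-correction rules above amount to a pair of small helpers; a sketch in Python (the names are invented for illustration):

```python
def continuity_correct(left, right):
    """Convert a discrete interval [left, right] to the corresponding
    continuous interval: subtract 0.5 on the left, add 0.5 on the right."""
    return left - 0.5, right + 0.5

def phat_correct(r, n):
    """Continuity correction for a p-hat endpoint r/n: corrected
    left and right endpoints of the corresponding x interval."""
    return (r - 0.5) / n, (r + 0.5) / n

# Example: P(8 <= r <= 12) for a discrete count becomes P(7.5 <= x <= 12.5)
lo, hi = continuity_correct(8, 12)
```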
Convenience sample Create a sample by using data from population members that are readily available. Correlation and causation The correlation coefficient is a mathematical tool for measuring the strength of a linear relationship between two variables. As such, it makes no implication about cause or effect. Correlation between averages The correlation between two variables consisting of averages is usually higher than the correlation between two variables representing corresponding raw data. One reason is that the use of averages reduces the variation that exists between individual measurements. Criteria for using normal approximation to binomial, np > 5 and nq > 5 For a distribution with a sufficiently large number of trials, the normal distribution can be used to approximate the binomial distribution. The number of trials multiplied by the probability of failure should be greater than 5. The number of trials multiplied by the probability of success should also be greater than 5, and this value can be used as the mean. Multiply together the number of trials, the probability of success, and the probability of failure to obtain the variance. Take the square root of the variance to get the standard deviation. Critical region The values of a distribution for which we reject the null hypothesis are called the critical region of the distribution. Depending on the alternate hypothesis, the critical region is located on the left side, the right side, or both sides of the distribution. Critical value Critical values are the boundaries of the critical region. Critical values are designated as z0 for the standard normal distribution. Critical values tc Critical values tc for a c confidence level indicate the values such that an area equal to c under the t distribution for a given number of degrees of freedom falls between -tc and tc. 
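The np > 5 and nq > 5 criteria above, together with the mean and standard deviation of the approximating normal curve, can be sketched in Python (illustrative helper names):

```python
import math

def normal_approx_ok(n, p):
    """Check the np > 5 and nq > 5 criteria for using a normal
    approximation to the binomial distribution."""
    q = 1 - p
    return n * p > 5 and n * q > 5

def binomial_normal_params(n, p):
    """Mean np and standard deviation sqrt(npq) of the approximating
    normal distribution."""
    q = 1 - p
    return n * p, math.sqrt(n * p * q)

ok = normal_approx_ok(50, 0.3)               # True: np = 15, nq = 35
mu, sigma = binomial_normal_params(50, 0.3)  # 15, about 3.24
```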
Critical values zc For a confidence level c, the critical value zc is the number such that the area under the standard normal curve between -zc and zc equals c. Cumulative frequency The cumulative frequency for a class is the sum of the frequencies for that class and all previous classes. Degrees of freedom Values of the variable t corresponding to what we call the number of degrees of freedom, abbreviated d.f. For the methods used in this section, the number of degrees of freedom is given by the formula d.f. = n - 1 where d.f. stands for the degrees of freedom and n is the sample size. Each choice for d.f. gives a different t distribution. Degrees of freedom d.f. for denominator for F distribution The degrees of freedom for the denominator of an F distribution is typically the total sample size across all groups minus the number of groups.
Degrees of freedom d.f. for numerator for F distribution The degrees of freedom for the numerator of an F distribution is typically the number of sample groups minus 1. Degrees of freedom d.f. for the chi-square distribution and tests of independence The degrees of freedom for a chi-square test of independence can be found by taking the number of rows - 1 and multiplying by the number of columns - 1. Degrees of freedom d.f. for the chi-square distribution and goodness-of-fit test For a goodness-of-fit test, the number of degrees of freedom is the number of categories minus 1. Degrees of freedom for testing population mean (population standard deviation unknown) If the standard deviation is unknown, you can still estimate the population mean by using a t distribution. If you can assume that your random variable is normally distributed, any sample size will work. Otherwise, be sure to choose a sample size greater than or equal to 30. Use the sample size minus 1 to obtain the degrees of freedom and select a t distribution. Degrees of freedom for testing the difference in population means when the population standard deviations are unknown If the standard deviations for the two distributions are unknown, you can still estimate the population mean difference by using a t distribution. If you can assume that your random variables are both normally distributed, any sample sizes will work. Otherwise, be sure to choose sample sizes greater than or equal to 30. Use the sample size of the smaller distribution minus 1 to obtain the degrees of freedom and select a t distribution. Dependent events If events are dependent, the probability of one event depends upon the occurrence of the other event. Dependent samples Two sampling distributions are dependent if there is a relationship between corresponding data values in the two distributions. Paired data are an example of dependent samples. 
Descriptive statistics Descriptive statistics involves methods of organizing, picturing, and summarizing information from samples or populations. Discrete random variable A discrete random variable can take on only a finite number of values or a countable number of values. Dotplot A dotplot is somewhat similar to a histogram. In a dotplot, the data values are displayed along the horizontal axis. A dot is then plotted over each data value in the data set. Double-blind experiment This means that neither the individuals in the study nor the observers know which subjects are receiving the treatment. Double-blind experiments help control for subtle biases that an observer might pass on to a subject. EDA EDA stands for Exploratory Data Analysis. Empirical rule For a distribution that is symmetrical and bell-shaped (in particular, for a normal distribution): Approximately 68% of the data values will lie within one standard deviation on each side of the mean. Approximately 95% of the data values will lie within two standard deviations on each side of the mean. Approximately 99.7% (or almost all) of the data values will lie within three standard deviations on each side of the mean. Equally likely outcomes When outcomes are equally likely, the probability of an event is simply the number of favorable outcomes divided by the total number of outcomes. Error variation in ANOVA The error variation corresponds to the within-group variation of one-way ANOVA. Expected frequency of a cell, E In a contingency table, we might propose the null hypothesis that two factors are independent. In such a case, we can calculate the expected frequency of a cell by simply multiplying together the probabilities of each factor which is assumed to be independent. Computationally, this is equivalent to multiplying the row total for one factor by the column total for another factor, and dividing the result by the sample size. 
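The expected cell frequency computation above (row total times column total, divided by the sample size) is a one-liner; a sketch in Python with an illustrative function name:

```python
def expected_frequency(row_total, column_total, sample_size):
    """Expected cell count under the independence hypothesis:
    (row total * column total) / sample size."""
    return row_total * column_total / sample_size

# Example: a cell whose row total is 60 and column total is 40,
# in a contingency table of 200 observations
e = expected_frequency(60, 40, 200)  # 12.0
```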
Expected value The mean of a probability distribution is often called the expected value of the distribution. The expected value is an average value and need not be a point of the sample space. Experiment In an experiment, a treatment is deliberately imposed on the individuals in order to observe a possible change in the response or variable being measured.
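The expected value entry above can be illustrated in Python; the fair-die example also shows that the expected value (3.5) need not be a point of the sample space. The function name is illustrative.

```python
def expected_value(values, probs):
    """Mean (expected value) of a discrete probability distribution:
    the sum of each value times its probability."""
    return sum(x * p for x, p in zip(values, probs))

# Example: a fair six-sided die
mu = expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6)  # 3.5
```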
Explained variation Explained variation is defined as y-hat minus y-bar. It represents the difference between a base value y-bar and the least-squares line value y-hat. Explanatory variable In a scatter diagram, we call x the explanatory variable. Exploratory data analysis Exploratory data analysis techniques are particularly useful for detecting patterns and extreme data values. They are designed to help us explore a data set, to ask questions we had not thought of before, or to pursue leads in many directions. Extrapolation Predicting y values for x values that are beyond observed x values in the data set is called extrapolation. Extrapolation may produce unrealistic forecasts. F distribution The F distribution can be used to test two population variances. The F distribution is skewed to the right, and its values are always greater than zero. It depends on two separate degrees of freedom, one for each of the populations being tested. F ratio The F ratio is the sample test statistic for the F distribution. It can be calculated as the ratio of sample variances. If two populations are hypothesized to be the same, then the F ratio should be approximately 1. Factor in two-way ANOVA In a two-way ANOVA model, the two variables are called factors. Factorial For a number n, its factorial is the product of n with each of the positive counting numbers less than n. By special definition, the factorial of zero is 1. Faulty recall Respondents may not accurately remember when or whether an event took place. Five-number summary The quartiles together with the low and high data values give us a very useful five-number summary of the data and their spread. Frequency Frequency is the number of times that a value appears within a set of data. Frequency distribution A frequency distribution reflects the way that values occur with varying frequency within a set of data. 
Frequency table A frequency table partitions data into classes or intervals and shows how many data values are in each class. The classes or intervals are constructed so that each data value falls into exactly one class. Gaussian distribution The normal distribution is sometimes called Gaussian after a mathematician who studied it, Carl Friedrich Gauss. Geometric mean When data consist of percentages, ratios, growth rates, or other rates of change, the geometric mean is a useful measure of central tendency. For n data values, multiply them together and take the nth root to calculate the geometric mean. This assumes all data values are positive. Geometric probability distribution A geometric probability distribution is used to calculate the probability that our first success comes on the nth trial. The probability for the nth trial is given by the probability of success multiplied by the probability of failure raised to the n-1 power. Goodness-of-fit test The goodness-of-fit test allows us to determine whether a population follows a specified distribution. In other words, we are testing the null hypothesis that a population fits a given distribution. For goodness-of-fit tests, we use a right-tailed test on the chi-square distribution. This is because we are testing to see if the chi-square measure of the difference between the observed and expected frequencies is too large to be due to chance alone. Harmonic mean When data consist of rates of change, such as speeds, the harmonic mean is an appropriate measure of central tendency. Sum together the reciprocals of each data value. Take the total number of values and divide by the computed sum to obtain the harmonic mean. This assumes no data value is 0. Hidden bias The question may be worded in such a way as to elicit a specific response. The order of questions might lead to biased responses. Also, the number of responses on a Likert scale may force responses that do not reflect the respondent's feelings or experience.
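The geometric mean and harmonic mean recipes above can be sketched in Python. The helper names are illustrative; Python's standard statistics module also provides equivalents.

```python
import math

def geometric_mean(data):
    """nth root of the product of n positive data values."""
    return math.prod(data) ** (1 / len(data))

def harmonic_mean(data):
    """Number of values divided by the sum of their reciprocals
    (no value may be 0)."""
    return len(data) / sum(1 / x for x in data)

growth = geometric_mean([1.10, 1.20, 1.30])  # average growth factor, about 1.197
speed = harmonic_mean([60, 40])              # 48.0, average speed over equal distances
```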
Histogram In histograms, we use bars to visually represent each class. The width of the bar is the class width, and the height of the bar is the class frequency. Homogeneity test A test of homogeneity tests the claim that different populations share the same proportions of specified characteristics. This enables us to determine whether several populations share the same proportions of distinct categories. The computational processes for conducting tests of independence and tests of homogeneity are the same. The two main differences are the sampling method and the hypotheses. Hypergeometric probability distribution The hypergeometric distribution is a probability distribution of a random variable that has two outcomes when sampling is done without replacement. This is the distribution that is appropriate when the sample size is so small that sampling without replacement results in trials that are not even approximately independent. Hypotheses Hypotheses are assertions that you assume to be true for the purposes of investigation. Hypothesis testing Hypothesis testing is used to examine the validity of a hypothesis, such as the value of a parameter estimate. The central question in hypothesis testing is whether or not you think the value of the sample test statistic is too far away from the value of the population parameter proposed in the null hypothesis to occur by chance alone. Hypothesis tests about the variance There are situations where we are interested in testing variability of a distribution rather than the expected value, perhaps to find out whether variability increases or decreases given certain conditions. Tests of variance can be left-tailed, right-tailed, or two-tailed. Hypothesis tests about two variances It is sometimes useful to test the variances of two independent, normally distributed populations. 
The F-distribution can be used to test the null hypothesis that the populations share the same variance given a desired level of significance. Independence test An independence test is used to determine whether or not two factors are related to each other. This is often determined using a chi-square test. Independent events Two events are independent if the occurrence or nonoccurrence of one does not change the probability that the other will occur. Independent samples Two samples are independent if sample data drawn from one population are completely unrelated to the selection of sample data from the other population. Independent trials Trials are independent if the result of one trial has no effect on the results of other trials. Individuals Individuals are the people or objects included in the study. Inferential statistics Inferential statistics involves methods of using information from a sample to draw conclusions regarding the population. Inflection points The exact places on the normal curve where the transition between the upward and downward cupping occurs are above the points one standard deviation away from the mean. In the terminology of calculus, transition points such as these are called inflection points. Interaction in two-way ANOVA In a two-way ANOVA model, be sure to test for interaction between the two factors. If you reject the null hypothesis of no interaction, then you should not test for a difference of means in the levels of the row factors or a difference of means in the levels of the column factors because the interaction of the factors makes interpretation of the results of the main effects more complicated. Interpolation Predicting y values for x values that are between observed x values in the data set is called interpolation. Interquartile range The interquartile range is the difference between the third and first quartiles. Interval level The interval level of measurement applies to data that can be arranged in order. 
In addition, differences between data values are meaningful.
Interviewer influence Factors such as tone of voice, body language, dress, gender, authority, and ethnicity of the interviewer might influence responses. Law of large numbers In the long run, as the sample size increases and increases, the relative frequencies of outcomes get closer and closer to the theoretical (or actual) probability value. Leaf In a stem-and-leaf display, the rightmost part is called the leaf. Least-squares criterion One way to find a linear equation to represent a set of points in a scatter diagram is to use the least-squares criterion. This states that the sum of the squares of the vertical distances from the data points (x, y) to the line must be made as small as possible. Least-squares line y-hat = a + bx We use the notation y-hat = a + bx for the least-squares line. Algebra tells us that b is the slope and a is the intercept of the line. In this context, y-hat represents the value of the response variable y estimated using the least-squares line and a given value of the explanatory variable x. Left-tailed test A statistical test is left-tailed if the alternate hypothesis states that the parameter is less than the value claimed in the null hypothesis. Level in two-way ANOVA In a two-way ANOVA model, the levels of a factor are the different values the factor can assume. Level of significance α The probability with which we are willing to risk a type I error is called the level of significance of a test. It is denoted by the Greek letter alpha, α. Levels of measurement These levels indicate the type of arithmetic that is appropriate for the data, such as ordering, taking differences, or taking ratios. Likert scale Sometimes survey respondents choose a number on a scale that represents their feelings from, say, strongly disagree to strongly agree. Such a scale is called a Likert scale. Linear combination of two independent random variables Let x1 and x2 be independent random variables, and let a and b be any constants. 
Then the new random variable W = ax1 + bx2 is called a linear combination of x1 and x2. Linear function of a random variable Let a and b be any constants, and let x be a random variable. Then the new random variable L = a + bx is called a linear function of x. Lurking variable A lurking variable is one for which no data have been collected but that nevertheless has influence on other variables in the study. The fact that two variables tend to increase or decrease together does not mean a change in one is causing a change in the other. A strong correlation between x and y is sometimes due to lurking variables. Main effects in two-way ANOVA In a two-way ANOVA model, the hypothesis regarding each separate factor is called a main effect. Margin of error The margin of error is the magnitude (i.e. the absolute value) of the difference between the sample point estimate and the true population parameter value. Margin of error for polls Some polls clarify the meaning of the margin of error further by saying that it is an error due to sampling. In most polls, the margin of error is given for a 95% confidence interval. Maximal margin of error The margin of error is the magnitude of the difference between the sample mean and the population mean. In most practical problems, the population mean is unknown, so the margin of error is also unknown. However, we can compute an error tolerance E that serves as a bound on the margin of error. Using a c% level of confidence, we can say that the point estimate differs from the population mean by a maximal margin of error. The maximal margin of error is zc multiplied by the population standard deviation and divided by the square root of the sample size. Mean An average that uses the exact value of each entry is the mean (sometimes called the arithmetic mean). To compute the mean, we add the values of all the entries and then divide by the number of entries. 
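The maximal margin of error described above, E = zc * sigma / sqrt(n), can be sketched in Python (the function name is illustrative):

```python
import math

def maximal_margin_of_error(zc, sigma, n):
    """Bound E on the margin of error when estimating a population
    mean with known standard deviation: zc * sigma / sqrt(n)."""
    return zc * sigma / math.sqrt(n)

# Example: 95% confidence (zc = 1.96), sigma = 12, sample size 64
E = maximal_margin_of_error(1.96, 12, 64)  # 2.94
```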
Mean for the binomial distribution The mean for a binomial distribution is the number of trials multiplied by the probability of success on a single trial.
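The binomial mean formula μ = np can be checked directly against the expected value computed from the binomial probabilities; the n and p below are illustrative assumptions.

```python
from math import comb

# The mean of a binomial distribution is mu = n * p. As a check, compare
# the formula to the expected-value sum over the binomial probabilities.
n, p = 10, 0.3
mu = n * p

# E(r) = sum over r of r * C(n, r) * p^r * (1 - p)^(n - r)
ev = sum(r * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1))
print(mu, round(ev, 10))  # both 3.0
```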
Mean of a probability distribution The mean represents a central point or cluster point for the entire distribution. Mean of grouped data When data are grouped, such as in a frequency table or histogram, we can estimate the mean. For each class, multiply the class midpoint by the number of entries in that class. Take the sum of the computed values and divide by the total number of entries to estimate the mean. Mean of the p hat distribution The mean of the p-hat distribution is simply the probability of a successful outcome for a single trial. Mean of the x bar distribution The mean of the x-bar distribution is the same as the mean of the x distribution, denoted with the Greek letter mu, μ. Meaning of slope In the equation y-hat = a + bx, the slope b tells us how many units y-hat changes for each unit change in x. Median The median is the central value of an ordered distribution. To find the median, order the data from smallest to largest. For an odd number of data values, the median is the middle data value. For an even number of data values, take the sum of the two middle values and divide by two to obtain the median. Mode The mode of a data set is the value that occurs most frequently. Monotone relationship In a monotone relationship between variables x and y, y must always increase or always decrease as x increases. Mound-shaped symmetric distribution This term refers to a histogram in which both sides are (more or less) the same when the graph is folded vertically down the middle. MSBET, MSW The mean squares are the variance estimates needed for an ANOVA test. MSBET measures the variance between groups, and MSW measures the variance within groups. The F-ratio test statistic can be obtained by dividing MSBET by MSW. Multinomial experiments A multinomial experiment is similar to a binomial experiment, except that it accounts for more than two outcomes.
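The median rule above (middle value for an odd count, average of the two middle values for an even count) translates directly into code; this is a small illustrative sketch.

```python
# Sketch of the median rule: order the data; for an odd count take the
# middle value, for an even count average the two middle values.

def median(data):
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]          # odd count: middle value
    return (s[mid - 1] + s[mid]) / 2  # even count: mean of middle pair

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```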
To use a multinomial distribution, all trials must be independent, and each outcome must fall into exactly one of a fixed set of categories, with the same category probabilities on every trial. The tests of independence and goodness of fit are both important in multinomial experiments. Multiple regression We have statistical methods for predicting one variable in terms of another single variable. However, we can improve the reliability of our predictions if we include more relevant data and corresponding random variables in the computation of our predictions. This is done using methods of multiple regression. Multiplication rule of counting The total number of possible outcomes for a sequence of events is the product of the number of possibilities for each event in the sequence. Multiplication rules of probability (for independent and dependent events) For independent events, the probabilities can simply be multiplied. For dependent events, the probability of the first event is multiplied by the conditional probability of the second event given the first. Multistage sample Use a variety of sampling methods to create successively smaller groups at each stage. The final sample consists of clusters. Mutually exclusive events Two events are mutually exclusive or disjoint if they cannot occur together. In particular, events A and B are mutually exclusive if P(A and B) = 0. Negative binomial distribution Given a number of successes k, where the kth success occurs on trial n, we can describe the probability distribution of n using the negative binomial distribution. When k is 1, this is the geometric probability distribution. Negative correlation If low values of x are associated with high values of y and high values of x are associated with low values of y, the variables are said to be negatively correlated. No linear correlation If the points of a scatter diagram are located so that no line is realistically a good fit, we then say that the points possess no linear correlation.
Nominal level The nominal level of measurement applies to data that consist of names, labels, or categories. There are no implied criteria by which the data can be ordered from smallest to largest. Nonparametric statistics Nonparametric methods require no assumptions about the population distributions from which samples are drawn. The obvious advantages of these tests are that they are quite general and not difficult to apply. The disadvantages are that they tend to waste information and tend to result in acceptance of the null hypothesis more often than they should. As such, nonparametric tests are sometimes less sensitive than other tests. Non-parametric test Non-parametric tests are useful when you cannot make assumptions about the shape or size of a population distribution. The disadvantage of non-parametric tests is that they are less sensitive in that they tend to accept the null hypothesis more often than they should. Nonresponse Individuals either cannot be contacted or refuse to participate. Nonresponse can result in significant undercoverage of a population. Nonsampling error A nonsampling error is the result of poor sample design, sloppy data collection, faulty measuring instruments, bias in questionnaires, and so on. Normal approximation to the binomial distribution For a distribution with a sufficiently large number of trials, the normal distribution can be used to approximate the binomial distribution. The number of trials multiplied by the probability of failure should be greater than 5. The number of trials multiplied by the probability of success should also be greater than 5, and this value can be used as the mean. Multiply together the number of trials, the probability of success, and the probability of failure to obtain the variance. Take the square root of the variance to get the standard deviation. Normal curves The graph of a normal distribution is called a normal curve.
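The normal-approximation recipe above (check np > 5 and nq > 5, then use μ = np and σ = √(npq)) can be sketched as follows; n and p are illustrative assumptions.

```python
import math

# Sketch of the normal approximation to the binomial: with n trials and
# success probability p, require n*p > 5 and n*q > 5, then use
# mu = n*p and sigma = sqrt(n*p*q). Values are illustrative.
n, p = 50, 0.3
q = 1 - p
assert n * p > 5 and n * q > 5  # approximation criteria from the glossary
mu = n * p                      # mean of the approximating normal
sigma = math.sqrt(n * p * q)    # standard deviation
print(mu, sigma)  # 15.0 and about 3.24
```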
Normal distributions One of the most important examples of a continuous probability distribution is the normal distribution. Normality indicators There are several indicators that can be used to determine if data have a normal distribution. A histogram of the distribution should be roughly bell-shaped. There should be no more than one outlier, that is, no more than one value lying more than 1.5 interquartile ranges above the third quartile or below the first quartile. Normal distributions are symmetric and should have a Pearson's index value between -1 and 1. In addition, a normal quantile plot of the data should have points close to a straight line. Null hypothesis H0 The null hypothesis is the statement that is under investigation or being tested. Usually the null hypothesis represents a statement of no effect, no difference, or, put another way, things haven't changed. Observational study In an observational study, observations and measurements of individuals are conducted in a way that doesn't change the response or the variable being measured. Observed frequency of a cell, O In a contingency table, the observed frequency is simply the number of actual observed data points that share the two factors being compared. Odds Odds are the ratio of the probability of an event to the probability of its complement. Ogive An ogive (pronounced "oh-jive") is a graph that displays cumulative frequencies. One-way ANOVA A single-factor analysis of variance is called one-way ANOVA. Ordinal level The ordinal level of measurement applies to data that can be arranged in order. However, differences between data values either cannot be determined or are meaningless. Outlier Some data sets include values so high or so low that they seem to stand apart from the rest of the data. These data are called outliers. Outliers may represent data collection errors, data entry errors, or simply valid but unusual data values.
Out-of-control signals A random variable x is said to be out of control if successive time measurements of x indicate that it is no longer following the target probability distribution. This can be used as a warning signal that a process is out of control. Paired data Paired data can be used when there is a natural matching of characteristics. For example, data pairs occur very naturally in before and after situations, where the same object or item is measured both before and after a treatment. Using matched or paired
data often can reduce the danger of introducing extraneous or uncontrollable factors into our sample measurements because the matched or paired data have essentially the same characteristics except for the one characteristic that is being measured. Paired data values (x, y) Studies of correlation and regression of two variables usually begin with a graph of paired data values (x, y). Parameter A parameter is a numerical measure that describes an aspect of a population. Parametric test A parametric test is a statistical test that requires certain assumptions such as a normal distribution or a large sample size. Pareto chart A Pareto chart is a bar graph in which the bar height represents frequency of an event. In addition, the bars are arranged from left to right according to decreasing height. P-Chart A P-Chart is a control chart for proportions r/n, where r is the number of successes out of a number of trials n. Pearson correlation coefficient The Pearson correlation coefficient is a mathematical measurement that describes the strength of the linear association between two variables, denoted by the letter r. Percentile There are 99 percentiles, and in an ideal situation, the 99 percentiles divide the data set into 100 equal parts. However, if the number of data elements is not exactly divisible by 100, the percentiles will not divide the data into equal parts. Perfect linear correlation If all the points in a scatter diagram lie on a line, then we have perfect linear correlation. In statistical applications, perfect linear correlation almost never occurs. Permutations rule The number of ways to arrange in order n distinct objects, taking them r at a time, is n factorial divided by (n - r) factorial, where n and r are whole numbers and n is greater than or equal to r. Another commonly used notation for permutations is nPr.
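The permutations rule above, P(n, r) = n! / (n − r)!, is a one-line computation; this sketch checks it on a couple of small illustrative cases.

```python
from math import factorial

# Sketch of the permutations rule: P(n, r) = n! / (n - r)!, the number of
# ordered arrangements of n distinct objects taken r at a time.

def permutations(n, r):
    return factorial(n) // factorial(n - r)

print(permutations(5, 2))  # 20 ordered pairs from 5 objects
print(permutations(4, 4))  # 24 orderings of all 4 objects
```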
Pie chart In a circle graph or pie chart, wedges of a circle visually display proportional parts of the total population that share a common characteristic. Placebo effect The placebo effect occurs when a subject receives no treatment but (incorrectly) believes he or she is in fact receiving treatment and responds favorably. Point estimate for p, p-hat For a binomial distribution, p-hat is the number of successes divided by the number of trials. This can be used as a point estimate for p, the population proportion of successes. Point estimate for the population mean A point estimate of a population parameter is an estimate of the parameter using a single number. A sample mean is a point estimate of the population mean. Poisson approximation to the binomial For most practical purposes, the Poisson distribution will be a very good approximation to the binomial distribution provided the number of trials n is larger than or equal to 100, and the number of trials n multiplied by the probability of success p is less than 10. As n gets larger and p gets smaller, the approximation becomes better and better. Poisson probability distribution If we examine the binomial distribution as the number of trials n gets larger and larger while the probability of success p gets smaller and smaller, we obtain the Poisson distribution. The Poisson distribution applies to accident rates, arrival times, defect rates, the occurrence of bacteria in the air, and many other areas of everyday life. Pooled estimates of proportion, p-bar If two distributions are assumed to have the same proportion of successes, you can use a pooled best estimate, which is the sum of the observed number of successes for both trials divided by the sum of the total combined number of trials. Pooled standard deviation When there is reason to believe that two distributions have the same standard deviation, it is best to use a t distribution with a pooled standard deviation. 
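The Poisson approximation described above (λ = np, with the rule of thumb n ≥ 100 and np < 10) can be checked numerically; the n, p, and r values below are illustrative assumptions.

```python
import math

# Sketch comparing a binomial probability with its Poisson approximation
# (lambda = n * p), under the rule of thumb n >= 100 and n * p < 10.
n, p, r = 200, 0.02, 3
lam = n * p  # Poisson parameter, here 4.0

# Exact binomial probability of r successes in n trials
binom = math.comb(n, r) * p**r * (1 - p)**(n - r)
# Poisson approximation: e^(-lambda) * lambda^r / r!
poisson = math.exp(-lam) * lam**r / math.factorial(r)
print(round(binom, 4), round(poisson, 4))  # the two values agree closely
```

As n grows and p shrinks with np held moderate, the two probabilities get closer, which is the sense in which the approximation "becomes better and better."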
The corresponding Student's t distribution has degrees of freedom equal to the sum of both sample sizes minus 2. Population correlation coefficient rho, ρ From a population of (x, y) pairs, we may be able to compute the population correlation coefficient if certain conditions are met. Specifically, the (x, y) pairs are assumed to be representative of all possible (x, y) pairs, and both x
and y values should be normally distributed; that is, for each fixed x, the y values are normally distributed, and for each fixed y, the x values are normally distributed. We denote the population correlation coefficient using the Greek letter rho, ρ. Population data In population data, the data are from every individual of interest. Population mean, μ We use the lowercase Greek letter mu, μ, to represent the population mean. The population mean is taken over the entire population. Population parameters Population parameters are taken over an entire population instead of just a sample. When we see Greek letters used, we know the information given is from the entire population rather than just a sample. Population size The population size N is the number of all possible data values in the entire population. It is used to calculate the population mean, the population variance, and the population standard deviation. Population slope beta, β The population slope gives us the rate at which y changes per unit change in x. It is denoted by the Greek letter beta, β, and is part of the population least-squares equation y = α + βx. It is estimated by b in the equation y-hat = a + bx. Population standard deviation If we have data for the entire population, we can compute the population standard deviation over all data values. Calculate the standard deviation by taking the square root of the population variance. Population variance If we have data for the entire population, we can compute the population variance over all data values. To find the population variance, divide the sum of squares by the total number of elements in the population. Because we are using the entire population, we don't subtract 1 from the number of elements. Positive correlation The variables x and y are said to have positive correlation if low values of x are associated with low values of y and high values of x are associated with high values of y.
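The population variance and standard deviation entries above can be sketched in a few lines; note the divisor is N, not N − 1, because every member of the population is used. The data values are an illustrative assumed population.

```python
import math

# Sketch of the population variance: divide the sum of squared deviations
# from the population mean by N (not N - 1), since all N values are used.

def population_variance(data):
    N = len(data)
    mu = sum(data) / N  # population mean
    return sum((x - mu) ** 2 for x in data) / N

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative population of N = 8 values
var = population_variance(data)
sd = math.sqrt(var)  # population standard deviation
print(var, sd)  # 4.0 2.0
```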
Power of a test (1 - beta) The quantity 1 - β is called the power of the test and represents the probability of rejecting the null hypothesis when it is, in fact, false. Probability distribution A probability distribution is an assignment of probabilities to each distinct value of a discrete random variable or to each interval of values of a continuous random variable. Probability of an event A, P(A) Probability is a numerical measure between 0 and 1 that describes the likelihood that an event will occur. Probabilities closer to 1 indicate that the event is more likely to occur. Probabilities closer to 0 indicate that the event is less likely to occur. P(A), read "P of A," denotes the probability of event A. Probability of chance The P-value is sometimes called the probability of chance. Probability of failure In a binomial experiment, trials must result in a success or a failure. The probability of failure is simply 1 minus the probability of success. Probability of success In a binomial experiment, the probability of success during each individual trial is the same. P-value Assuming the null hypothesis is true, the probability that the test statistic will take on values as extreme as or more extreme than the observed test statistic (computed from sample data) is called the P-value of the test. The smaller the P-value computed from sample data, the stronger the evidence against the null hypothesis. Qualitative variable A qualitative variable describes an individual by placing the individual into a category or group, such as male or female. Quantitative variable A quantitative variable has a value or numerical measurement for which operations such as addition or averaging make sense. Quartile Quartiles are those percentiles that divide the data into fourths. The first quartile Q1 is the 25th percentile, the second quartile Q2 is the median, and the third quartile Q3 is the 75th percentile.
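The quartile entry above can be illustrated with a short sketch; conventions for splitting the data into halves vary between textbooks, so the half-splitting rule used here (exclude the middle value when the count is odd) is one common choice, not the only one.

```python
# Sketch of the quartile definitions: Q2 is the median of the whole data
# set; Q1 and Q3 are the medians of the lower and upper halves.

def median(s):
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def quartiles(data):
    s = sorted(data)
    n = len(s)
    q2 = median(s)
    lower = s[: n // 2]        # lower half (middle value excluded if n odd)
    upper = s[(n + 1) // 2 :]  # upper half
    return median(lower), q2, median(upper)

print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
```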
Quota problem A quota problem uses a binomial distribution to find the number of trials n that provide for a specified number of successes at a given probability.
Random sample Use a simple random sample from the entire population. Random variable A quantitative variable x is a random variable if the value that x takes on in a given experiment or observation is a chance or random outcome. Randomization Randomization is used to assign individuals to the two treatment groups. This helps prevent bias in selecting members for each group. Randomized block design In a randomized block experiment, individuals are first sorted into blocks, and then a random process is used to assign each individual in the block to one of the treatments. When we block experimental subjects or objects together based on a similar characteristic that might affect responses to treatments, we have a block design. The use of blocks can account for some of the most important sources of variability among the experimental subjects or objects. In this way, differences among the treatment groups are more likely to be caused by the treatments themselves rather than by other sources of variability. Random-number table A random-number table contains pre-generated random numbers. From a random starting point in the table, simply read off a selection of random numbers. Range The range is the difference between the largest and smallest values of a data distribution. Rank-sum test The rank-sum test (also called the Mann-Whitney test) is a nonparametric method for testing the difference between the sample means of two independent random samples. To use the rank-sum test, first combine the data from both samples, arrange the combined values in increasing order, and assign each a rank. The sum of the ranks of the data points in the smaller sample can be used to calculate a sample test statistic to determine if the two sample distributions are the same. Ratio level The ratio level of measurement applies to data that can be arranged in order. In addition, both differences between data values and ratios of data values are meaningful. Data at the ratio level have a true zero.
Raw score The raw score is the value of a random variable in a non-standard normal distribution. The raw score can be converted into a z score. Relative frequency The relative frequency of an event is its frequency divided by the number of total observations. Relative frequency of a class The relative frequency of a class is the proportion of all data values that fall into that class. To find the relative frequency of a particular class, divide the class frequency f by the total of all frequencies n (sample size). Relative-frequency histogram In relative-frequency histograms, we use bars to visually represent each class. The width of the bar is the class width, and the height of the bar is the relative frequency of that class. Relative-frequency table First make a frequency table. Then, for each class, compute the relative frequency f/n, where f is the class frequency and n is the total sample size. Replication Replication of the experiment on many subjects reduces the possibility that the differences between the two groups occurred by chance alone. Residual In a scatter diagram, the residual is another name for the unexplained deviation between the y value in a specified data pair (x, y) and the value predicted by the least-squares line for the same x. Residual plot One way to assess how well a least-squares line serves as a model for the data is a residual plot. To make a residual plot, we put the x values in order on the horizontal axis and plot the corresponding residuals y - y-hat in the vertical direction. If the least-squares line provides a reasonable model for the data, the pattern of points in the plot will seem random and unstructured about the horizontal line at 0. Resistant measure A resistant measure is one that is not influenced by extremely high or low data values. Response variable In a scatter diagram, we call y the response variable.
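The residual computation y − y-hat described above can be sketched directly; the line coefficients and data pairs below are assumptions chosen purely for illustration, not fitted values from any real data set.

```python
# Sketch of residuals: for each data pair (x, y), the residual is
# y - y_hat, where y_hat comes from the least-squares line y_hat = a + b*x.
# The coefficients and data pairs below are illustrative assumptions.
a, b = 1.0, 2.0
pairs = [(1, 3.2), (2, 4.9), (3, 7.1)]
residuals = [y - (a + b * x) for x, y in pairs]
print(residuals)  # small values scattered around 0
```

Plotting these residuals against x would give the residual plot described in the entry: a reasonable model shows no pattern about the horizontal line at 0.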
Right-tailed test A statistical test is right-tailed if the alternate hypothesis states that the parameter is greater than the value claimed in the null hypothesis.