Foundation of Quantitative Data Analysis
Part 1: Data manipulation and descriptive statistics with SPSS/Excel
HSRS #10 - October 17, 2013
Reference: A. Aczel, Complete Business Statistics, Chapters 1 and 2
Assignment #3: Replicate the classroom exercises.
D.B. Khang _ HSRS #10

Objectives

At the end of this lesson, you should be able to:
- Understand the role of statistical analysis in empirical research
- Use Excel and SPSS for data manipulation and the simplest statistical operations
- Recall the basic probability theory needed to interpret the findings of statistical analysis properly
Statistical Analysis

Data -> Information -> Knowledge -> Decisions and actions

Statistical analysis: a set of scientific methods used to analyze data in order to provide meaningful information for better understanding and decision making, through:
- An approximation of the real world
- Measurements of the errors of this approximation

Based on the data available and the purpose, statistical analysis may be classified as:
- Descriptive statistics: summarizing and presenting the (population or census) data in order:
  - To provide insights
  - To explain
  - To assess and evaluate
- Inferential statistics: analyzing the data available (from a sample, an experiment, etc.) to draw conclusions about a larger or unseen group (a population, future events, etc.) in order:
  - To estimate and predict
  - To test hypotheses
  - To provide insights and to explain

Types of data

Non-metric (or qualitative) data:
- Nominal: the size of the number is not related to the amount of the characteristic being measured
  - Refers to names or attributes only
  - Examples: brand, color, sex, profession, etc.
- Ordinal: larger numbers indicate more (or less) of the characteristic measured, but not how much more (or less)
  - Refers to ranking
  - Examples: ranks, preferences, age groups, social classes, etc.

Metric (or quantitative) data:
- Interval: has the ordinal properties and, in addition, equal differences between scale points
  - Examples: temperature, date, index number, etc.
- Ratio: has the interval-scale properties and, in addition, a natural zero point
  - Examples: length, counts, weight, sales, age, etc.

Notes:
- The level of the data is critical in determining the appropriate technique to use
- Statistics deals with all kinds of data, assuming that we have enough of them
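To illustrate the note that the level of the data determines the appropriate technique, here is a minimal Python sketch that picks a measure of central tendency by measurement level (mode for nominal, median for ordinal, mean for interval and ratio). The function name `central_tendency` and all data values are hypothetical and are not taken from the lecture.

```python
import statistics


def central_tendency(values, level):
    """Return a measure of central tendency suited to the measurement level."""
    if level == "nominal":
        # Codes are labels only, so the most frequent category is the summary.
        return statistics.mode(values)
    if level == "ordinal":
        # Order is meaningful but distances are not; median_low always
        # returns a value that was actually observed.
        return statistics.median_low(values)
    if level in ("interval", "ratio"):
        # Equal differences between scale points make the mean meaningful.
        return statistics.mean(values)
    raise ValueError(f"unknown measurement level: {level}")


# Hypothetical coded data, for illustration only.
brand_codes = [1, 3, 3, 2, 3]          # nominal: 1, 2, 3 are brand labels
preference_ranks = [1, 2, 2, 3, 5]     # ordinal: 1 = most preferred
weights_kg = [60.0, 72.5, 68.0, 80.5]  # ratio: natural zero point

print(central_tendency(brand_codes, "nominal"))
print(central_tendency(preference_ranks, "ordinal"))
print(central_tendency(weights_kg, "ratio"))
```

Averaging the brand codes would also produce a number, but it would have no meaning, which is the point of checking the level first.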
Storage of data for analysis

Good storage of raw quantitative data is essential for meaningful manipulation, summary, presentation and analysis.
- Most databases store data in table format
  - Rows are the data items or subjects
  - Columns are the measurements or values assigned to (collected for) the items: the variables
- Data stored in most databases is transferable

Basic data management skills to be developed through practice:
- Enter data into Excel and SPSS
- Provide explanations of variables and scores
- Transfer data between these two platforms
- Calculate new variables from the existing data

Practical tips:
- Data should be coded numerically
- Keep full documentation (meanings of the variables and of their values)
- Be consistent across data collection, storage and analysis
- Manipulating stored data is acceptable, but it should be transparent

Classroom exercise 1

Consider the data set HBAT.sav.
- Read the description of the data and try to understand the meaning of the variables in the data set.
- Identify the metric and the non-metric variables, and the meanings of the values of the variables.
- Save the file as an Excel file.
- Transfer the file back into an SPSS data file.
- Try to reformat both files for better readability.
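The storage ideas above (rows as subjects, columns as variables, numeric coding, documentation, computed variables, transfer between platforms) can be sketched in a few lines of Python. This is only an illustration under assumptions: the variable names, codes and values are hypothetical, and CSV stands in as a neutral interchange format; it is not the Excel/SPSS "Save As" route the exercise asks for.

```python
import csv
import io

# Hypothetical codebook: documents each variable and each numeric code.
codebook = {
    "id": "Respondent identifier",
    "sex": {1: "female", 2: "male"},       # nominal, coded numerically
    "score_a": "Rating on a 0-10 scale",   # metric
    "score_b": "Rating on a 0-10 scale",   # metric
}

# Rows are subjects, columns are variables (hypothetical values).
rows = [
    {"id": 1, "sex": 1, "score_a": 7.5, "score_b": 6.0},
    {"id": 2, "sex": 2, "score_a": 5.0, "score_b": 8.5},
    {"id": 3, "sex": 1, "score_a": 9.0, "score_b": 9.5},
]

# Calculate a new variable from existing ones, and document it so the
# manipulation stays transparent.
for row in rows:
    row["score_mean"] = (row["score_a"] + row["score_b"]) / 2
codebook["score_mean"] = "Mean of score_a and score_b (computed)"

# Write the table as CSV text, as if exporting it to another platform.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read it back. CSV stores everything as text, so types must be restored.
buffer.seek(0)
restored = [
    {
        "id": int(r["id"]),
        "sex": int(r["sex"]),
        "score_a": float(r["score_a"]),
        "score_b": float(r["score_b"]),
        "score_mean": float(r["score_mean"]),
    }
    for r in csv.DictReader(buffer)
]

print(restored == rows)
```

Note that the codebook does not travel with the CSV text, which is one reason to keep documentation alongside the data whenever it moves between platforms.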
Summarizing and presenting data

Most often, data should be summarized and presented in sensible ways that support our objectives (that is, to provide insights, to explain or to evaluate).

Options usually include:
- Presenting summarized distributions: frequency tables, percentiles
- Using measures of central tendency as representative statistics: averages, medians, modes
- Using measures of variability: ranges, variances, standard deviations, inter-quartile ranges
- Using other descriptive statistics: min, max, quartiles, skewness, kurtosis, etc.
- Using tabulations and cross tabulations
- Using graphs and diagrams: line graphs, bar charts, pie charts, frequency diagrams, histograms, box plots and other statistical graphs

Most of these are supported by Excel and SPSS.

Classroom exercise 2

- Apply the descriptive statistical tools of SPSS/Excel to the variables X18 and X19 of the HBAT data set and interpret the results.
- Draw a pie chart of X1 and a histogram of X19.
- Draw the scatter graph of X18 and X19 and interpret the results.
- Produce the frequency tables of X1 and X2 and interpret the results.
- Apply cross tabulation to X1 and X2 and interpret the results.
- Apply cross tabulation with two layers to X1, X3 and X4 and interpret the results.
- Copy the above tables into an Excel file for possible formatting.
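For readers who want to see what the descriptive tools compute, the following Python sketch produces the main summary statistics, a frequency table and a two-way cross tabulation using only the standard library. All values are hypothetical and are not taken from the HBAT data set; the variable names `scores`, `customer_type` and `industry` are illustrative only.

```python
import statistics
from collections import Counter

# Hypothetical data: one metric variable and two numerically coded
# non-metric variables, one value per subject.
scores = [5.4, 6.1, 6.8, 7.2, 7.2, 7.9, 8.3, 9.0]
customer_type = [1, 2, 2, 3, 1, 3, 3, 2]
industry = [0, 1, 0, 1, 0, 0, 1, 1]

# Central tendency, variability and other descriptive statistics.
summary = {
    "n": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "min": min(scores),
    "max": max(scores),
    "range": max(scores) - min(scores),
    "sample_variance": statistics.variance(scores),  # n - 1 denominator
    "sample_std_dev": statistics.stdev(scores),      # n - 1 denominator
    "quartiles": statistics.quantiles(scores, n=4),  # three cut points
}

# Frequency table: count of subjects per category.
frequency = Counter(customer_type)

# Cross tabulation: count of subjects per pair of categories.
crosstab = Counter(zip(customer_type, industry))

for name, value in summary.items():
    print(name, value)
print(dict(frequency))
print(dict(crosstab))
```

Quartiles are a known source of small discrepancies: statistical packages use different interpolation conventions, so the cut points printed here need not match those reported by SPSS or Excel for the same data.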
Classroom exercise 3

- Create in Excel and SPSS a new variable: Z19 = (X19 - μ)/σ, where μ is the mean of X19 and σ is the standard deviation of X19.
- Apply descriptive statistical tools to Z19 and interpret the results.
- Draw the histograms of X19 and Z19 and interpret the results.

Note: Z19 is called the standardized variable of X19.

Review of probability and distribution

Probability:
- Is defined on random events (occurrences)
- Takes values between 0 and 1
- Can be interpreted as the limit of relative frequency (objective probability)

Note: We often also use subjective probabilities, especially in decision making under uncertainty. Such probabilities simply express the extent of our belief in the occurrence of uncertain events. However, most of statistics relies on the objective interpretation, based on random sampling of data!

Random variable: the outcome of a measurement (or survey question) taken at random from a given population.
- Usually we have only sample values of the variable.
- A random variable can (only) be described by its distribution.
- The distribution of a random variable can be approximated from observed values using summary statistics, histograms, frequency tables or various charts.
- The distribution of a real random variable can also be approximated by theoretical distributions such as the normal, uniform, Student's t, chi-square, etc.

Notation and examples
- Probability: P(customer is from magazine industry) = 0.52
- Random variable: X19 = customer satisfaction score
- Combined: P(X19 >= 7.8) = ?
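The standardization in exercise 3 and the relative-frequency reading of a probability such as P(X19 >= 7.8) can both be sketched in Python. The data below are hypothetical, not HBAT values. The slide does not say whether σ is the sample or the population standard deviation; the sketch assumes the sample version (n - 1 denominator).

```python
import statistics

# Hypothetical satisfaction scores (not taken from the HBAT data set).
x = [5.4, 6.1, 6.8, 7.2, 7.2, 7.9, 8.3, 9.0]

# Standardize: subtract the mean, then divide by the standard deviation.
mu = statistics.mean(x)
sigma = statistics.stdev(x)  # assumption: sample standard deviation
z = [(value - mu) / sigma for value in x]

# The standardized variable has mean 0 and standard deviation 1,
# up to floating-point rounding.
print(statistics.mean(z), statistics.stdev(z))

# Relative frequency as an approximation of P(X >= 7.8): the share of
# observed values at or above the threshold.
threshold = 7.8
p_hat = sum(1 for value in x if value >= threshold) / len(x)
print(p_hat)
```

Standardizing shifts and rescales the values but leaves the shape of the distribution unchanged, which is why the histograms of a variable and of its standardized version look alike apart from the axis. With only eight observations the relative frequency is a rough estimate; it approaches the probability only as the number of randomly sampled observations grows.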
A small challenge

A two-headed coin, a two-tailed coin and an ordinary coin are placed in a bag. One of the coins is drawn at random and flipped; it comes up heads. What is the probability that there is a head on the other side of this coin?

Solution:
- There are 6 sides, of which 3 are heads: one from the ordinary coin and two from the two-headed coin. Call them H1, H2 and H3.
- Each side has an equal chance of coming up.
- If you see H1, the other side is a tail; if you see H2 or H3, the other side is a head.
- Given that a head is showing, it is equally likely to be H1, H2 or H3, so the probability that the other side is also a head is 2/3.
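The counting argument in the solution can be checked by listing all six equally likely (side up, side down) outcomes. The Python sketch below does this with exact fractions; the representation of each coin as a pair of letters is an illustrative choice.

```python
from fractions import Fraction

# Each coin is a pair of sides: two-headed, two-tailed, ordinary.
coins = [("H", "H"), ("T", "T"), ("H", "T")]

# Drawing a coin at random and flipping it makes each of the six
# sides equally likely to land face up.
outcomes = []
for coin in coins:
    for up_index in (0, 1):
        side_up = coin[up_index]
        side_down = coin[1 - up_index]
        outcomes.append((side_up, side_down))

# Condition on the observed event: a head is showing.
heads_up = [o for o in outcomes if o[0] == "H"]

# Among those, count the cases where the hidden side is also a head.
favorable = [o for o in heads_up if o[1] == "H"]

probability = Fraction(len(favorable), len(heads_up))
print(probability)  # 2/3
```

The common wrong answer of 1/2 comes from counting coins rather than sides: the two-headed coin contributes two of the three ways of seeing a head.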