ETH Zürich, Seminar für Statistik

Introduction Course in SPSS - Evening 1

All data used during the course can be downloaded from the following FTP server: ftp://stat.ethz.ch/u/sfs/spsskurs/

1 Statistical Data Analysis

Statistical data analysis usually consists of three steps:

1. Data Preparation: After reading in the raw data, we need to clean and prepare it for analysis, i.e. check values, select the variables of interest and (possibly) construct new variables.
2. Exploratory Data Analysis: Visualization of the data (scatterplots, boxplots, barplots etc.) and computation of characteristic values (mean, min, max or standard deviation). This step is very important for getting an overview of the data and for recognizing irregularities.
3. Inferential Statistics: Testing hypotheses. Examples: t-test, regression, analysis of variance, survival analysis.

Be careful: All statistical methods rest on assumptions, which should be satisfied before the test result can be trusted.

2 Getting Started with SPSS

2.1 Menus

First, a short overview of the menus:

File: Read in data files (Open, Open Database, Read Text Data); later you can Save or Print the data or the output.
Edit: Insert new variables or cases, or search for them (Insert Variable, Insert Cases, Find, Go to Case, Go to Variable); also all kinds of other SPSS Options (font, font size, pivot tables, language etc.).
View: Display of the data in the Data View: Grid Lines, Value Labels.
Data: Data preparation (Sort Cases, Restructure, Select Cases).
Transform: Construct new variables (Compute Variable) or modify existing ones (Recode into Same Variables).
Analyze: Statistical methods.
Graphs: Graphics, both interactive (Chart Builder) and static (Legacy Dialogs).
Window: Management of SPSS windows.
Help: Topics (search), Tutorials (manual), Case Studies (explanation of statistical menus and interpretation of outputs), Command Syntax Reference (all syntax commands). Manuals can be found on the SPSS homepage or via Google.

2.2 Data View, Variable View

When SPSS is started, it first shows a data table, the so-called Data Editor. The Data Editor consists of two sheets: Data View and Variable View.

For demonstration, first download the file demo.sav from the course homepage. We can open this SPSS data file (file extension .sav) via File / Open / Data. The file contains an artificial data set from a company that sends monthly advertisements to potential customers. The variable response indicates whether a customer reacted to the promotion. In addition, there is a lot of personal and demographic information about the customers.

The Data View shows the data. It is the default view. Like Excel, SPSS is row-oriented, i.e. there is one observation per row and every column represents a variable.

The Variable View shows information about the variables (variable type, length etc.). Each row lists one variable and its properties:

Name: Special characters are not allowed in names (e.g. exclamation marks, question marks or spaces).
Type: Type of the variable. The most important types are numeric (numbers), date, dollar and string (words).
Width: Maximum number of characters (only useful for string variables).
Decimals: Maximum number of decimals (only useful for numeric variables).
Label: Full name of the variable (the name used in graphics).
Values: Here you can store the coding of factors (e.g. gender: 0 = male, 1 = female).
Missing: How did you code missing values? By default, SPSS treats only empty cells as missing. If you coded missing values with e.g. 99, you have to specify this here.
Columns: Width of the column.
Align: Alignment of the content in the cells (right, left, centered).
Measure: There are three measurement levels for variables: scale (continuous variables like age or body weight), nominal (categorical variables, e.g. treatment groups) and ordinal (categorical variables with a natural ordering, like income class or age groups).

In both sheets, information about your data can be added, modified and deleted. After reading in data, you should always check that all information in the Variable View is specified correctly. This is very important for the statistical analysis.

2.3 Output Window

The output window functions as a logbook. Every action and analysis result is printed in the form of a table, plot or syntax. The output can be edited (menu Insert): you can add titles, change fonts etc. At the end of an analysis you can save the output (File / Save). Tables can be exported to Excel or as .pdf; similarly, graphics can be exported as .jpg or .pdf (right click and Export).

3 Reading Data

Very often, data is not stored directly as a .sav file (the data format of SPSS). You may have saved your data in Excel, as a text file or in a database format. Therefore, we now discuss how to read in data stored in a non-SPSS format.

3.1 .dat or .txt Data

  File / Read Text Data... (sprintbiometr.dat)

First select the directory where the data is saved. Be aware that by default SPSS shows only .sav files, so you have to change the file type to All Files. A dialog window opens. Step by step, SPSS asks about the structure of the data in the file, e.g. "Are the variable names stored in the first row of the file?" and "Which delimiter is used between columns?". A preview window helps you decide whether you entered the properties of the file correctly. It is important that every column represents one variable and each row one observation.

3.2 .xls Data

  File / Open... (ozon.xls)

In the second dialog, enter the name of the Excel worksheet and the range of columns and rows that you want to analyze in SPSS. There are some rules for .xls data that help you avoid problems.
Variable names should be placed only in the first line of the Excel worksheet.
The first line of data should start directly after the line with the variable names.
The use of formulae in the worksheet is discouraged.
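
The two questions asked by the text import dialog in Section 3.1 (are the variable names in the first row, and which delimiter separates the columns) can be illustrated outside SPSS. The following is only a minimal Python sketch, not part of SPSS: the file content, the function name read_delimited and the placeholder names V1, V2, ... are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file content, invented for illustration only;
# it is not the course file sprintbiometr.dat.
raw_text = "name;sex;age\nAnna;f;15\nBen;m;17\n"

def read_delimited(text, delimiter, names_in_first_row=True):
    """Return (variable_names, cases) parsed from delimited text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if names_in_first_row:
        return rows[0], rows[1:]
    # Without a header row, generate placeholder names V1, V2, ...
    names = [f"V{i + 1}" for i in range(len(rows[0]))]
    return names, rows

names, cases = read_delimited(raw_text, delimiter=";")

# Every row (observation) must supply exactly one value per column (variable).
assert all(len(case) == len(names) for case in cases)

print(names)
print(cases)
```

Choosing the wrong delimiter (for example a comma here) would leave each row as a single column, which is the kind of mistake the preview window is meant to reveal.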

3.3 Enter Data in SPSS

It is also possible to enter your data directly in SPSS:

  File / New / Data

After entering the data, you can save it as a .sav file:

  File / Save...

If you enter data directly in SPSS, you have to specify all the variable properties manually. There is a dialog window that simplifies this task:

  Data / Define Variable Properties

4 Data Preparation

4.1 Menu Data

Once the raw data has been entered in SPSS, it should be prepared for analysis. SPSS provides several tools in the menu Data. Example: sprintzeit.sav

Sort Cases
Sorts the observations, here by name and run:

  Data / Sort Cases: name, lauf

Restructure
Restructure is a very powerful tool that can create new columns out of rows and rows out of columns. It is particularly useful for repeated measures data. Save the data before restructuring! Example: the variable time and the index variable run should be restructured into two variables that show the time of run 1 and the time of run 2.

  Data / Restructure
  Restructure selected cases into variables
  Identifier: name
  Index: lauf
  No further options

Merge Files
With this tool, data from several files can be combined. For example, there could be one file with measurements of pollution and one file with meteorological measurements. In order to combine the files correctly, you need a variable that appears in both files, e.g. an ID or a date. Example in Exercise.

Aggregate
The goal of aggregating data is to summarize information at the subgroup level. To this end, characteristic values are calculated from the observations within each subgroup. The new data file will contain information at the subgroup level only. For the aggregation, SPSS provides several functions such as mean, standard deviation, max, min etc. Example: Aggregate the sprint data for boys and girls by calculating the average sprint time (sprint.sav).
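
To make the idea of aggregation concrete, here is a minimal Python sketch of what happens to the data; it is not SPSS code. The variable names sex, zeit1 and zeit2 follow the example, while the data values, the function name aggregate_mean and the output names zeit1_mean and zeit2_mean are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sprint times in seconds, invented for illustration;
# these are not the values in sprint.sav.
cases = [
    {"sex": "f", "zeit1": 13.0, "zeit2": 12.5},
    {"sex": "f", "zeit1": 14.0, "zeit2": 13.5},
    {"sex": "m", "zeit1": 12.0, "zeit2": 12.5},
    {"sex": "m", "zeit1": 13.0, "zeit2": 13.5},
]

def aggregate_mean(rows, break_var, summary_vars):
    """Return one row per subgroup containing the mean of each summary variable."""
    # Collect the observations of each subgroup.
    groups = defaultdict(list)
    for row in rows:
        groups[row[break_var]].append(row)

    # Reduce every subgroup to a single row of means.
    result = []
    for key, members in sorted(groups.items()):
        summary = {break_var: key}
        for var in summary_vars:
            summary[f"{var}_mean"] = mean(m[var] for m in members)
        result.append(summary)
    return result

aggregated = aggregate_mean(cases, "sex", ["zeit1", "zeit2"])
for row in aggregated:
    print(row)
```

The four observations collapse into two rows, one per value of the break variable, which is why the aggregated file contains subgroup-level information only.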

  Data / Aggregate
  Break: sex
  Summaries: zeit1, zeit2
  Function: mean
  Create a new data set: aggr.sav

Split File
Split File allows you to analyze the data separately for groups. The output is sorted by groups. Example: Calculate the mean separately for girls and boys.

  Data / Split File...
  Organize output by groups
  Analyze / Descriptive Statistics / Frequencies...
  Statistics: mean

While the split is active, you can see "split" in the bottom right corner of SPSS's Data Editor. All analyses will be split by gender until you remove the split:

  Data / Split File...
  Analyze all cases

Select Cases
Selects data according to some conditions. Example: We want to compute the mean only for persons older than 16.

  Data / Select Cases...
  If condition is satisfied: alter > 16

To select the matching cases, SPSS computes a filter variable (0 = not selected, 1 = selected), which is added as a new column to your data. Furthermore, all non-selected observations are crossed out in the Data View. If you want to exclude single observations, you can use the system variable $Casenum (the case number).

Assign Weights to Cases
Weighting of observations. Example in Exercise: Chi-Squared Test.

4.2 Menu Transform

Compute
New variables can be constructed. Example: average time or best running time (sprint.sav):

  Transform / Compute Variable... / meanTime = (zeit1+zeit2)/2
  Transform / Compute Variable... / minTime = min(zeit1,zeit2)

Recode
With Recode you can rename the levels of existing categorical variables or transform a scale variable into a categorical one. Example: Construct a new variable Speed Cat with three levels (<13, 13-14, >14).
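
The Compute formulas and the recoding rule of this example can be written out as a minimal Python sketch; it is not SPSS code. The data values are hypothetical, the spelling SpeedCat (without a space) is an assumption, and so is the treatment of the boundaries: the text does not say to which level the values 13 and 14 themselves belong, and they are counted as Medium here.

```python
# Hypothetical sprint times in seconds, invented for illustration;
# these are not the values in sprint.sav.
cases = [
    {"name": "A", "zeit1": 12.0, "zeit2": 13.0},
    {"name": "B", "zeit1": 13.5, "zeit2": 14.0},
    {"name": "C", "zeit1": 15.0, "zeit2": 14.5},
]

def speed_cat(mean_time):
    """Recode a mean time into the three levels of the example."""
    if mean_time < 13:
        return "Fast"
    if mean_time <= 14:  # assumption: 13 and 14 themselves count as Medium
        return "Medium"
    return "Slow"

for case in cases:
    # Compute: two new variables derived from the existing ones.
    case["meanTime"] = (case["zeit1"] + case["zeit2"]) / 2
    case["minTime"] = min(case["zeit1"], case["zeit2"])
    # Recode: turn the scale variable meanTime into a categorical one.
    case["SpeedCat"] = speed_cat(case["meanTime"])

for case in cases:
    print(case["name"], case["meanTime"], case["minTime"], case["SpeedCat"])
```

Whatever boundary rule is chosen, every possible value should fall into exactly one level; otherwise some cases end up without a category.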

  Transform / Recode into Different Variables
  Input: meanTime
  Output: Speed Cat
  Old Value: <13     New Value: Fast
  Old Value: 13-14   New Value: Medium
  Old Value: >14     New Value: Slow

Visual Binning
Transformation of a scale variable into a categorical variable. The ranges of the new categories can be determined by hand (histogram) or with predefined functions such as quantiles or a fixed interval length. Example: see Exercise.

5 Descriptive Analysis and Graphics

If we work with nominal or ordinal variables, they can be nicely summarized in frequency tables. Example: Frequency table of sex (sprint.sav)

  Analyze / Descriptive Statistics / Frequencies...
  Variables: Sex
  Charts: Simple Bar

Typical characteristic values for scale variables are the mean, the standard deviation and quantiles. You can also find them in the Analyze / Descriptive Statistics menu.

  Analyze / Descriptive Statistics / Frequencies...
  Variables: Alter, zeit1, zeit2
  Statistics: mean, var, quartile
  Uncheck: Display frequency tables

Another typical question in descriptive analysis concerns correlation. For example, we want to analyze whether there is any correlation between time1 and time2 (the variables zeit1 and zeit2). Be careful when interpreting a correlation without a graphical illustration. The course slides show examples of point clouds with very different shapes that all have a Pearson correlation of 0.7. Therefore, we also draw a scatterplot of time1 vs. time2.

In SPSS there are two ways to produce graphics: you can use either the new interactive graphics menu or the old legacy dialogs.

Interactive menu: Graphs / Chart Builder
The Gallery shows a preview of the various graphics that can be generated in this menu. The preview does not show the real data. You see the true plot only in the SPSS output window, after finishing the layout.

Drag a graphic from the gallery to the chart preview at the top (here: scatterplot/dots).
Drag the variables from the left side into the chart preview and place them on the x or y axis (here: x = zeit1 and y = zeit2).

In addition to the main window, another window appears, called Element Properties. Here you can change the bar style, the axis limits etc.

Legacy plots: Graphs / Legacy Dialogs / Scatter/Dot / Simple Scatter
  X-Axis: zeit1
  Y-Axis: zeit2

Edit graphics: The graphic can be modified at any time with a double click, which opens a chart editor with a lot of options. Many more options become available after double clicking an element (element = x-axis, points, lines, titles etc.). A separate property window opens for every element, and the available options differ from element to element. The chosen element is highlighted in yellow in the chart editor. After the chart editor is closed, all your modifications are applied to the original graphic in the output window. Example: after a double click on the x-axis, we can change the thickness, style and color of the line. The graphic can be exported as .jpg or .pdf by right click and Export.

After examining our two variables graphically, we now calculate the correlation:

  Analyze / Correlate / Bivariate...
  Variables: zeit1, zeit2
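
For readers who want to see what the Pearson correlation coefficient measures, the following minimal Python sketch computes it by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. It is not SPSS code; the function name pearson and the data values are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of the deviations from the two means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations of each variable.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical sprint times in seconds, invented for illustration;
# these are not the values in sprint.sav.
zeit1 = [12.0, 13.0, 14.0, 15.0]
zeit2 = [12.5, 13.0, 14.5, 15.0]

r = pearson(zeit1, zeit2)
print(round(r, 3))
```

The coefficient only summarizes the strength of a linear relationship in a single number, which is why the scatterplot should always be inspected alongside it.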