Appendix III: SPSS Preliminary

SPSS is a statistical software package that provides tools for each stage of the analytical process: planning, data collection, data access and management, analysis, reporting, and deployment. How the program is started depends on the installation. Typically, it is started from Windows by double-clicking the appropriate icon or by choosing it from a menu.

SPSS is a window-based program built around three windows: the Data Editor, the Syntax Editor, and the Output Viewer. The Data Editor is typically the starting point for everything you do in SPSS for Windows. It is a spreadsheet-style interface in which you enter your data and specify the names of your variables. Once you have done this, you can specify the statistical analysis you want.

There are two ways of carrying out statistical analysis. One is to use the mouse to open menus and dialog boxes and choose options from them. The other is to open a new window, called the Syntax Editor, and write commands in the SPSS programming language that specify one or more analyses.

Once your commands have been typed into the Syntax Editor, you need to tell SPSS to execute (run) them. To run a single command, click anywhere on that command so that the cursor is located on it, and then click the Run button on the toolbar near the top of the screen; it is the button with a right-pointing arrow. To run several commands at once, highlight them first and then click the Run button. Alternatively, use the Run menu on the menu bar: clicking Run opens a pull-down menu with several options, including All (to run every command without highlighting anything), Selection (to run the highlighted commands), and Current (to run only the current command).
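As a minimal sketch of what the Syntax Editor might contain, the two commands below each end with a period, which is how SPSS tells where one command stops and the next begins. The variable names (size, edu, region) are the ones used later in this appendix and are assumed to exist in the open data file.

```spss
* Summary statistics for household size and education of the head.
DESCRIPTIVES VARIABLES=size edu.

* Frequency table for region.
FREQUENCIES VARIABLES=region.
```

With the cursor on the first command, choosing Current from the Run menu would execute only DESCRIPTIVES, while choosing All would execute both commands.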

Once the data have been entered and the analysis has been specified in one of these two ways, a new window (the Output Viewer) appears containing the results of your analyses.

1.1: Reading Text Data Files

When you have a data set such as a household survey, the first thing you need to do is convert the raw data into the format of SPSS, STATA, or whatever software you are using. Here we explain the basic steps for converting raw data in a simple text file (standard ASCII format) to SPSS format.

(i) Step 1: Select Read Text Data from the File menu and select a text data file to read. This opens the Text Wizard, and the data file is displayed in the preview window. In this first step you can apply a predefined format (one previously saved from the Text Wizard). In this example, you simply click Next, since you want to define a new format.
(ii) Step 2 provides information about variables. A variable is similar to a field in a database; for example, each item in a questionnaire is a variable. Fixed format means each variable is recorded in the same column location for every case. Delimited means that spaces, commas, tabs, or other characters are used to separate variables; the variables are recorded in the same order for each case but not necessarily in the same column locations.
(iii) Step 3 provides information about cases. A case is similar to a record in a database; for example, each respondent to a questionnaire is a case.
(iv) Step 4 displays the Text Wizard's best guess on how to read the data file and allows you to modify the way the Text Wizard will read variables from the data file.
(v) Step 5 controls the data format that the Text Wizard will use to read each variable and which variables will be included in the final data file.
(vi) Step 6 is the final step of the Text Wizard. You can save your specifications for reading similar text data files, and you can also paste and save the underlying command syntax. When you are ready to read the text data file, click Finish.

An example of syntax that converts a raw data file into SPSS format is shown below (the .... marks variable definitions omitted here):

    SET BLANKS=SYSMIS UNDEFINED=WARN.
    DATA LIST FILE='c:\lfs142.dat' FIXED RECORDS=1 TABLE
      /1 reg 1-1 cwd 2-3 area 4-4 blk 5-7 year 8-10 hh_no 11-12
         hh_type 13-13 no_hh_me 14-15 .... weight 136-142(2).
    EXECUTE.

1.2: Opening a Dataset

You open an SPSS dataset by selecting Open from the File menu. Alternatively, you can open a dataset by typing the following command in the Syntax Editor:

    GET FILE = 'C:\INTROPOV\DATA\Hh.SAV'.

The command specifies the directory path and the name of the saved data file. Unlike STATA, SPSS has no commands or menus such as set memory 20m (used in STATA to allocate more memory to a big dataset) or set matsize 100 (used in STATA to allow more variables).

1.3: Saving a Dataset

To save a dataset, choose Save or Save As from the File menu. You can also write a command in the Syntax Editor:

    SAVE OUTFILE = 'C:\INTROPOV\DATA\Hh.SAV'.
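A hedged sketch of how opening and saving can fit together in one syntax run: the data file is opened, and a copy is written under a different name so that the original stays untouched. The output file name Hh_copy.SAV and the variables listed on /KEEP are illustrative assumptions, not part of the original example.

```spss
* Open the household data file.
GET FILE='C:\INTROPOV\DATA\Hh.SAV'.

* Write a copy under a new (hypothetical) name, keeping only
* the variables region, size, and edu (assumed to exist).
SAVE OUTFILE='C:\INTROPOV\DATA\Hh_copy.SAV'
  /KEEP=region size edu
  /COMPRESSED.
```

Leaving out the /KEEP subcommand would write all variables to the copy.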

When you have made changes to the data file and want to save it, click Save on the File menu. This overwrites the existing file with the new version. If you want to keep the original file, choose Save As from the File menu and give the file a new name.

1.4: Exiting SPSS

To quit SPSS, select Exit from the File menu. You will be asked whether you want to save the changes you made to the open data file. Click Yes to save the changes, which overwrites the original file with the open file; otherwise click No.

2: Working with data files: looking at the content

2.1: Listing the variables

To see all variables in the data set, select File Info from the Utilities menu. The output file then shows each variable's name, label, type, missing values, and measurement level. Another option is to click the Variable View tab at the bottom of the Data Editor.

2.2: Defining and Labeling data

SPSS lets you define whether a variable is numeric or string. To define the type of a variable, first click the Variable View tab to display the Variable View. Click the Type cell in the row for the variable, and then click the button in the cell. Select the data type in the Define Variable Type dialog box; for example, to enter data values that contain letters, select String. Then click OK.
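The menu steps in 2.1 and 2.2 have syntax counterparts. As a sketch, DISPLAY DICTIONARY lists the dictionary information for the working data file, and STRING declares a new string variable. The variable name hhname and its width of 20 characters are hypothetical.

```spss
* List the dictionary information (names, labels, formats,
* missing values) for every variable in the working data file.
DISPLAY DICTIONARY.

* Declare a new string variable that can hold up to 20 characters.
STRING hhname (A20).
```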

In addition to defining the data type, you can define descriptive variable labels and value labels. These descriptive labels are used in statistical reports and charts. For example, you could assign the labels 'Urban' and 'Rural' to the numeric values 1 and 2. To define a variable label, click the Label cell in the row for the variable and enter the descriptive label. To specify value labels, make the Data Editor the active window. If the Data View is displayed, double-click the variable name at the top of the column in the Data View or click the Variable View tab. Click the button in the Values cell for the variable you want to define. For each value, enter the value and a label, and click Add to enter the value label. Click OK when you are finished.

Suppose you want to restrict the analysis to households with certain characteristics, for instance households headed by a female who is younger than 45. The commands in the Syntax Editor are:

    USE ALL.
    COMPUTE filter_$ = (sex = 2 & age < 45).
    VARIABLE LABEL filter_$ 'sex = 2 & age < 45 (FILTER)'.
    VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
    FORMAT filter_$ (f1.0).
    FILTER BY filter_$.
    EXECUTE.

More generally, Select Cases on the Data menu provides several methods for selecting a subgroup of cases based on criteria that include variables and complex expressions. You can also select a random sample of cases. The criteria used to define a subgroup can include:

- variable values and ranges
- date and time ranges
- case (row) numbers
- arithmetic expressions
- logical expressions
- functions

Cases that do not meet the selection criteria can be either filtered or deleted. Filtered cases remain in the data file but are excluded from analysis. Select Cases creates a filter variable, FILTER_$, to indicate filter status: selected cases have a value of 1 and filtered cases have a value of 0. Filtered cases are also marked with a slash through the row number in the Data Editor. To turn filtering off and include all cases in your analysis, select All cases. Deleted cases are removed from the data file and cannot be recovered if you save the data file after deleting them.

The commands above use both relational and logical operators. The operators are similar to the ones used in STATA. One difference is that SPSS uses = to test equality, whereas STATA uses ==. Most conditional expressions use one or more of the six relational operators (<, >, <=, >=, =, and ~=) on the calculator pad. Conditional expressions can include variable names, constants, arithmetic operators, numeric and other functions, logical variables, and relational operators.

2.3: Summarizing data

The Descriptives procedure displays univariate summary statistics for several variables in a single table and calculates standardized values (z scores). Variables can be ordered by the size of their means (in ascending or descending order), alphabetically, or by the order in which you select the variables (the default). When z scores are saved, they are added to the data in the Data Editor and are available for charts, data listings, and analyses. When variables are recorded in different units (for example, gross domestic product per capita and percentage literate), a z-score transformation places the variables on a common scale for easier visual comparison.

If we want to summarize household size and the education of the household head, the following command produces the results:

    DESCRIPTIVES VARIABLES=size edu
      /STATISTICS=MEAN STDDEV MIN MAX KURTOSIS.

The statistics available with this command include sample size, mean, minimum, maximum, standard deviation, variance, range, sum, standard error of the mean, and kurtosis and skewness with their standard errors.

Like STATA, SPSS allows us to use weights. The command for weighting by population is:

    WEIGHT BY POP.

Weight Cases gives cases different weights (by simulated replication) for statistical analysis. The values of the weighting variable should indicate the number of observations represented by single cases in your data file. Cases with zero, negative, or missing values for the weighting variable are excluded from analysis. Fractional values are valid; they are used exactly where this is meaningful, and most likely where cases are tabulated. Once you apply a weight variable, it remains in effect until you select another weight variable or turn off weighting. If you save a weighted data file, the weighting information is saved with the data file. You can turn off weighting at any time, even after the file has been saved in weighted form.

Unlike STATA, SPSS does not offer a choice among analytic weights, sampling weights, frequency weights, and importance weights. Note that when dealing with cross-sectional household surveys, weighting is important to correct for differences in sampling design and for data collection problems. The results can also be very different, as shown below.

Summary Statistics (without weighting)

                         N          Minimum  Maximum  Mean    Std. Deviation  Kurtosis  Kurtosis Std. Error
    Size of household    24747      1.00     7.00     3.4906  1.6050          -.512     .031
    Education of head    24747      1.00     4.00     2.3962  .8397           -.233     .031

Summary Statistics (with weighting by population)

                         N          Minimum  Maximum  Mean    Std. Deviation  Kurtosis  Kurtosis Std. Error
    Size of household    62385025   1.00     7.00     4.3027  1.5788          -.740     .001
    Education of head    62385025   1.00     4.00     2.2343  .7309           1.191     .001

In some cases, we need summary statistics for many sub-groups. For example, we may want to see the mean family size and the mean education of the household head by region. For this, SPSS does not require the data set to be sorted by region. Select Analyze, Compare Means, and then Means from the menu. The equivalent command lists the dependent variables first and the grouping variable after BY:

    MEANS TABLES=size edu BY region
      /CELLS MEAN COUNT STDDEV SEMEAN MIN MAX VAR.

The Means procedure calculates subgroup means and related univariate statistics for dependent variables within categories of one or more independent variables. Optionally, you can obtain a one-way analysis of variance, eta, and tests for linearity. The available statistics are: sum, number of cases, mean, median, grouped median, standard error of the mean, minimum, maximum, range, variable value of the first category of the grouping variable, variable value of the last category of the grouping variable, standard deviation, variance, kurtosis, standard error of kurtosis, skewness, standard error of skewness, percentage of total sum, percentage of total number of observations, percentage of sum in and percentage of number of cases in a given category, geometric mean, and harmonic mean. Options include analysis of variance, eta, eta squared, and tests for linearity (R and R squared).

2.4: Frequency distributions

The Frequencies procedure provides statistics and graphical displays that are useful for describing many types of variables. For a first look at your data, the Frequencies procedure is a good place to start. For a frequency report and bar chart, you can arrange the distinct values in ascending or descending order, or order the categories by their frequencies. The frequencies report can be suppressed when a variable has many distinct values. You can label charts with frequencies (the default) or percentages.

Suppose we want to see whether there is any disparity in educational level across the regions of a country. We use the following command:

    FREQUENCIES VARIABLES=region edu
      /ORDER = ANALYSIS.

Alternatively, choose Analyze, Descriptive Statistics, and then Frequencies from the menu bar. The Statistics and Charts options provide frequency counts, percentages, cumulative percentages, mean, median, mode, sum, standard deviation, variance, range, minimum and maximum values, standard error of the mean, skewness and kurtosis (both with standard errors), quartiles, user-specified percentiles, bar charts, pie charts, and histograms.

2.5: Missing Values in SPSS

As in STATA, a missing value is shown by a dot (.). Descriptive procedures do not take missing values into account: cases with missing values are excluded from the calculations. The number of included and excluded cases is displayed in the output file, as shown below.

Case Processing Summary

                              Cases
    Included            Excluded            Total
    N       Percent     N       Percent     N       Percent
    24747   100.0%      0       .0%         24747   100.0%

2.6: Counting observations

To count the number of observations in the data set, select Compare Means from the Analyze menu, open the Means procedure, and choose the Number of Cases statistic. Or use the following command in the Syntax Editor:

    MEANS TABLES=region BY edu
      /CELLS COUNT.

The above command reports the number of observations at each educational level (counting the cases that have a valid value for region).

3: Working with data files: changing dataset

3.1: Generating new variables

You can create new variables or replace the values of existing variables. For new variables, you can also specify the variable type and label. You can compute values selectively for subsets of data based on logical conditions. Compute Variable computes values for a variable based on numeric transformations of other variables. You can compute values for numeric or string (alphanumeric) variables. You can use over 70 built-in functions, including arithmetic functions, statistical functions, distribution functions, and string functions.

From the menus choose Transform and then Compute. Type the name of a single target variable. It can be an existing variable or a new variable to be added to the working data file. To build an expression, either paste components into the expression field or type directly in it. Paste functions (for example, descriptive statistics such as mean, sum, maximum, and minimum, or cumulative and inverse distribution functions) from the function list and fill in the parameters indicated by question marks. String constants must be enclosed in quotation marks or apostrophes. Numeric constants must be typed in American format, with the period (.) as the decimal indicator.

To create a new variable oldhead, the following commands serve the purpose:

    COMPUTE oldhead = 0.
    EXECUTE.
    IF (age > 32) oldhead = 1.
    EXECUTE.

The first command generates a variable called oldhead (an indicator for household heads who are old) and sets its value to zero for all observations. The second command sets the variable to one wherever the condition (the household head is older than 32 years) is satisfied.

Note:
(i) The COMPUTE command in SPSS is equivalent to generate in STATA. To change an existing variable, STATA requires the replace command, whereas COMPUTE in SPSS simply overwrites the existing variable.
(ii) Like STATA, SPSS variable names should not exceed 8 characters.
(iii) STATA has a command egen, which is an extension of the generate command. Its powerful feature is that it can store a descriptive statistic such as a mean, sum, or maximum, computed over all observations in the data set, as a variable that has the same value for every observation. Unfortunately, SPSS does not have such a feature. To get the same outcome in SPSS as the STATA command egen spop = sum(weight*size), you need to compute spop in another file and then merge it with the current working file.

    COMPUTE one = 1.
    EXECUTE.
    AGGREGATE
      /OUTFILE = 'C:\AGGR.SAV'
      /BREAK = one
      /SPOP = SUM(POP).
    MATCH FILES /FILE = *
      /FILE = 'C:\AGGR.SAV'
      /BY one.
    EXECUTE.

Once the variable called spop has been generated, you can fill the same value into the remaining observations by using Replace Missing Values from the Transform menu:

    RMV /spop = SMEAN(spop).

3.2: Producing Graphs

SPSS can produce basic graphs. To produce a graph, select Graphs from the menu bar and select the type of graph you want from the Graphs menu. Choose the icon for the specific type of chart you want. You also need to indicate how your data are organized. Suppose that we want to create a clustered bar chart for groups of cases. Make these choices and click Define.

To create a clustered bar chart, you need to select a category variable and a cluster variable. For instance, suppose we want to show the number of people of each gender at each education level. Select the edu variable as the category axis variable, select gender as the cluster variable, and click OK. This produces a clustered bar chart of education category by gender. The equivalent command is:

    GRAPH
      /BAR(GROUPED)=COUNT BY edu BY gender
      /MISSING = REPORT
      /TITLE = 'Educational Levels by Gender'.
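The GRAPH command takes other chart types and summary functions in the same form. As a hedged variation on the clustered chart, the sketch below draws a simple (non-clustered) bar chart of mean household size by education level. It assumes that the size and edu variables used earlier in this appendix are in the working data file; the title text is illustrative.

```spss
* Simple bar chart: mean household size within each education level.
GRAPH
  /BAR(SIMPLE)=MEAN(size) BY edu
  /TITLE='Mean Household Size by Educational Level'.
```

Like the clustered chart, a chart produced this way is drawn in the output window.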

The clustered bar chart appears in the Output file.

[Figure: clustered bar chart titled "Educational Level by Gender". Horizontal axis: Educational level (Primary, Secondary, Upper secondary, University). Vertical axis: Count (0 to 12000). Clusters: GENDER (Male, Female).]

Once you have created a graph, there are many attributes you can edit to change its appearance. You can change the title, labeling, fonts, or colors. You can also delete categories, change the scale axis range, and swap axes. The type of graph can also be changed from one to another (for example, from a bar chart to a pie chart). To edit a graph, double-click the graph you want. This displays the graph in the SPSS Chart Editor window. You can edit the graph from the menus, from the toolbar, or by double-clicking the object you want to edit. For example, to change the title or the axis labels, select the title or axis options from the Chart menu in the Chart Editor window.

3.3: Combining Data sets

How do you merge one data set with another? Add Variables in the Data menu merges the working data file with an external data file that contains the same cases but different variables. For example, you might want to merge a data file that contains each member's characteristics within the household (the member's file in a household survey) with one that contains information on the head of household (the household file in a household survey). Cases must be sorted in the same order in both data files. If one or more key variables are used to match cases, the two data files must be sorted in ascending order of the key variable(s).

Variable names in the second data file that duplicate variable names in the working data file are excluded by default, because Add Variables assumes that these variables contain duplicate information. The list of variables excluded from the new merged data file therefore contains, by default, any variable names from the external data file that duplicate variable names in the working data file. Variables from the working data file are identified with an asterisk (*). Variables from the external data file are identified with a plus sign (+). If you want to include an excluded variable with a duplicate name in the merged file, you can rename it and add it to the list of variables to be included.

If some cases in one file do not have matching cases in the other file (that is, some cases are missing in one file), use key variables to identify and correctly match cases from the two files. You can also use key variables with table lookup files. The key variables must have the same names in both data files. Both data files must be sorted in ascending order of the key variables, and the order of variables on the Key Variables list must be the same as their sort sequence. Cases that do not match on the key variables are included in the merged file but are not merged with cases from the other file. Unmatched cases contain values only for the variables in the file from which they are taken; variables from the other file contain the system-missing value.

Example: Merge file 1 with file 2. In working file 1, sort the cases in ascending order by the key variables contained in both files. Given that the key variables are id1 to id6, the commands are shown below. After sorting and saving file 1, merge it with file 2 to add the variables in which you are interested.

    SORT CASES BY id1 (A) id2 (A) id3 (A) id4 (A) id5 (A) id6 (A).
    MATCH FILES /FILE=*
      /FILE='C:\file2.sav'
      /BY id1 id2 id3 id4 id5 id6.
    EXECUTE.

How do you append data sets? That is, how do you merge files that have the same variables but different cases? Add Cases, under Merge Files in the Data menu, merges the working data file with a second data file that contains the same variables but different cases. For example, you might record the same information for households in two different regions and maintain the data for each region in separate files. The steps involved in appending another file to the working file are as follows:

(a) Open one of the data files. The cases from this file will appear first in the new, merged data file.
(b) From the menus choose Data, Merge Files, and then Add Cases.
(c) Select the data file to merge with the open data file.
(d) Remove any variables you don't want from the Variables in New Working Data File list.
(e) Add any variable pairs from the Unpaired Variables list that represent the same information recorded under different variable names in the two files. For example, date of birth might have the variable name brthdate in one file and datebrth in the other file.

The equivalent command is:

    ADD FILES /FILE = *
      /FILE = 'C:\INTROPOV\DATA\FILE1.SAV'
      /RENAME (brthdate = datebrth).
    EXECUTE.
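For the member's file and household file case mentioned at the start of this section, several members share one household record, so the household file can act as a table lookup file. A minimal sketch, assuming a single key variable hhid and a household file saved as C:\household.sav (both names are hypothetical):

```spss
* The working file holds one case per household member.
SORT CASES BY hhid (A).

* /TABLE marks the household file as a lookup table, so its values
* are copied to every member case with the same hhid.
* The household file is assumed to be sorted by hhid already.
MATCH FILES
  /FILE=*
  /TABLE='C:\household.sav'
  /BY hhid.
EXECUTE.
```

With /FILE in place of /TABLE, the household values would not be spread across all members of the household, which is why the lookup form suits this one-to-many case.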

3.4: Aggregating data

The Aggregate command on the Data menu combines groups of cases into single summary cases and creates a new aggregated data file. Cases are aggregated based on the value of one or more grouping variables. The new data file contains one case for each group. Suppose that we want to compute the total income of each household. We then have to aggregate each source of income (for example, wages and salaries, or income from interest and dividends) over all members living in the same household. In this case the key variable is the household, and the aggregate variables are the sums of each income component earned by the members of the household, which together give the total income of the household.

Cases are grouped together based on the values of the break variables. Each unique combination of break variable values defines a group and generates one case in the new aggregated file. All break variables are saved in the new file with their existing names and dictionary information. A break variable can be either numeric or string.

Variables are used with aggregate functions to create the new variables for the aggregated file. By default, Aggregate Data creates new aggregate variable names using the first several characters of the source variable name followed by an underscore and a sequential two-digit number. The aggregate variable name is followed by an optional variable label in quotes, the name of the aggregate function, and the source variable name in parentheses. Source variables for aggregate functions must be numeric. You can override the default aggregate variable names with new variable names, provide descriptive variable labels, and change the functions used to compute the aggregated data values. You can also create a variable that contains the number of cases in each break group. The following is the Aggregate command syntax.
    AGGREGATE OUTFILE=file
      [/MISSING=COLUMNWISE] [/DOCUMENT] [/PRESORTED]
      /BREAK=varlist[({A})] [varlist]
                      {D}
      /aggvar['label'] aggvar['label'] = function(arguments)
      [/aggvar ]
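As a concrete, hedged instance of the syntax above, the sketch below sums member incomes within each household and counts the members. The source variable names hhid and income, the new variable names totinc and nmemb, and the output path are assumptions made for illustration.

```spss
* One output case per household (break variable hhid).
AGGREGATE
  /OUTFILE='C:\hhincome.sav'
  /BREAK=hhid
  /totinc 'Total household income' = SUM(income)
  /nmemb 'Number of members' = N.
```

The N function takes no source variable; it returns the number of cases in each break group.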

3.5: Splitting file

Split File splits the data file into separate groups for analysis based on the values of one or more grouping variables. If you select multiple grouping variables, cases are grouped by each variable within categories of the prior variable on the Groups Based On list. For example, if you select gender as the first grouping variable and occupation as the second grouping variable, cases will be grouped by occupational classification within each gender category.

You can specify up to eight grouping variables. Each eight characters of a long string variable (a string variable longer than eight characters) counts as one variable toward the limit of eight grouping variables. Cases should be sorted by the values of the grouping variables, in the same order that the variables are listed in the Groups Based On list. If the data file is not already sorted, select Sort the file by grouping variables.

The output can be organized in two ways. With the Compare groups option, split-file groups are presented together for comparison purposes: for pivot tables, a single pivot table is created and each split-file variable can be moved between table dimensions; for graphs, a separate graph is created for each split-file group and the graphs are displayed together in the output file. With the Organize output by groups option, all results from each procedure are displayed separately for each split-file group.

The Split File command syntax is:

    SPLIT FILE {OFF                           }
               {[{LAYERED }] [BY varlist]}
                 {SEPARATE}
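A minimal sketch of the command in use, assuming the gender, size, and edu variables from the earlier examples are in the working data file:

```spss
* Cases must be in grouping-variable order before the file is split.
SORT CASES BY gender.
SPLIT FILE LAYERED BY gender.

* Reported for each gender group, laid out together for comparison.
DESCRIPTIVES VARIABLES=size edu.

* Return to analyzing all cases as one group.
SPLIT FILE OFF.
```

Replacing LAYERED with SEPARATE would instead produce a separate block of output for each gender group. Splitting stays in effect until SPLIT FILE OFF is run, so it is worth turning it off explicitly once the grouped analysis is done.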