Systat: Statistical Visualization Software
Hilary R. Hafner, Jennifer L. DeWinter, Steven G. Brown, Theresa E. O'Brien
Sonoma Technology, Inc., Petaluma, CA
Presented in Toledo, OH, October 28, 2011
STI-910019-3946
Topics to Cover
- Systat basics
  - What is Systat?
  - Why use Systat?
  - Overview of the user interface
  - Resources
  - Command language vs. menus
- Importing data
  - Accepted file types
  - Formatting
  - Limitations
  - Tips and tricks
- Analysis tools
  - Graphs and analyses
  - Statistics
- Data manipulation
  - Command language vs. menus
  - Creating variables, appends/merges, transformations, selections, and grouping
- Saving output
- Graph customizations
- Advanced graphs and analyses
  - Regression, significance tests, nonparametric tests, factor analysis, cluster analysis, and analysis of variance (ANOVA)
TOPICS
What Is Systat?
Systat is statistical and graphical analysis software that lets you explore your data using both menus and a batch command language (similar to macros).
[Figure: example Systat plots; recoverable axis labels are TNMOC, BENZW, HOUR, YEAR (1994, 1995), and Count]
INTRODUCTION
Why Use Systat?
- In data analysis, we nearly always need to investigate central tendencies, correlations, trends, and other statistical descriptions of data.
- Systat's graphical interface allows the analyst to see the data immediately and to rapidly generate and regenerate graphs for review.
- Systat contains statistical functions not found in Excel or Access.
Systat Basics: Graphical User Interface
- Viewspace
- Workspace
- Commandspace
Systat Basics: File Types
- Data (filename.syd, .syz)
- Output (filename.syo)
- Command (filename.syc)
Systat Basics: Resources
- Help is a click away
  - Index
  - Search
  - Mouse-overs, F1 key
  - "?" button
  - Command line
- Manuals
- Examples
- Training videos at http://www.systat.com/downloads/
  - Useful: Interface, data, graph, help
Command Language vs. Menus
- Systat is a Windows menu-driven package, but the command language provides full coverage of the menus.
- Commands are useful for repetitive analyses (and we almost never do anything just once!).
- Commands help the analyst document which analyses have been performed and where the output is stored.
- Commands can be reused in future analyses.
- The log window in Systat records most actions.
- Commands = faster!
Importing Data into Systat
- Accepted file formats
- Limitations
- Data formatting
- Tips and tricks
IMPORTING DATA
- $ signifies a text field.
- "." signifies missing data.
- Names of more than one word require an underscore character.
- Text fields are left-justified.
Tricks and Tips with Excel
- Data sets can be processed in Excel before bringing them into Systat.
- Make date/time conversions and calculations in Excel (convert date/time into separate fields for day of week, month, day, year, etc.).
- Prepare sums and other calculations that are easily performed in Excel.
- Copy/paste values to remove all formulae.
- Check that records are continuous.
- Replace missing values (e.g., -999) with "." (Systat's missing value code).
- Save as Excel (designate by NAME_sys.xls).
- Note that only one worksheet of a workbook can be selected per import.
- Hot tip: Systat doesn't like the variable name temp.
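The same preparation can also be scripted outside Excel. The following is a minimal sketch in Python, not a Systat or Excel feature; the column names (DATETIME, O3, NO2), the timestamp format, and the set of missing-value sentinels are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

# Assumption: these strings mark missing data in the raw file.
MISSING_SENTINELS = {"-999", "-999.0", ""}


def prepare_rows(rows, time_field="DATETIME", time_format="%Y-%m-%d %H:%M"):
    """Split the timestamp into separate fields and recode missing values as '.'."""
    prepared = []
    for row in rows:
        stamp = datetime.strptime(row[time_field], time_format)
        out = {
            "YEAR": stamp.year,
            "MONTH": stamp.month,
            "DAY": stamp.day,
            "HOUR": stamp.hour,
            "WEEKDAY": stamp.isoweekday(),  # 1 = Monday ... 7 = Sunday
        }
        for name, value in row.items():
            if name == time_field:
                continue
            value = value.strip()
            # "." is the missing value code described on the slide.
            out[name] = "." if value in MISSING_SENTINELS else value
        prepared.append(out)
    return prepared


# Hypothetical two-record input used only to exercise the function.
raw = io.StringIO(
    "DATETIME,O3,NO2\n"
    "2005-07-04 13:00,61,-999\n"
    "2005-07-04 14:00,-999,12\n"
)
rows = prepare_rows(csv.DictReader(raw))
```

Each prepared row carries the date parts as separate fields and "." wherever the sentinel appeared, which matches the layout the tips above ask for before import.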
Exploring Your Variables
Ozone data: right-click on a variable to view its statistics.
Common Graphs and Analyses
- Systat can create numerous types of graphs and plots and perform many statistical functions.
- The analyst must determine the appropriate plot(s) to answer different types of questions.
[Figures: example plots of O3 against YEAR (1993 to 2002) and against TEMP; legend variable WDWE]
DATA ANALYSIS
Commonly Used Plots and Statistical Functions
- Summary statistics: quantify data characteristics
- Histograms: understand data distribution
- Bar charts: compare quantities (counts or means)
- Scatter plots: understand relationships
- Box plots: compare distribution and central tendencies
- Scatter plot matrices: compare many relationships
- Correlation analysis: quantify relationships
- Linear regression: identify predictive variables
Open (WY_Site0123_data_ct.syz)
Summary Statistics Used for Trends Plots
[Figure: "Average Ozone"; y-axis Conc. (ppb), x-axis Hour, one series per year for 2005 to 2008]
Diurnal trends in median ozone concentrations for a Wyoming site from 2005 to 2008
- Overall increase in average ozone concentrations observed; less titration?
- Plot was created in Excel from Systat summary statistics by year and hour.
Scatter Plots
- Scatter plots are useful for determining relationships between variables.
- These plots are useful for both data validation and analysis.
  - Are there outliers, and if so, how do they affect comparisons?
  - What are the similarities/differences between parameters?
[Figure: "Sulfur vs. Sulfate"; S25LC_7 plotted against SO425LC_7]
[Figure: "NO2 vs. Ozone"; NO2 plotted against O3]
Example of Scatter Plot
Do we see the expected relationships?

    REM The following command (PLOT)
    REM creates a scatter plot of NO2
    REM concentrations by wind direction
    REM and year.
    PLOT NO2*RD / OVERLAY GROUP = {YEAR}

[Figure: NO2 plotted against RD (0 to 360), with one overlay per YEAR for 2005 to 2008]
This graphic explores NO2 concentrations and resultant wind direction as a function of year. Is there a change in the direction of high concentrations in this time period?
Box-Whisker Plots
- Sample box-whisker plot and notched box-whisker plot as defined by Systat.
- Always define this plot because different packages have different definitions.
- A confidence interval (CI) for a population parameter is an interval with an associated probability p, generated from a random sample of the underlying population, such that if the sampling were repeated many times and the interval recalculated from each sample by the same method, a proportion p of the intervals would contain the population parameter.
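The repeated-sampling definition of a confidence interval can be checked with a small simulation. The sketch below is written in Python rather than Systat and does not reproduce Systat's notch calculation; the population mean, standard deviation, sample size, number of repetitions, and the use of a known-sigma 95% interval for the mean are all assumptions chosen to keep the example short.

```python
import random
import statistics


def coverage(n_repeats=2000, sample_size=30, true_mean=50.0, true_sd=10.0, seed=1):
    """Fraction of repeated 95% intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = 1.96  # normal critical value for a two-sided 95% interval
    # Assumption: the population standard deviation is known, so the
    # half-width is the same for every sample.
    half_width = z * true_sd / sample_size ** 0.5
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        mean = statistics.fmean(sample)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / n_repeats


observed = coverage()
```

With these settings, `observed` should land close to 0.95, which is the proportion p in the definition above.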
Example of a Notched Box-Whisker Plot
Notched box-whisker plots are useful for showing the central tendency of the data (i.e., the median) while also showing variability (i.e., the box and whiskers).

    REM The following command (DENSITY)
    REM creates a notched box plot of ozone
    REM concentrations by year.
    DENSITY O3 * YEAR / BOX NOTCH COLOR=BLACK

O3 = ozone (ppb)
Linear and Nonlinear Regression
Regression analyses identify and quantify predictive relationships between variables.
Options
- Multiple linear regression
- Stepwise regression
- Automatic outlier and influential point detection
- Plots of residuals vs. predicted values
- Many nonlinear regression forms
Example Linear Regression Analysis
- Before performing linear regression, it is vital to examine a scatter plot of the data!
- Outliers at the ends of the data set strongly influence linear regression.
- Total nonmethane organic compounds (TNMOC) and NOx at 7 a.m. in an urban setting should have relatively good correlation.
Example Results

Dependent Variable                  TNMOC
N                                   502
Multiple R                          0.706
Squared Multiple R                  0.498
Adjusted Squared Multiple R         0.497
Standard Error of Estimate          41.259

Effect     Coefficient   Standard Error   Std. Coefficient   Tolerance   t        p-value
CONSTANT   28.134        2.953            0.000              .           9.527    0.000
NOX        2.485         0.112            0.706              1.000       22.283   0.000

Final equation: TNMOC = 2.5(NOx) + 28.1

Case 344 is an Outlier (Studentized Residual : 11.168)
Case 2,360 has large Leverage (Leverage : 0.053)
Case 2,576 has large Leverage (Leverage : 0.047)
Case 2,648 has large Leverage (Leverage : 0.038)
Case 2,936 has large Leverage (Leverage : 0.036)
Case 5,408 has large Leverage (Leverage : 0.036)
Case 8,028 is an Outlier (Studentized Residual : 5.155)
Case 11,490 has large Leverage (Leverage : 0.047)
Case 14,536 has large Leverage (Leverage : 0.060)
Case 16,240 has large Leverage (Leverage : 0.040)
Case 17,488 has large Leverage (Leverage : 0.045)
Case 18,256 has large Leverage (Leverage : 0.047)
Case 19,432 is an Outlier (Studentized Residual : -4.275)

Plot annotation: Random scatter desired
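The reported numbers can be cross-checked with a few lines of arithmetic. This is a minimal Python sketch that uses only the values printed in the output above; the function name `predict_tnmoc` and the example NOx value of 20 are hypothetical, and units are whatever the original data set used.

```python
# Values copied from the regression output.
CONSTANT, CONSTANT_SE = 28.134, 2.953
SLOPE, SLOPE_SE = 2.485, 0.112
MULTIPLE_R = 0.706


def predict_tnmoc(nox):
    """Apply the fitted line TNMOC = SLOPE * NOx + CONSTANT."""
    return SLOPE * nox + CONSTANT


# Squared Multiple R is the square of Multiple R.
r_squared = MULTIPLE_R ** 2

# Each t statistic is the coefficient divided by its standard error.
t_constant = CONSTANT / CONSTANT_SE
t_slope = SLOPE / SLOPE_SE

example_prediction = predict_tnmoc(20.0)
```

The squared Multiple R and the t value for CONSTANT reproduce the reported 0.498 and 9.527. The t value for NOX comes out near 22.19 rather than the reported 22.283, presumably because the printed coefficient and standard error are rounded to three decimals.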
Summary
- Systat is a powerful graphical statistical tool.
- Explore options and learn statistics through use of the Help facility and examples.
- Share your command files, tips, and tricks with other users.
SUMMARY
Appendix: Key Systat Commands
Box plot in black and white

    DENSITY benz*year / BOX NOTCH COLOR=BLACK

Save output, graphs

    OSAVE file path and name /rtf
    GSAVE file path and name /wmf
    SAVE file path and name

- OSAVE with /rtf is best for multiple graphs, such as with the BY command.
- GSAVE also saves .bmp, .emf, .pct, .eps, .pg, and .cgm formats.

Export to Excel

    EXPORT file path and name.xls /type=excel

- Note that this saves only 16,000 lines!!! (Excel 3.0)
APPENDIX
Appendix: Key Systat Commands
Select range of data

    SELECT QCS=0 AND month>5 AND month<10

Scatter plot matrix

    SPLOM var1 var2 etc. / half color=black

Setting coordinates

    DENSITY benz*hour / BOX NOTCH COLOR=BLACK xmin=0 xmax=24 xtick=6
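For readers who want to see the logic of the SELECT example spelled out, the sketch below applies the same condition in Python. It is an illustration of the filter only, not Systat syntax; the row layout and the field names QCS and MONTH (written in one consistent case) are assumptions.

```python
def select_rows(rows):
    """Keep rows where QCS is 0 and the month is June through September."""
    return [row for row in rows if row["QCS"] == 0 and 5 < row["MONTH"] < 10]


# Hypothetical records used only to exercise the filter.
records = [
    {"QCS": 0, "MONTH": 5, "O3": 41},   # rejected: month is not greater than 5
    {"QCS": 0, "MONTH": 6, "O3": 58},   # kept
    {"QCS": 1, "MONTH": 7, "O3": 63},   # rejected: QCS is not 0
    {"QCS": 0, "MONTH": 9, "O3": 52},   # kept
    {"QCS": 0, "MONTH": 10, "O3": 37},  # rejected: month is not less than 10
]
selected = select_rows(records)
```

Both comparisons are strict, so months 5 and 10 are excluded and only months 6 through 9 pass.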
Appendix: Troubleshooting Ideas
Importing
- Remove any formulas or formatting in the file.
- Make sure there are no gaps (empty lines).
- Make sure each column is uniquely named.
- Save as Excel 3.0 or tab-delimited .txt.
Scripts
- Go via the menu and compare the log with the script.
- Move the stats line one line up or down.