SAS MACROS AND SAS JMP GENOMICS FOR ANALYSIS OF MICROARRAY GENE EXPRESSION DATA

J. Sreekumar
Central Tuber Crops Research Institute
Sreekariyam, Thiruvananthapuram - 695 017
Sreejyothi_in@yahoo.com

1. Introduction

Microarray gene expression experiments allow biologists to monitor the expression levels of thousands of genes simultaneously. Applications of microarrays range from the study of gene expression in plants under different environmental stress conditions to the comparison of gene expression profiles of tumours from cancer patients. DNA microarray experiments raise numerous statistical questions in fields as diverse as image analysis, experimental design, hypothesis testing, cluster analysis and distribution theory. Noise creeps into microarray experiments at each stage, from the preparation of tissue samples to the extraction of data. The greatest challenge to array technology lies in the analysis of gene expression data to identify which genes are differentially expressed across tissue samples or experimental conditions. This article summarizes some of the issues involved, gives a brief review of the analysis tools (macros) available in SAS, and shows how SAS JMP Genomics helps in designing a microarray experiment and in analysing the gene expression data from the experiment.

Any microarray experiment involves a number of distinct stages. Firstly, there is the design of the experiment. The researchers must decide which genes are to be printed on the arrays, which sources of RNA are to be hybridized to the arrays and on how many arrays the hybridizations will be replicated. Secondly, after hybridization, there follow a number of data-cleaning steps, or low-level analysis, of the microarray data. The microarray images must be processed to acquire red and green foreground and background intensities for each spot.
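To make the step from spot intensities to a red/green ratio concrete, the following is a minimal sketch in Python (used here only for illustration, not SAS). It assumes simple background subtraction and the common M/A convention, where M is the log2 ratio and A the average log2 intensity; the function name log_ratio and the spot values are hypothetical and are not part of any package discussed in this article.

```python
import math

def log_ratio(red_fg, red_bg, green_fg, green_bg):
    """Return (M, A) for one spot, or None if a corrected intensity is not positive.

    M is the log2 red/green ratio and A the average log2 intensity,
    both computed after subtracting the local background (an assumption
    of this sketch; other background corrections exist).
    """
    red = red_fg - red_bg
    green = green_fg - green_bg
    if red <= 0 or green <= 0:
        return None  # the log ratio is undefined for this spot
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

# Hypothetical spots: (id, red foreground, red background,
#                      green foreground, green background)
spots = [
    ("spot1", 1100.0, 100.0, 600.0, 100.0),  # red is twice green after correction
    ("spot2", 350.0, 100.0, 1100.0, 100.0),  # green is four times red
    ("spot3", 90.0, 100.0, 600.0, 100.0),    # red falls below its background
]

for name, rf, rb, gf, gb in spots:
    print(name, log_ratio(rf, rb, gf, gb))
```

Spots for which the function returns None would need separate handling (for example flagging) before normalization; how that is done is a design choice this sketch leaves open.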
The acquired red/green ratios must be normalized to adjust for dye bias and for any systematic variation other than that due to the differences between the RNA samples being studied. Thirdly, the normalized ratios are analyzed by various graphical and numerical means to select differentially expressed (DE) genes or to find groups of genes whose expression profiles can reliably classify the different RNA sources into meaningful groups.

2. What is JMP Genomics?

JMP Genomics is statistical discovery software from SAS and JMP. Research organizations use JMP Genomics to uncover meaningful patterns in high-throughput genetics, expression microarray and proteomics data. Dynamically interactive graphics and analysis dialog boxes make it easy to explore data relationships using a comprehensive set of traditional and advanced statistical algorithms. JMP Genomics dynamically links advanced statistics with graphics to provide a comprehensive picture of research results. More than 100 procedures for genetics, microarray and proteomics analysis make JMP Genomics an all-in-one solution, whether screening a genome for significant genetic markers, looking for meaningful patterns in expression microarrays or examining high-throughput spectral data in a proteomics lab. JMP Genomics can be used (a) to identify key genes from large microarray data sets, (b) to assess quality-control metrics to identify and remove outlier arrays, (c) to normalize within and across arrays to remove effects of
experimental biases, (d) to perform gene-by-gene modelling to discover statistically significant differences and (e) to reveal biological insight with pattern discovery and predictive modelling tools.

3. SAS Macros for Microarray data analysis

Many macros are available in SAS for the analysis of microarray gene expression data. Hennequet-Antier et al. (2005) developed the AnovArray package, a collection of SAS macros based on Analysis of Variance (ANOVA) models. SAS procedures handle a wide range of statistical analyses used in microarray analysis, such as clustering, supervised classification, singular value decomposition and (partial least squares) regression. The AnovArray package interfaces naturally with all these tools and therefore benefits from the full capabilities of SAS. The package can be applied to normalized data from macroarray or microarray experiments with a balanced factorial design and a complete model. It includes a macro to identify differentially expressed genes between experimental conditions under the hypothesis of homogeneous variance (HOM) or heterogeneous variance (HET).

The following sections correspond roughly to the analysis steps in SAS JMP Genomics, followed by a detailed view of the SAS macros for the analysis of gene expression data.

4. Gene expression analysis in JMP

4.1 Importing data in JMP

For expression and exon analysis, JMP Genomics requires two files: a design file and a data file. The design file contains all the information regarding the sample attributes. You should include as much information as possible about your experiment, including technical variables (e.g., date or batch) as well as experimental and clinical variables. Including this type of information makes it easier to understand the sources of variance in the experiment when you run quality control processes. The design file has two required columns or variables. The Array column is numeric and has a unique number for each array.
The ColumnName column contains a unique identifier for each array. JMP Genomics contains tools to create these variables, found in the Experimental Design submenu. In preparation for importing Affymetrix expression data, the Affymetrix Experimental Design Wizard can help create a design file from Affymetrix Array Attribute (ARR) files from Affymetrix's Expression Console or from existing text or Excel files. Note that when importing design information from text or Excel files, the design file template must contain a column labeled File or FileName, containing the file names of all the arrays in the study, and at least one column with design information that will be used in statistical tests (e.g., Treatment). Select Genomics > Experimental Design File > Affymetrix Experimental Design File Wizard. Click Next and, in the following window, name the study and choose either Extract information from ARR Files (AGCC format) or Import design information from an existing text, CSV, or Excel file. To import Illumina expression data, go to Genomics > Import > Illumina > Expression.

4.2 Experimental designs

Before carrying out a microarray experiment, one must decide how many microarray slides will be used and which mRNA samples will be hybridized to each slide. Certain decisions must be made in the preparation of the mRNA samples, for example whether the RNA from
different animals will be pooled or kept separate and whether fluorescent labelling is to be done separately for each array or in one step for a batch of RNA. Careful attention to these issues will ensure that the best use is made of available resources, that obvious biases are avoided, and that the primary questions of interest to the experimenter can be answered. Kerr and Churchill (2001) and Glonek and Solomon (2002) apply ideas from optimal experimental design to suggest efficient designs for some of the common microarray experiments. Pan, Lin and Le (2002) consider sample size, and Speed and Yang examine the efficiency of using a reference sample as against direct comparison.

4.3 The basic expression work flow

A sample workflow for the analysis of microarray data is as follows:

1) Generation of the Data Sets
   a. Experimental Design File Builder
   b. Data Set Creation
2) Evaluation of the Data Quality
   a. Raw Data Distribution Analysis
   b. Ratio Analysis (Raw Data)
   c. Ratio Analysis (Loess Normalization)
3) Comparison of Different Methods for Data Normalization
   a. Data Standardization (Median) & Standardized Distribution Analysis
   b. Loess Normalization Across Arrays & Distribution Analysis (Loess Normalized Data)
4) Evaluation of Normalized Data Quality
   a. Correlation and Principal Components
   b. Correlation and Grouped Scatter Plots
5) Primary Data Analysis for Determining Significant Differences in Gene Expression
   a. Analysis of Variance
   b. Mixed Model Analysis
6) Further Analysis
   a. Transpose Tall and Wide
   b. K-Means Clustering
   c. Distance Matrix
7) Predictive Modeling

5. SAS AnovArray package

The functions of the AnovArray package are written in the SAS macro language, so they can be submitted directly to SAS. SAS procedures handle a wide variety of statistical analyses such as clustering, supervised classification, singular value decomposition and partial least squares regression.
AnovArray is available from the site http://wwwmig.jouy.inra.fr/mig/software.html
The data must be contained in a text file (.txt extension) written in columns separated by spaces or tabs. The input data to the AnovArray package are assumed to be normalized and to come from a balanced experimental design. The dataset should contain a column named GENE, which holds the gene identifier.

5.1 The contents of AnovArray package

The package contains five macros called global_analysis, cleandata, adjust, differential_analysis and comparison. It is advisable to iterate over the global_analysis, cleandata and adjust macros and then to run differential_analysis and comparison.
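The input requirements described above (a whitespace- or tab-separated text file, a GENE column and a balanced, complete design) can be checked before any macro is run. The following Python sketch is only an illustration of such a check and is not part of AnovArray; the column names TREATMENT, REPLICATE and SIGNAL, the sample values and the helper names read_table and check_input are hypothetical.

```python
from collections import Counter

# Hypothetical input in the layout described above: one observation per row,
# columns separated by spaces, with a GENE identifier column.
SAMPLE = """\
GENE TREATMENT REPLICATE SIGNAL
g1 control 1 7.9
g1 control 2 8.1
g1 stress 1 9.4
g1 stress 2 9.6
g2 control 1 5.0
g2 control 2 5.2
g2 stress 1 5.1
g2 stress 2 4.9
"""

def read_table(text):
    """Parse whitespace- or tab-separated text into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]

def check_input(rows, factors):
    """Return a list of problems; an empty list means the layout looks usable."""
    if not rows or "GENE" not in rows[0]:
        return ["missing GENE column"]
    problems = []
    keys = ["GENE"] + factors
    # Count the observations in every gene x factor-level cell.
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    if len(set(counts.values())) > 1:
        problems.append("unbalanced: cells have different numbers of observations")
    # A complete design contains every combination of the observed levels.
    expected = 1
    for k in keys:
        expected *= len({r[k] for r in rows})
    if len(counts) != expected:
        problems.append("incomplete: some gene/factor combinations are missing")
    return problems

rows = read_table(SAMPLE)
print(check_input(rows, ["TREATMENT"]))       # full sample
print(check_input(rows[:-1], ["TREATMENT"]))  # one observation dropped
```

Dropping a single observation, as in the last line, is enough to make the design unbalanced, which is the situation the package assumes does not occur.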
Figure 1: The analysis strategy of the AnovArray package

5.2 global_analysis Macro

The global_analysis macro uses the functionality of the SAS ANOVA procedure to perform an analysis of variance on the data. The model is assumed to be complete and the experimental design balanced. In the SAS output window, the macro displays the analysis of variance table with Fisher's F-test for each factor under consideration. In addition, it calculates fitted values, residuals and standardized residuals, and it produces several graphs of the standardized residuals.

The AnovArray package has to be loaded before running any of the macros, so the first step is to load the package with the statement

    %include 'c:/... /AnovArray.1.0.sas';

The global_analysis macro can then be executed with the command

    %global_analysis (
        data=,
        stmts=,
        outdata=,
        outgraph=,
        procopt=,
        options=
    )

where data specifies the dataset and stmts specifies the PROC ANOVA statements for the analysis, separated by semicolons and passed as a single argument through the %str macro function. The options are the same as those of the PROC ANOVA statements.

5.3 The cleandata Macro

The cleandata macro provides dataset-cleaning facilities to remove suspicious genes from the dataset. These genes are sometimes known explicitly. In this case, the list of their identifiers can be given by the user as an argument of the cleandata macro.
    %cleandata (
        data =,
        outdatakeep =,
        outdatadrop =,
        outdataoutliers =,
        limit =,
        droplist =,
        options =
    )

where data specifies the dataset to be cleaned; outdatakeep is optional and names the output dataset, which is the original dataset with the selected genes removed; outdatadrop names the dataset containing the genes that were removed; and limit is a real number that sets the threshold on the standardized residuals: genes whose standardized residuals fall outside this limit are removed from the original data.

5.4 The adjust macro

This is the normalization step. It is useful to adjust for systematic errors before doing the differential analysis. The adjust macro produces no output in the output window, but it creates a dataset that contains the adjusted signal.

    %adjust (
        data =,
        outdata =,
        signal =,
        list =
    )

where data refers to the dataset, outdata refers to the output dataset after adjustment, signal names the signal to be adjusted and list gives the effects to be subtracted from the signal.

5.5 The differential_analysis Macro

The differential_analysis macro is used to identify genes that are differentially expressed under two or more experimental or biological conditions. The test is based on a comparison of ANOVA models, and the user chooses between two procedures through the hypothesis argument: hypothesis=hom assumes that the gene variances are homogeneous, whereas hypothesis=het assumes that they are heterogeneous. The differential_analysis macro provides an output dataset with adjusted p-values under both the homogeneous and the heterogeneous assumptions.
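As an illustration of what an adjusted p-value is, the following Python sketch applies the Benjamini-Hochberg step-up adjustment, which is listed in the references of this article. It is an assumption that this matches what the macro computes; the text above does not state the exact procedure. The function name bh_adjust and the raw p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per gene.
genes = ["g1", "g2", "g3", "g4", "g5"]
raw = [0.001, 0.008, 0.039, 0.041, 0.60]

fdr = 0.05  # significance level chosen by the user
adj = bh_adjust(raw)
selected = [g for g, p in zip(genes, adj) if p <= fdr]

for g, p_raw, p_adj in zip(genes, raw, adj):
    print(g, p_raw, round(p_adj, 5))
print("declared differentially expressed:", selected)
```

In this made-up example, g3 and g4 have raw p-values below 0.05 but adjusted p-values slightly above it, so only g1 and g2 are retained. This shows why the adjusted values, not the raw ones, are compared with the chosen level.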
    %differential_analysis (
        data=,
        outdata =,
        outgraph =,
        hypothesis =,
        signal =,
        treatment =,
        fdr =
    )

where data specifies the dataset to be used, outdata names the output dataset which contains the results of the analysis, outgraph is optional and specifies the name of the graphical output file, hypothesis specifies the homogeneous or heterogeneous variance hypothesis, signal names the variable which contains the values to be analysed, treatment names the variable which contains the treatment conditions under which differential expression is to be analysed, and fdr specifies the false discovery rate.

5.6 The comparison macro

The comparison macro compares the results of the differential_analysis macro under the two hypotheses of homogeneous and heterogeneous variance. Each gene for which the two hypotheses lead to different conclusions should be examined individually, particularly for its variance, before the results are finalized.

Figure 2: Interpretation of the comparison graph.

In summary, the five macros of the package can be used either independently or in concert, as indicated in the analysis strategy described in Figure 1. The ANOVA model is defined by the user in the macro global_analysis. This macro computes the classical ANOVA table, which identifies the factors that are important in explaining observed differences in gene expression. As explained in the previous section, several graphs are available to check the model assumptions: homogeneity of variance and Gaussian distribution of the residuals. These graphs can also be very useful for highlighting which experimental factor affects a subpopulation of genes. Several models can be tested, and the quality control facilities (statistics in the ANOVA table, graphs) make it possible to select the most accurate one. Depending on the results given by the macro global_analysis, it may be necessary to use the macros adjust and cleandata. The macro adjust systematically removes undesirable effects (factors) observed in the graphs produced by the macro global_analysis. In the same manner, the macro cleandata makes it possible to remove genes which do not satisfy the assumptions of the model. We advise using this iterative process (global_analysis, cleandata and adjust) before using the macro differential_analysis.
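One pass of this fit-then-clean loop can be sketched as follows. The Python code is a simplified stand-in for the idea, not the macros' implementation: it assumes a gene x treatment cell-means model, scales residuals by the pooled residual standard deviation (ignoring leverage) and flags genes by an absolute threshold. The names standardized_residuals and flag_genes and all data values are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def standardized_residuals(rows):
    """rows: list of (gene, treatment, signal).

    Residual = signal minus the gene x treatment cell mean, divided by
    the pooled residual standard deviation (leverage is ignored here).
    """
    cells = defaultdict(list)
    for gene, trt, y in rows:
        cells[(gene, trt)].append(y)
    means = {k: sum(v) / len(v) for k, v in cells.items()}
    resid = [(gene, y - means[(gene, trt)]) for gene, trt, y in rows]
    df = len(rows) - len(cells)  # residual degrees of freedom
    sd = sqrt(sum(r * r for _, r in resid) / df)
    return [(gene, r / sd) for gene, r in resid]

def flag_genes(rows, limit):
    """Genes with at least one |standardized residual| above limit."""
    return sorted({g for g, z in standardized_residuals(rows) if abs(z) > limit})

# Hypothetical data: four genes, two treatments, two replicates per cell.
# The replicates of gene g3 under "stress" disagree strongly.
rows = []
for gene in ["g1", "g2", "g3", "g4"]:
    for trt, base in [("control", 8.0), ("stress", 9.0)]:
        spread = 2.0 if (gene, trt) == ("g3", "stress") else 0.1
        rows.append((gene, trt, base - spread))
        rows.append((gene, trt, base + spread))

flagged = flag_genes(rows, limit=1.5)
print("first pass:", flagged)

# Remove the flagged genes and fit again, as in the iterative process.
cleaned = [r for r in rows if r[0] not in flagged]
print("second pass:", flag_genes(cleaned, limit=1.5))
```

The second pass matters because removing an aberrant gene shrinks the pooled standard deviation, so residuals that looked small beside the outlier are judged again on a tighter scale.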
The aim of this process is to make sure that the data are well fitted by the model and that the model assumptions are satisfied. This process is very important for obtaining reliable results on differentially expressed genes. As explained in the previous section, the package also permits the differential analysis under two hypotheses: either all genes have equal variance (homogeneous model, HOM) or each gene has its own variance (heterogeneous model, HET). The macro differential_analysis produces the list of genes differentially expressed between several experimental conditions using p-values and adjusted p-values. A p-value is the probability, computed under the null hypothesis {The interaction gene x condition is null.}, of obtaining a test statistic at least as extreme as the one observed. P-values are calculated for each gene under the hypothesis that all genes have the same variance and under the hypothesis that each gene has its own variance. Using the FDR (False Discovery Rate) correction for multiple comparisons [Benjamini and Hochberg (1995); Tusher et al. (2001)], a gene is declared differentially expressed if its adjusted p-value is lower than a significance level given by the user. Finally, the macro comparison makes it possible to compare graphically the results obtained with the two variance models. The plot of adjusted p-values under the hypothesis of homogeneous variance versus adjusted p-values under the hypothesis of heterogeneous variance indicates the genes which probably do not satisfy the homogeneity of variance hypothesis.

6. Other macros in SAS

Some other macros are available for microarray data analysis using SAS. The macros developed by Don (Dongguang) Li (NCIC-CTG, Queen's University) and Lei Qin (Cancer Research Institute, Queen's University) are available at
http://www.lexjansen.com/pharmasug/2005/posters/po34.pdf
That paper introduces a SAS macro based program for microarray data normalization. The algorithms of different methods for both within-array and between-array normalization are discussed. With breast cancer data adopted from the Stanford Microarray Database, the author implemented the program to provide an automatic normalization process with optional method selection and automatic graph generation.

The SAS macros written in the McIntyre Lab are available at
http://www.genomics.purdue.edu/reports/sas_macros.htm

References

Kerr, M. K. and Churchill, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics 2, 183-201.

Glonek, G. F. V. and Solomon, P. J. (2002). Factorial designs for microarray experiments. Technical Report, Department of Applied Mathematics, University of Adelaide, Australia.

Pan, W., Lin, J. and Le, C. (2002). How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3(5), research0022.1-0022.10.

Dudoit, S., Yang, Y. H., Speed, T. P. and Callow, M. J. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111-140.

Yang, Y. H., Dudoit, S., Luu, P. and Speed, T. P. (2001). Normalization for cDNA microarray data. In M. L. Bittner, Y. Chen, A. N. Dorsel and E. R. Dougherty (eds.), Microarrays: Optical Technologies and Informatics, Volume 4266 of Proceedings of SPIE.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J. R. Statist. Soc. B 57(1), 289-300.

Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9), 5116-5121.

Hennequet-Antier, C., Chiapello, H., Piot, K., Degrelle, S., Hue, I., Renard, J.-P., Rodolphe, F. and Robin, S. (2005). AnovArray: a set of SAS macros for the analysis of variance of gene expression data. BMC Bioinformatics 6, 150.

http://www.bioinformaticsonline.com/microarray.php