Step-by-Step Guide to Basic Expression Analysis and Normalization


Introduction

This document shows you how to perform a basic analysis and normalization of your data. Working through the full document will give you a basic understanding of a broad range of applications. In this example, we examine the GEO data set GSE11352 and track gene expression changes in response to estrogen treatment of MCF7 cells during a time-course experiment using Affymetrix arrays. You can download the data files we will use from the same page where you downloaded this document. The data set we will use is a highly modified subset of that series, originally downloaded from the Gene Expression Omnibus (GEO) web site.

Objectives

o Learn how to set up the Basic Expression Workflow and interpret results
o Learn about capabilities for downstream analysis of workflow results
o Learn about the different options for normalizing array data

Basic Expression Workflow

JMP Genomics incorporates several commonly used quality control and ANOVA modeling processes into the Basic Expression Workflow. This is a good tool for beginning users, because the dialog is simplified, with fewer options than in the underlying process dialogs. Basic Workflow results are organized by process and accessed from a single JMP Journal. This organization makes it simple to open tables and graphical results from one or more workflow processes.

The Basic Expression Workflow combines the setup of three or more different processes into one dialog. Both the design file and the data file created during the import process are required to run this workflow. Optional files include an annotation data set and track files.

Data Quality Assessment

As an initial step, we will assess the quality of the data set.

1. Select Workflows > Basic > Basic Expression Workflow from the Genomics Starter to open the Basic Expression Workflow.
2. Examine the General tab.
3. Choose the edf_data.sas7bdat input data set.
4. Highlight Probe_Set_ID in the Available Variables window, and click the right arrow buttons to add this variable to the By Variables, Label Variable, and Variables to Keep in Output fields.

5. Choose an output folder. All results (data sets, scripts, and graphics) generated by the Basic Workflow will be stored here.
6. Type MCF7 in the Workflow Output Name field.

The completed General tab is shown below:

7. Select the Experimental Design tab.
8. Choose the edf.sas7bdat file from the supplied data as the Experimental Design Data Set.
9. Highlight variables in the Available Variables field and add them to each of the following fields:
   o Color Variables
   o Label Variable
   o Variables Defining Plotting Groups
10. Type "Time Characteristics" (without the quotation marks) in the Variance Component Effects box. To specify the interaction between Time and Characteristics, also type "Time*Characteristics" (without the quotation marks) in the Variance Component Effects box. This will calculate the variances due to each individual term as well as the interaction.

The completed Experimental Design tab is shown below:

11. Click the Open button (next to the Experimental Design Data Set field) to open the design file.
12. Select Analyze > Distribution from the JMP menu, and add the Characteristics and Time variables to the Y, Columns field. This will allow us to explore the design of the experiment.
13. Click OK.

Note that you can quickly see the number of samples in each group. Clicking on one histogram bar in a distribution will highlight the same samples in the other distributions and in the data set. This data set has 18 samples evenly distributed across all groups; it is a balanced design.

The resulting QC plots will be colored by the Color Variables, and the scatter plots will be grouped by the Variables Defining Plotting Groups. We will examine the Variance Component Effects across all samples and probes that are due to Time, Characteristics, and their interaction. We do not have any Adjustment Effects, such as array lot number, to add.

14. Close the distribution and design table windows.
15. Select the QC and Normalization tab. Examine the tab.
16. Make sure the three boxes at the top (Distribution Analysis, Correlation and Grouped Scatterplots, and Correlation and Principal Variance Components Analysis) are checked.
17. Expand the PVCA (Principal Variance Components Analysis) options. The Cumulative Proportion of Variation to Explain with Principal Components is set to 0.9 by default. This is usually more than sufficient to get a high-level understanding of the sources of variation. Normalization methods are optional in the workflow; we will consider normalization approaches later in this document. Note that quality control analyses can be run before normalization, after it, or both.
18. Keep all other default values the same.

The completed QC and Normalization tab is shown below:

19. Click Run.

Quality Control Results

When all quality control processes are complete, a JMP journal is created.

1. Examine the MCF7 journal. The Results of each process can be launched by clicking on the corresponding link under each heading. The Reopen Dialog links will open individual process dialogs (rather than the workflow dialog), pre-loaded with the settings specified in the workflow. The settings can be changed and the process re-run if desired. The Close All Other Windows button closes everything but the journal. Clicking Open Workflow Builder Dialog will re-open the workflow that was run in the Workflow Builder interface.
2. Click Reopen Dialog under Process 2 - Data Correlation (MCF7). Examine the different tabs to see how the available options compare to the streamlined options in the Basic Expression Workflow.
3. When finished, click Close All Other Windows from the journal.
4. Click Results under the Process 1 - Data Distribution heading. The first tab on the dashboard is Kernel Density Estimates.
   o The Tabs section in the upper left-hand corner of the report displays a list of available results tabs. In some cases, only a subset of tabs will be displayed initially. Tabs that are not currently in view may be opened by clicking on the pull-down button for that tab and selecting View Tab.
   o The Tabs section will often be followed by other sections. For example, in this case, a Launch Follow-Up Process section contains action buttons to launch additional analysis processes. If data tables other than those used to create figures displayed in tabs are available, they will be listed next. To close a tabbed dashboard, use the Close All button.

The overlaid curves display the distribution profile of all the arrays in the data set, with intensity on the x-axis and the relative frequency on the y-axis. You will note that one array is slightly different from the others. If you hover over the corresponding line with the mouse, you will see that it is an untreated control array from the 12-hour time point.

If we wished to remove this array from further analysis, we would not have to modify the original input data table. Since the rows present in the design file dictate the columns from the input data set that will be used in the analysis, we can simply delete the corresponding row from the design file.

5. Highlight the outlier array by clicking on the red line at the top of the plot.
6. Click the Create Subset Experimental button found above the graph window (circled in the figure above). A new design file, containing all of the data minus the outliers, is created in the output directory.
7. Click on the Box Plots tab to bring it into view. The box plot graph shows side-by-side comparisons of all arrays in the data set.
8. Click on the red triangle hotspot to the left of the graph's title to access drill-down options for viewing data set quantiles and other information.

Direct your attention to the Launch Follow-Up Process section of the Results window. Launch Follow-Up buttons are surfaced when there are reasonable next steps to take after a process has run. While the suggested processes can also be accessed from the JMP Genomics menu, clicking the button to launch the process will automatically load known information, such as input data sets picked in a prior step and output folders, into the process.

9. Click the Filter Intensities button. Note that the General tab of the Filter Intensities dialog (shown below) is automatically loaded with the original input files we specified in the workflow.

Let's work through an example of filtering data. We will change expression values less than -1.6 to missing, and then we will delete probes where a majority of samples within either the estrogen or control group have such values.

10. Select the Replace Low/High tab. Check the Replace Low Values box to activate the additional filtering options below it. Type -1.6 in the Replace Intensities Falling Below this Value text box.

Note: There are a number of other options (e.g., Standard Deviations) to use as criteria for filtering rows from a data set. An identical set of options can be used for filtering high values.

The completed Replace Low/High tab is shown below:

11. Select the Delete Rows tab. Under the Delete Rows option, click the "if any of the expressions are satisfied" radio button. Type NMISS>=5 in the Delete Rows with Number of Missing Values Satisfying This Expression text box.
   o When used alone, this filtering expression will remove rows where there are 5 or more missing values.
   o However, we will be performing the filtering step in a group-wise manner. Rather than deleting a row if 5 or more of any values in the row are missing, we will specify a grouping variable on the next tab so that a row is deleted if 5 or more of the 9 values are missing in either the estrogen or control group (or both).
   o Many other criteria besides number of missing values can be used, and these can be combined with either OR- or AND-type Boolean statements.

The completed Delete Rows tab is shown below:

12. Select the Groups tab. Highlight Characteristics and click to add it to the Variables Defining Groups field.
   o Recall that this variable classifies samples into one of two groups, E2 (estrogen-treated) and Cont (control).
   Type 49 in the Group Percentage for Deletion text field.
   o Specifying a percentage less than 50% here will cause a row to be filtered from the data set if it contains 5 or more missing values (NMISS >= 5) in one or both treatment groups.

The completed Groups tab is shown below:

13. Select the Options tab. Here, you may specify a custom output file name. If you do not, a default name will be created for you.
14. Click Run.
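The group-wise filtering rule configured in the steps above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this guide, not the code JMP Genomics runs: the function name `filter_probes`, its parameters, and the toy probe values are all hypothetical, and the reading of Group Percentage for Deletion as "the percentage of groups that must satisfy the expression" is an assumption.

```python
import math

def filter_probes(rows, groups, low_cutoff=-1.6, min_missing=5, group_pct=49.0):
    """Hypothetical sketch of the Filter Intensities settings used above.

    rows   : dict mapping probe ID -> list of intensities (one per array)
    groups : list of group labels, one per array, in the same order
    """
    labels = sorted(set(groups))
    kept = {}
    for probe, values in rows.items():
        # Replace Low Values: intensities below the cutoff become missing (NaN).
        masked = [math.nan if v < low_cutoff else v for v in values]
        # Delete Rows: count the groups in which NMISS >= min_missing holds.
        flagged = 0
        for label in labels:
            nmiss = sum(1 for v, g in zip(masked, groups)
                        if g == label and math.isnan(v))
            if nmiss >= min_missing:
                flagged += 1
        # Groups tab (assumed meaning): drop the row when the percentage of
        # flagged groups reaches group_pct. With two groups and 49, one
        # flagged group (50%) is enough.
        if 100.0 * flagged / len(labels) < group_pct:
            kept[probe] = masked
    return kept

groups = ["E2"] * 9 + ["Cont"] * 9
rows = {
    "probe_a": [0.5] * 18,                             # nothing below the cutoff
    "probe_b": [-2.0] * 5 + [1.0] * 4 + [1.0] * 9,     # 5 missing within E2
    "probe_c": ([-2.0] * 4 + [1.0] * 5) * 2,           # 4 missing in each group
}
kept = filter_probes(rows, groups)
```

In this toy input, probe_c has 8 missing values overall but only 4 in each group, so it survives the group-wise rule even though NMISS>=5 applied to the whole row would have removed it; probe_b is dropped because one group alone reaches 5.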

Note: The Results window that appears shows the path of the data set and reveals that the number of rows in the filtered data set is less than the number of rows in the original data set. Also, a new Launch Follow-Up Process window has been created, in which we could click the Basic Expression Workflow button to run a new workflow with the filtered data set.

15. Go back to the workflow results journal and click the Close All Other Windows button.
16. Click the Results button under Process 2 - Data Correlation (MCF7).
17. Click the Correlation Heat Map tab. The heat map and associated dendrogram are used to visualize the magnitude of the correlations between data from all possible pairs of samples.
   o The clustering algorithm applied here is unsupervised. Groups of samples whose data are positively correlated cluster together in regions of light to dark red colors. Groups of samples that display negative correlations are displayed in light to dark blue areas of the heat map. Pairs of samples that display low to no correlation will cluster in gray areas of the graph.
   o Examine the dendrogram found to the right of the heat map. It is immediately apparent that the estrogen vs. control treatment effect explains the split between the two main clusters. Replicate samples from the same time points cluster together within each treatment cluster. This indicates that the estrogen treatment is likely a primary effect and time a secondary effect.
18. Click on one of the main branches of the dendrogram to the right of the heat map.
19. Select Analyze > Distribution from the JMP menu and add Characteristics and Time as the Y, Columns. What do you see?
20. Click on the 3D PCA Plot tab. This plot displays a three-dimensional view of a principal components analysis performed on the paired correlations between samples.
21. Click the red triangle by the plot title and select Normal Contour Ellipsoids > Group by Column. Select Time as the grouping variable. Ellipsoids that group samples by time point are drawn. The time effect is mainly captured by the 2nd principal component, which explains 14.1% of the variance. We could repeat the process to draw ellipsoids that group samples by treatment group. If we did, we would see that the estrogen vs. control treatment effect is captured mainly by the 1st principal component, which captures 30.5% of the total variance.
22. Click on the 2D PCA Plots tab. These two-dimensional plots are based on the same data set as the 3D plots. Use these plots to examine the distribution of samples across each component. This can be helpful, particularly when working with large data sets.
23. Scroll to the bottom of the 2D PCA Plots window.

The plot of Mahalanobis distances compares the distance of each array from a centroid, or center of mass, of the data points, taking into account the covariance. This may be useful in identifying potential outliers among arrays in the data set.

24. Select the Variance Component Charts tab. The top chart represents the weighted average proportion of variances across all samples and probes, accounting for 90% of the total variance. (For the purposes of calculation, the estimated variances are the result of treating all experimental variables as random effects in a mixed model.) Consistent with earlier observations, we can see that the estrogen vs. control treatment is the largest effect, time is next, and the interaction of the two explains the least of the overall variability. The residual, or error, is about 38% of the total. If more were known about the data (array lot numbers, etc.), we might be able to attribute variance to technical effects and decrease the unexplained residual variance.

The lower graph, Variance Proportion by Principal Component, shows the breakdown of variance explained by each principal component. For example, we see that the Characteristics variable (estrogen or control) explains nearly 100% of the variance captured by the first principal component, Time dominates the 2nd and 4th components, and the interaction explains most of the 3rd, 5th, and 7th principal components.
   o Note that you may click the red triangle to the left of the Variance Proportion by Principal Component title to view more options. To view each component in a separate graph, uncheck the overlay option by clicking on Overlay Plots > Overlay Y's.

The underlying tables for this and all other tabs are hidden by default, but may be opened either from the Window List or the tabbed report. Under the Tabs section on the upper left of the dashboard, click on the pull-down menu for the tab of interest and click View Data. We now have a very good estimate of the relative amounts of variance explained by the experimental variables in our data set.

25. Go back to the Workflow Journal and click Close All Other Windows.
26. Click on Results under Process 3 - ArrayGroupCorrelation (MCF7). A window opens that shows two different scatter plot grids. We see two grids because we elected to analyze samples within control and estrogen groups separately in the Basic Expression Workflow dialog. Each grid displays a set of scatter plots in which data from pairs of samples are plotted against one another. This graphic provides a summary view of the correlation between replicate arrays in each treatment group.

On the diagonal we see distribution histograms for each individual sample. Off-diagonal scatter plots show probe intensities for pairs of samples.

27. Select Window > Close All from the top menu bar.

Analysis of Variance

The next step is to run the analysis of variance (ANOVA) in the Basic Expression Workflow. Note that the quality control and ANOVA functions can be run together, but normally you will want to check the data quality before running the ANOVA. We will not be normalizing this data set in the workflow. We will cover normalization options in a separate section later in this document.

1. Select File > Load Life Sciences Setting and navigate to the output directory used for the quality control analysis.
2. Select the BasicExpressionWorkflow MCF7 setting. The previous settings automatically load into the window. Now we just need to change some settings to set up the ANOVA.
3. On the General tab, change the Workflow Output Name to ANOVAs1.
4. Select the QC and Normalization tab and uncheck the three QC processes previously run.
5. Select the ANOVA tab.
6. Select Time and Characteristics and add them to the Class Variables field.

7. Type Time Characteristics Time*Characteristics in the Model These Fixed Effects field. In this case we have a fairly simple model, a 2x3 factorial ANOVA. For more information about specifying complex models, click the button next to the Model These Fixed Effects field, or refer to the SAS documentation for the MODEL statement in PROC MIXED. We have no random effects in this model, but if we did, they would be added as Class Variables and entered in the Adjust Variability for these Random Effects field.
8. Select and copy the text in the Model These Fixed Effects field.
9. Select the LSMeans tab.
10. Paste the copied text into the Estimate LSMean Differences for These Fixed Effects field.

The LSMeans Difference Set for Volcano Plots radio buttons are used to select standard sets of differences automatically, without the Difference Chooser.
   o For factorial designs, the Simple Differences option shows only comparisons between groups that differ in the level of a single factor. For example, if age and sex were crossed factors, this option would produce a comparison between young males and young females, but not between young males and old females.
   o Differences with a Control compares all other groups with the control group identified in the LSMeans Control Levels box.
   o When None is selected, multiple testing corrections are performed on the F-tests of each fixed effect, with no differences calculated. This can be particularly helpful in understanding which genes are changing in response to an interaction term.

11. Make sure that Simple Differences is selected as the LSMeans Difference Set for Volcano Plots.
12. Click the Difference Chooser button. The Difference Chooser dialog opens in a new window. This tool allows you to easily set up a subset of differences to calculate.
13. Set up the Difference Chooser dialog as shown below:
14. Click Save. The file with the selected differences will be created, and its path automatically loaded in the difference data set field.

The completed LSMeans tab is shown below:

15. Select the Multiple Testing tab. In the pull-down menu for Multiple Testing Method, select FDR, which is the Benjamini-Hochberg correction for false discovery rate. Note all of the available methods. Keep all other defaults on the Multiple Testing tab.
16. Select the Annotation tab. Choose the supplied annotation file hg_u133_plus_2_na28_annot.sas7bdat. Select the Annotation Merge Variable and the Annotation Label Variable.

The completed Annotation tab is shown below:

17. Leave the Tracks tab blank.
18. Click Run.

ANOVA Results

1. Click Results from the journal ANOVAs1. The dashboard shows a volcano plot for each comparison and a set of action buttons. There are many ways to drill down into the data at this point. We will cover some of the main features.
2. Examine one of the Volcano Plots.

A volcano plot presents a summary view of the results for all probe sets for a single comparison. Each point represents a single probe set. The x-axis value for that point is the difference between the two group means being compared. The y-axis value for the point is the -log10(p-value) associated with its difference. The red dotted line represents the adjusted -log10(p-value) significance cutoff for the multiple testing correction method specified earlier. Points on the right have positive differences, and those that fall above the dotted line are considered significantly increased in the 12hr E2 group relative to the 12hr Cont group in the figure above. Points on the left, with negative differences, that fall above the dotted line are considered significantly decreased in the 12hr E2 group relative to the 12hr Cont group in the figure above.

3. Select View Data from the Results pull-down menu under the Tabs section. This table contains the results from the ANOVA analyses run for all probe sets. Note that annotation information has been merged with the statistical results from the ANOVA calculations, which include LSMeans, differences, and p-value information.
4. Scroll to the right to find the significance index columns. For every difference p-value, a corresponding column is created that contains 0-1 values indicating whether that p-value met the significance criterion. If a p-value was significant, a value of 1 is placed in the corresponding cell in its column. If a result was not significant, a value of 0 is recorded. We can use these variables in a number of ways. For example, they can be used to generate Venn diagrams showing the relationship among significance results, as in the following steps.
5. Click the Venn Diagram action button in the ANOVA results dashboard.
6. Select all four columns by clicking the first and shift-clicking the last. Click OK.

Each section in the resulting Venn diagram captures the overlap of significant results for one or more comparisons. Venn diagrams can be generated from any table with one or more binary variables coded as 0 and 1, using General Utilities > Venn Diagram Single Table. The middle section of the graphic contains the number 2614, which is the number of probe sets that exceeded our significance criteria for all four difference comparisons.

7. Click the section labeled 2614, view the volcano plots, and then return to the data table. What do you see?
8. Click Tables > Subset. Clicking sections of the Venn diagram highlights rows in the underlying _amr table. We can use JMP tools to create new data tables that contain interesting subsets of genes, and use these subset tables to perform additional analyses. In the Subset dialog, make sure the Selected Rows and All Columns options are selected. Click OK. The resulting table can now be saved as a JMP table, SAS table, txt file, or other file type.

Next we will manipulate this data set to create a wide data set that can be used to perform hierarchical clustering of these 2614 genes.

9. Close the subset data table.
10. From the Action Buttons section of the dashboard, click the Open Subset in Wide Format button.
11. In the Wide Subset dialog that opens, highlight Probe_Set_ID.
12. Type pr_ in the Optionally Enter a Common Prefix for Wide Column Names box.
13. Type venn_wide as the output data set name.

The completed dialog is shown below:

14. Click OK. A new file, venn_wide, is created. It contains the original intensity values for the 2614 probes merged with the design file information. We'll come back to this table when we perform clustering.
15. Close the venn_wide table. The SAS data set is saved in the output folder.
16. Close the Venn Diagram graphic.
17. Examine the remaining Action Buttons.

If we wanted to create a tall data table with rows corresponding to only those 2614 probes, we would have clicked the Open Subset in Tall Format button instead. If you have an Ingenuity Pathway Analysis license, you could also click the IPA Upload button to send the list of selected probe sets directly to Ingenuity for further analysis.

Next, we will drill down to examine detailed results for just a few probes.

18. Go to the 12hr E2 - 12hr Control volcano plot and select a few points on the far right (no more than 5) by drawing a box with your mouse.
19. In the Action Buttons menu, click Construct Oneway Plots.
20. Highlight the required variables in the upper box and in the lower box. Click OK. Separate graphics are created for each probe set, plotting the original intensities of that probe set for samples grouped by their values of Time and Characteristics.

21. Click the red triangle next to the plot title to examine additional options.
22. From the Action Buttons menu, click the Fit Model and Plot LSMeans button. In the resulting dialog, highlight the effects to plot and click OK. A graphical profile (shown below) is created for each probe set, showing its LSMeans values over different time points, treatment groups, and levels of the interaction term.
23. From the Action Buttons menu, click the Plot Intensities button.
24. In the Plot Intensities dialog, add the data columns under Intensity Columns to Plot.
25. Change the Label Variable.

26. Click Run. A parallel plot is now shown in which each line represents the intensity of an individual probe set. Observe how that probe set's value changes across different samples.

Using the Data Filter to Select Genes

Genes are often selected according to certain criteria in the results data set. In this section, we will review using the Data Filter to select significant genes with a 2-fold or greater change (i.e., log2 mean differences greater than 1 or less than -1).

1. In the ANOVA results, select Rows > Data Filter.
2. Highlight the 12hr E2 - 12hr Control Significance Index and Diff columns in the Data Filter variable list.

3. Click Add. A new dialog opens.
4. Click the OR button (circled above), then click Add without changing the selections.
5. Fill out the filter as shown below:

Note: It is simpler to click on the numbers at either end of the range for the variable and type the 1 and -1 instead of using the slider.

Check the Show box in the data filter and examine the volcano plots. What happened to the volcano plots?

6. From the Results journal, click on Reopen Dialog to open the ANOVA process.
7. Select the Test tab. Expand the Mean Difference Filter outline. The ANOVA process dialog offers options in addition to those available in the workflow. You can set value cutoff filters for differences, as in this example, where probe sets will be given a significance index value of 1 only if they pass the p-value cutoff and have a mean difference greater than 1 or less than -1 (equivalent to a 2-fold change in this example, due to the log2 transformation of the input data). The Compute Multiple Testing option (indicated with an arrow above) will result in an individual multiple testing adjustment for each comparison, as opposed to the default global adjustment. This may be useful in instances where the p-value distributions are very different between comparisons in the data.
8. Return to the ANOVA results dashboard and examine the action buttons.
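To make the combined criterion concrete, the following Python sketch computes a 0/1 significance index for one comparison from raw p-values and log2 mean differences. It is an illustration under stated assumptions, not the JMP Genomics implementation: the function names are hypothetical, the p-values are adjusted for this comparison alone with the standard Benjamini-Hochberg step-up procedure (the FDR method selected earlier), and the 0.05 cutoff is an assumed value that this guide does not specify.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def significance_index(pvalues, diffs, alpha=0.05, min_abs_diff=1.0):
    """1 if a probe set passes both the adjusted p-value cutoff (alpha is an
    assumed value) and the mean difference filter, otherwise 0."""
    adjusted = bh_adjust(pvalues)
    return [1 if (q <= alpha and abs(d) > min_abs_diff) else 0
            for q, d in zip(adjusted, diffs)]

# Toy values for four probe sets in a single comparison.
pvals = [0.001, 0.02, 0.03, 0.8]
diffs = [2.1, -1.5, 0.4, 3.0]     # log2 differences; |diff| > 1 is > 2-fold
index = significance_index(pvals, diffs)
```

Here the third probe set passes the p-value cutoff but fails the 2-fold filter, and the fourth has a large difference but is not significant, so only the first two receive an index of 1.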

9. Click the button labeled Create Subset with Mean Difference and P-value Criteria. This dialog functions similarly to the one in the Test tab of the ANOVA dialog, except that instead of changing the way significance index variables are constructed, this dialog creates a subset data table from the ANOVA results. Select the first two Difference columns, and type MySubset in the optional name field.
10. Click OK. A new table is created, containing the 163 probe sets that satisfied the specified criteria for at least one of the two chosen comparisons.

Pattern Analysis

We will run three different pattern discovery analyses using the venn_wide data table we created earlier: hierarchical clustering, principal components, and cross-correlation.

Hierarchical Clustering Analysis

1. Select Pattern Discovery > Hierarchical Clustering from the Genomics Starter window.
2. Examine the General tab.
3. Choose the venn_wide data table from the output directory as the input data set.
4. Add a variable to the Label Variable field.
5. Add a variable to the Compare Variables field.

6. Type pr_: in the List-Style Specification box. This specifies that any variable in the data table whose name starts with the pr_ prefix is to be used for clustering.
7. Specify an Output Folder.
8. Select the Options tab.
9. Select Fast Ward as the Hierarchical Clustering Method. The Two-Way Clustering and Center Rows boxes should be checked by default.
10. Check the Standardize Variables (Columns) before Clustering check box.

The Color Theme for Heat Map can be selected prior to clustering. Leave the default for now. If you wish to change the color after clustering, that can be done with JMP tools.

11. Click Run. What do you see? Does it make sense?
12. Click Apply at the bottom of the Hierarchical Clustering dialog. This will keep these settings as defaults as we move to the next process.
13. Click Close All from the dashboard.

Principal Component Analysis

1. Select Pattern Discovery > Principal Components Analysis from the Genomics Starter. Note that because we clicked Apply in the previous process, the General tab is filled in automatically with the input table and output path.
2. Specify a variable in the Color Variables field.
3. Type pr_: in the List-Style Specification of Continuous Variables field.
4. Select the Options tab.
5. Complete the options as shown below:
6. Click Run.

What do you see?

7. Click Genomics > General Utilities > Clear Parameter Defaults.
8. Click Close All from the dashboard.

Cross Correlation Analysis

We are going to perform a cross-correlation analysis on all possible pairs of genes in this data set to see which genes are negatively correlated over time.

1. Select Pattern Discovery > Cross Correlation from the Genomics Starter.
2. Examine the General tab.
3. Choose the venn_wide data table from the output directory as the input data set.
4. Add a variable to the Label Variable field.
5. Add a variable to the Compare Variables field.
6. Type pr_: in the List-Style Specification box. This specifies that any variable in the data table whose name starts with the pr_ prefix is to be used in the analysis.
7. Specify an Output Folder.
8. Select the Anno1 tab.

9. Choose the cross_corr_anno data set provided. Note: You can improve processing time by not specifying an annotation data set.
10. Complete the Anno1 tab as shown below:
11. Select the Secondary Data tab.
12. Choose the venn_wide data set previously created.
13. Type pr_: in the List-Style Specification box.
14. Select the Anno2 tab.
15. Choose the cross_corr_anno data set provided.
16. Complete the Anno2 tab identically to the Anno1 tab.
17. Select the Analysis tab.
18. Complete the Analysis tab as shown below:

Note: The Cluster Heat Map checkbox should be unchecked when the data table is wide with many rows, as clustering can be memory intensive.

19. Select the Options tab. Complete the options as shown below:

The Process Group Size fields can help in managing memory usage. We are setting the minimum -log10(p-value) for correlations to appear in the output data set to 5. Please note that if you do not use a cutoff here, your output data set will include all possible correlations, many of which may be low and therefore not of great interest, and the output data set may be extremely large.

Alternatively, Multiple Testing Methods can be selected to filter the output data set. For large data sets, the p-value cutoff should be stringent. Keep in mind that when this option is chosen, the full data set with all pairwise correlations is still generated as a temporary file before being filtered using the specified multiple testing adjustment criteria. The hard drive on which your temporary files are located must have enough free space to accommodate this large intermediate data set.

20. Click Run. The sorted output file contains all correlations whose -log10(p-value) exceeded 5. Large negative correlations are found at the bottom of the table.
21. Select Rows > Data Filter.
22. In the Data Filter dialog, select Characteristics and Pearson_Correlation, click Add, then fill out the options as shown below:

34 23. Open the output graphics and examine the distribution plot. 24. Select the Plot Input Data for Selected Correlations drill down button. Click Plot Input Data. The resulting one-way plots show the control and estrogen-treated samples separately. This concludes the section on expression analysis. There are many more tools available in addition to those covered here. The dialogs for other processes are all very similar to the ones we reviewed earlier, so it should be straightforward to review other processes. Please note that you may always click the Load button at the bottom of each dialog to load sample settings for each process, and click Run to view sample output. Also, you may click the Process Description button at the top of the dialog to launch the section of the JMP Genomics User Guide documentation that contains detailed information about each process. Normalization JMP Genomics provides a number of different normalization procedures for expression and other array data. The commonly used normalization methods that will be reviewed in this section include mean, median, percentile, Loess, and quantile. Additionally, we will cover ratio analysis for 2-color arrays. JMP Genomics also has a set of normalization options intended for count data from RNA-seq studies, which are reviewed in the step-by-step guide Import and Analysis of Next-Generation Data. Page 34

35 Data Standardization 1. Select Expression > Normalization > Data Standardize from the Genomics Starter window. Mean, median, and percentile normalization are all options available within this process. 2. Load the AffymetrixLatinSquare setting. The Standardization Method pull-down menu lists numerous choices. When a method such as Percentile is used, the value of the percentile (e.g., 75) should be typed into the Numerical Parameter for Advanced Standardization Methods field. 3. Select the Options tab. Page 35

36 Rows (probes or probe sets) can be standardized, but in general, standardization is performed on the columns (arrays). When using mean or median normalization, you can also standardize to a subset of genes, specified by the Subset Data Set to Use for Normalization. Note that mean and median center the values for each array at 0. Loess Normalization 1. Select Expression > Normalization > Loess Normalization from the Genomics Starter. 2. Load the DrosophilaAgingExample sample setting. Page 36
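The column-wise mean and median standardization described in the Data Standardization section above can be sketched outside of JMP Genomics as follows. This is only an illustration under stated assumptions: the function name center_columns, the array names, and the intensity values are hypothetical, and the code is not the software's own implementation.

```python
from statistics import mean, median

def center_columns(arrays, method="median"):
    """Center each array (column) at 0 by subtracting its mean or median."""
    stat = {"mean": mean, "median": median}[method]
    centered = {}
    for name, values in arrays.items():
        center = stat(values)  # one center per array, computed across all probes
        centered[name] = [v - center for v in values]
    return centered

# Hypothetical log2 intensities: two arrays (columns), four probes (rows).
arrays = {
    "array_1": [7.0, 8.0, 9.0, 12.0],
    "array_2": [9.0, 10.0, 11.0, 14.0],
}

median_centered = center_columns(arrays, method="median")
mean_centered = center_columns(arrays, method="mean")
```

In this toy input the second array is uniformly brighter than the first by 2 units, so both arrays become identical after centering, which is the effect the standardization is meant to achieve.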

37 The smoothing parameter sets the proportion of the data to use in each segment. The lower the value, the stronger the normalization. Repeated iterations of fitting steps can be performed, if desired. 3. Select the Analysis tab. When no baseline reference array is specified, the average of all arrays is used for the baseline. Alternatively, you can select a single array to use as a baseline, but this is not a commonly used option. Similar to mean and median normalization, data can be Loess normalized to a subset of the data. 4. Select the Kernel tab. For data sets with a small number of values to be standardized (e.g., miRNA data), kernel density may be used as a weight for Loess modeling. The number of points in the X and Y grid may be selected along with the bandwidth multiplier. Increasing the number of grid points results in a smoother curve, but extends run times. Increasing the Bandwidth Multiplier increases the smoothness of the estimate. Increasing the Exponential Multiplier will create greater curvature in the estimate. 5. Select the Reference Set tab. If you would like to use a different data set for creating the reference baseline, you may Choose that data set here. This option may be used in instances where you wish to standardize across different data sets in preparation for predictive modeling. The names of the output data sets can be specified on the Options tab. Quantile Normalization 1. Select Expression > Normalization > Quantile Normalization from the Genomics Starter. 2. Load the DrosophilaAgingExample sample setting. Page 37
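To make the role of the smoothing parameter in the Loess Normalization section above more concrete, here is a simplified local linear regression in the style of loess. It is a sketch only: the function loess_fit, the span argument, and the data are hypothetical, there are no robustness iterations, and the way the fitted trend is removed from the array (fit the difference from the baseline, then subtract it) is an assumption for illustration rather than a description of what JMP Genomics does internally.

```python
def loess_fit(x, y, span=0.5):
    """Fit a weighted straight line around each x[i] using the nearest
    span * n points (tricube weights) and return the fitted values."""
    n = len(x)
    k = min(n, max(2, int(round(span * n))))  # points in each local window
    fitted = []
    for x0 in x:
        # The distance to the k-th nearest point sets the window half-width.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        # Weighted least squares for a line through the local window.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical baseline (in practice, e.g., the average of all arrays)
# and one array with an intensity-dependent bias relative to it.
baseline = [float(v) for v in range(1, 11)]
array = [0.5 + 1.2 * v for v in baseline]

# Assumed normalization step: estimate the trend of (array - baseline)
# as a function of the baseline, then subtract that trend from the array.
diff = [a - b for a, b in zip(array, baseline)]
trend = loess_fit(baseline, diff, span=0.5)
normalized = [a - t for a, t in zip(array, trend)]
```

A smaller span uses fewer neighbors per local fit, so the curve follows the data more closely and removes more of the array-specific structure; a larger span gives a smoother, gentler correction.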

38 There is a unique option in quantile normalization where the data from the autosomes can be normalized separately from the X and Y chromosomes. This is useful when working with copy number data, for example. The Kernel, Reference Set, and Options tabs are similar in function to those in the Loess Normalization dialog. Two Color Ratio Analysis JMP Genomics provides different options for two-color array analysis depending on the experimental design. For loop designs, a mixed-model ANOVA is appropriate, using the dye variable as a fixed effect and the array as a random effect, in addition to any other effects in the model. With a simple reference design, a ratio must first be calculated between the values of the experimental sample and the reference sample. 1. Select Expression > Quality Control > Ratio Analysis from the Genomics Starter. 2. Load the DrosophilaAgingExample sample setting. Page 38
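For readers unfamiliar with the idea behind the Quantile Normalization process described above, the textbook form of the algorithm can be sketched as follows. This is a generic illustration, not the JMP Genomics implementation: the function quantile_normalize and the data are hypothetical, ties are simply broken by original order, and the separate handling of autosomes and sex chromosomes is not shown.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted
    values across arrays, assigned back to each array by rank."""
    names = list(arrays)
    n = len(arrays[names[0]])
    sorted_cols = [sorted(arrays[name]) for name in names]
    # Reference distribution: average of the values at each rank.
    reference = [sum(col[r] for col in sorted_cols) / len(names)
                 for r in range(n)]
    normalized = {}
    for name in names:
        values = arrays[name]
        # Row indices ordered from the smallest to the largest value.
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized[name] = out
    return normalized

# Hypothetical intensities: two arrays (columns), four probes (rows).
arrays = {
    "array_1": [5.0, 2.0, 3.0, 4.0],
    "array_2": [4.0, 1.0, 6.0, 2.0],
}
qn = quantile_normalize(arrays)
```

After normalization each array contains exactly the same set of values (here 1.5, 2.5, 4.0, and 5.5), while the ranking of probes within each array is preserved.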

39 The feature variable is the probe identifier. Note that you can also perform within-array Loess normalization using a By Variable such as print tip. The experimental design file supplies the name of the ratio variable. Note that this does not have to be the dye variable, as often such experiments alternate the Cy3 and Cy5 dyes between the reference and experimental sample. The denominator value should be listed in the appropriate field. Otherwise, the software will automatically use alphanumerical order to select this value. 3. Select the Loess Normalization tab. The Perform Within Array Loess option should normally be checked so that intensity values are normalized prior to taking the ratio. The other functions are the same as in the Loess Normalization dialog. 4. Select the Options tab. On this tab, we highly recommend that you enter new names in the Output Experimental Design Data Set and the Ratio Data Set fields. In the new design table, the number of data columns is halved as a result of the ratio analysis. This concludes our discussion of basic expression analysis and normalization. Page 39
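As a closing illustration of the reference-design ratio calculation, the sketch below pairs the two channels of each array by sample type rather than by dye, so that a dye swap does not flip the sign of the ratio. All names and values are hypothetical (the channels and design dictionaries, the log_ratios function, and the "reference" label), and taking a log2 ratio of raw intensities is an assumption made for this example, not a statement of how the Ratio Analysis process computes its output.

```python
import math

# Hypothetical raw intensities for three features on two two-color arrays.
channels = {
    ("array_1", "Cy3"): [100.0, 400.0, 50.0],
    ("array_1", "Cy5"): [200.0, 400.0, 25.0],
    ("array_2", "Cy3"): [300.0, 100.0, 80.0],
    ("array_2", "Cy5"): [150.0, 100.0, 160.0],
}
# Hypothetical design: the ratio variable is the sample type, not the dye.
# The dyes are swapped between the samples on array_2.
design = {
    ("array_1", "Cy3"): "reference",
    ("array_1", "Cy5"): "experimental",
    ("array_2", "Cy3"): "experimental",
    ("array_2", "Cy5"): "reference",
}

def log_ratios(channels, design, denominator="reference"):
    """Return log2(experimental / reference) per feature for each array."""
    by_array = {}
    for key, values in channels.items():
        array, _dye = key
        role = "den" if design[key] == denominator else "num"
        by_array.setdefault(array, {})[role] = values
    return {
        array: [math.log2(n / d) for n, d in zip(parts["num"], parts["den"])]
        for array, parts in by_array.items()
    }

ratios = log_ratios(channels, design)
```

The four input columns collapse to two ratio columns, one per array, which mirrors the halving of data columns noted above.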