Introduction to robust calibration and variance stabilisation with VSN
Wolfgang Huber

January 5, 2016

Contents

1  Getting started
2  Running VSN on data from a single two-colour array
3  Running VSN on data from multiple arrays (single colour normalisation)
4  Running VSN on Affymetrix genechip data
5  Running VSN on RGList objects
   5.1  Background subtraction
   5.2  Print-tip groups
6  Missing values
7  Normalisation with spike-in probes
8  Normalisation against an existing reference dataset
9  The calibration parameters
   9.1  The calibration parameters and the additive-multiplicative error model
   9.2  More on calibration
10 Variance stabilisation without calibration
11 Assessing the performance of VSN
12 VSN, shrinkage and background correction
13 Quality assessment
   13.1  Stratifying factors such as print-tip, PCR plate, reporter-sequence
   13.2  "Most genes unchanged" assumption
14 Acknowledgements

1 Getting started

VSN is a method to preprocess microarray intensity data. This can be as simple as

> require("vsn")
> data("kidney")
> xnorm = justvsn(kidney)

where kidney is an ExpressionSet object with unnormalised data and xnorm the resulting ExpressionSet with calibrated and glog2-transformed data.

> M = exprs(xnorm)[,1] - exprs(xnorm)[,2]

produces the vector of generalised log-ratios between the data in the first and second column. VSN is a model-based method, and the more explicit way of doing the above is

> fit = vsn2(kidney)
> ynorm = predict(fit, kidney)

where fit is an object of class vsn that contains the fitted calibration and transformation parameters, and the method predict applies the fit to the data. The two-step protocol is useful when you want to fit the parameters on a subset of the data, e.g. a set of control or spike-in features, and then apply the model to the complete set of data (see Section 7 for details). Furthermore, it allows further inspection of the fit object, e.g. for the purpose of quality assessment.
Besides ExpressionSets, there are also justvsn methods for AffyBatch objects from the affy package and RGList objects from the limma package. They are described in this vignette.

The so-called glog2 (short for generalised logarithm, base 2) is a function that behaves like the logarithm (base 2) for large values (large compared to the amplitude of the background noise), but is less steep for smaller values. Differences between the transformed values are the generalised log-ratios. These are shrinkage estimators of the logarithm of the fold change. The usual log-ratio is another example of an estimator of the log fold change. (In statistics, the term estimator denotes an algorithm that calculates a value from measured data; this value is intended to correspond to the true value of a parameter of the underlying process that generated the data. Depending on the amount of available data and the quality of the estimator, this intention may be more or less well satisfied.) There is also a close relationship between background correction of the intensities and the variance properties of the different estimators. Please see Section 12 for more explanation of these issues.

Figure 1: Scatterplots of the kidney example data, which were obtained from a two-colour cDNA array by quantitating spots and subtracting a local background estimate. a) unnormalised and log2-transformed; b) normalised and transformed with VSN. Panel b shows the data from the complete set of 8704 spots on the array, panel a only the 7806 spots for which both red and green net intensities were greater than 0. Spots that are missing in panel a are coloured orange in panel b.

How does VSN work? There are two components: first, an affine transformation whose aim is to calibrate systematic experimental factors such as labelling efficiency or detector sensitivity; second, a glog2 transformation whose aim is variance stabilisation. An affine transformation is simply a shifting and scaling of the data, i.e. a mapping of the form x ↦ (x − a)/s with offset a and scaling factor s. By default, a different offset and a different scaling factor are used for each column, but the same for all rows within a column. There are two parameters of the function vsn2 to control this behaviour: with the parameter strata, you can ask vsn2 to choose different offsets and scaling factors for different groups ("strata") of rows. These strata could, for example, correspond to sectors on the array (see Section 5.2).
With the parameter calib, you can ask vsn2 to choose the same offset and scaling factor throughout (see Section 10). This can be useful, for example, if the calibration has already been done by other means, e.g. quantile normalisation. Note that VSN's variance stabilisation only addresses the dependence of the variance on the mean intensity. There may be other factors influencing the variance, such as gene-inherent properties or changes in the tightness of transcriptional control in different conditions. These need to be addressed by other methods.
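To make the two components concrete, here is a minimal base-R sketch of an affine calibration followed by a glog2 transformation. The parameter values (offset a, scale s, glog parameter c) are arbitrary illustrative numbers, not the ones vsn2 would estimate from data.

```r
## Sketch of VSN's two components with hand-picked (not fitted) parameters.
## glog2 behaves like log2 for large x, but is less steep near and below zero.
glog2 <- function(x, c = 50) log2(x + sqrt(x^2 + c^2)) - 1

## Affine calibration: shift by offset a, then divide by scaling factor s.
calibrate <- function(x, a, s) (x - a) / s

x <- c(-20, 0, 10, 1000, 1e6)
y <- glog2(calibrate(x, a = -5, s = 2))

## glog2 stays finite for zero and negative values, where log2 fails ...
stopifnot(all(is.finite(glog2(c(-20, 0)))))
## ... and approaches log2 for values large compared to c:
stopifnot(abs(glog2(1e6) - log2(1e6)) < 1e-3)
```

In the real vsn2 fit, a different offset and scale are estimated for each array (and, with strata, for each stratum), as described in Section 9.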
Figure 2: Standard deviation versus rank of the mean, and versus the mean, respectively.

2 Running VSN on data from a single two-colour array

The dataset kidney contains example data from a spotted cDNA two-colour microarray on which cDNA from two adjacent tissue samples of the same kidney was hybridised, one labelled in green (Cy3), one in red (Cy5). The two columns of the matrix exprs(kidney) contain the green and red intensities, respectively. A local background estimate was calculated by the image analysis software and subtracted, hence some of the intensities in kidney are close to zero or negative. (See Section 12 for more on the relationship between background correction and variance stabilising transformations.) In Figure 1 on page 2 you can see the scatterplot of the calibrated and transformed data. For comparison, the scatterplot of the log-transformed raw intensities is also shown.

> select = (0 == rowSums(exprs(kidney) <= 0))
> plot(log2(exprs(kidney)[select, ]),
+   main = "a) raw", pch = ".", asp = 1)
> plot(exprs(xnorm), main = "b) vsn",
+   pch = ".", asp = 1,
+   col = ifelse(select, "black", "orange"))

To verify the variance stabilisation, there is the function meanSdPlot. For each feature k = 1, ..., n it shows the empirical standard deviation σ̂_k on the y-axis versus the rank of the average µ̂_k on the x-axis, where

    µ̂_k = (1/d) Σ_{i=1}^{d} h_{ki},    σ̂²_k = (1/(d−1)) Σ_{i=1}^{d} (h_{ki} − µ̂_k)²    (1)

> meanSdPlot(xnorm, ranks = TRUE)
> meanSdPlot(xnorm, ranks = FALSE)

The two plots are shown in Figure 2 on page 3. The red dots, connected by lines, show the running median of the standard deviation. The aim of these plots is to see whether there is a systematic trend in the standard deviation of the data as a function of overall expression.
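The quantities in Equation (1) and the running-median trend line are easy to compute directly. The following base-R sketch mimics the diagnostic on simulated, already variance-stabilised data; the window sizes are illustrative choices, not necessarily the ones meanSdPlot uses.

```r
set.seed(1)
n <- 2000; d <- 4
## simulated matrix of transformed intensities h_ki: n features, d samples,
## with a constant noise level, i.e. variance already stabilised
h <- matrix(rnorm(n * d, mean = rep(runif(n, 4, 12), d), sd = 0.3), n, d)

mu  <- rowMeans(h)           # mu_k from Eqn. (1)
sdv <- apply(h, 1, sd)       # sigma_k from Eqn. (1), denominator d-1

## running median of the sd over windows along rank(mean), window width 10%
ord  <- order(mu)
win  <- floor(n/10)
mids <- seq(1, n - win, by = floor(n/20))
runmed <- sapply(mids, function(i) median(sdv[ord[i:(i + win)]]))

## for variance-stabilised data the trend line is roughly flat:
stopifnot(max(runmed) - min(runmed) < 0.1)
```

On data that are not variance stabilised (e.g. raw intensities), the same computation would show a clear trend of the running median with the rank of the mean.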
The assumption that underlies the usefulness of these plots is that most genes are not differentially expressed, so that the running median is a reasonable estimator of the standard deviation of feature-level data conditional on the mean. After variance stabilisation, this should be approximately a horizontal line. It may have some random fluctuations, but should not show an overall trend. If it does, that usually indicates a data quality problem, or is a consequence of inadequate prior data preprocessing. The rank ordering distributes the data evenly along the x-axis. A plot in which the x-axis shows the average intensities themselves is obtained by calling meanSdPlot with the argument ranks = FALSE; but this is less effective in assessing variance and hence is not the default. (The running median was computed with window width 10% and window midpoints 5%, 10%, 15%, .... The proper way to do this would be quantile regression, such as provided by the quantreg package; what is done here for these plots is simple, cheap and usually good enough due to the abundance of data.)

The histogram of the generalised log-ratios is shown in Figure 3 on page 4.

3 Running VSN on data from multiple arrays (single colour normalisation)

The package includes example data from a series of 8 spotted cDNA arrays on which cDNA samples from different lymphomas were hybridised together with a reference cDNA [7].

> data("lymphoma")
> dim(lymphoma)

Features  Samples
    9216       16

The 16 columns of the lymphoma object contain the red and green intensities, respectively, from the 8 slides, as shown in Table 1. Thus, the CH1 intensities are in columns 1, 3, ..., 15, the CH2 intensities in columns 2, 4, ..., 16. We can call justvsn on all of them at once:

> lym = justvsn(lymphoma)
> meanSdPlot(lym)

Again, Figure 4 on page 4 helps to visually verify that the variance stabilisation worked. As above, we can obtain the generalised log-ratios for each slide by subtracting the common reference intensities from those for the 8 samples:

> iref = seq(1, 15, by = 2)
> ismp = seq(2, 16, by = 2)
> M = exprs(lym)[, ismp] - exprs(lym)[, iref]
> A = (exprs(lym)[, ismp] + exprs(lym)[, iref])/2
> colnames(M) = lymphoma$sample[ismp]
> colnames(A) = colnames(M)
> j = "DLCL-0032"
> smoothScatter(A[, j], M[, j], main = j,
+   xlab = "A", ylab = "M", pch = ".")
> abline(h = 0, col = "red")

Figure 5 on page 8 shows the resulting M-A-plot [6] for one of the arrays.

Figure 3: Histogram of generalised log-ratios M for the kidney example data.

4 Running VSN on Affymetrix genechip data

The package affy provides excellent functionality for reading and processing Affymetrix genechip data, and you are encouraged to refer to the documentation of the package affy for more information about data structures and methodology.
The preprocessing of Affymetrix genechip data involves the following steps: (i) background correction, (ii) between-array normalisation, (iii) transformation and (iv) summarisation. The VSN method addresses steps (i)-(iii). For the summarisation, I recommend the RMA method [10], and a simple wrapper that provides all of these steps is the function vsnrma.

> require("affydata")
> data("Dilution")
> d_vsn = vsnrma(Dilution)

For comparison, we also run rma.

> d_rma = rma(Dilution)

The scatterplots produced by the code below are compared in Figure 6 on page 5.
Figure 4: Standard deviation versus rank of the mean for the lymphoma example data.

> par(pch = ".")
> ax = c(2, 16)
> plot(exprs(d_vsn)[, c(1,3)],
+   main = "vsn: array 1 vs 3",
+   asp = 1, xlim = ax, ylim = ax)
> plot(exprs(d_rma)[, c(1,3)],
+   main = "rma: array 1 vs 3",
+   asp = 1, xlim = ax, ylim = ax)
> plot(exprs(d_rma)[,1],
+   exprs(d_vsn)[,1],
+   xlab = "rma", ylab = "vsn",
+   asp = 1, xlim = ax, ylim = ax,
+   main = "array 1")
> abline(a = 0, b = 1, col = "#ff0000d0")

Both methods control the variance at low intensities, but we see that VSN does so more strongly. See also Section 12 for further discussion of the VSN shrinkage.

5 Running VSN on RGList objects

There is a justvsn method for RGList objects. Usually, you will produce an RGList from your own data using read.maimages from the limma package. Here, for the sake of demonstration, we construct an RGList from lymphoma.

> require("limma")
> wg = which(lymphoma$dye == "Cy3")
> wr = which(lymphoma$dye == "Cy5")
> lymRG = new("RGList", list(
+   R = exprs(lymphoma)[, wr],
+   G = exprs(lymphoma)[, wg]))
> lymNCS = justvsn(lymRG)

Figure 5: M-A-plot for slide DLCL-0032 from the lymphoma example data. A false-colour representation of the data point density is used; in addition, the 100 data points in the least dense regions are plotted as dots.

The justvsn method for RGList converts its argument into an NChannelSet, using a copy of the coercion method from Martin Morgan in the package convert. It then passes this on to the justvsn method for NChannelSet. The return value is an NChannelSet, shown in Table 2. Note that, due to the flexibility in the amount and quality of metadata that can be in an RGList, and due to differences in the implementation of these classes, the transfer of the metadata into the NChannelSet may not always produce the expected results; some checking, and often further dataset-specific postprocessing of the sample metadata and the array feature annotation, is needed.
For the current example, we construct the AnnotatedDataFrame object adf and assign it into the phenoData slot of lymNCS.

Figure 6: Results of vsnrma and rma on the Dilution example data. Array 1 was hybridised with 20µg RNA from liver, array 3 with 10µg of the same RNA.

> vmd = data.frame(
+   labelDescription = I(c("array ID",
+     "sample in G", "sample in R")),
+   channel = c("_ALL", "G", "R"),
+   row.names = c("arrayID", "sampG", "sampR"))
> arrayID = lymphoma$name[wr]
> stopifnot(identical(arrayID,
+   lymphoma$name[wg]))
> ## remove sample number suffix
> sampleType = factor(sub("-.*", "",
+   lymphoma$sample))
> v = data.frame(
+   arrayID = arrayID,
+   sampG = sampleType[wg],
+   sampR = sampleType[wr])
> v

  arrayID     sampG sampR
1 lc7b047 reference   CLL
2 lc7b048 reference   CLL
3 lc7b069 reference   CLL
4 lc7b070 reference   CLL
5 lc7b019 reference  DLCL
6 lc7b056 reference  DLCL
7 lc7b057 reference  DLCL
8 lc7b058 reference  DLCL

> adf = new("AnnotatedDataFrame",
+   data = v,
+   varMetadata = vmd)
> phenoData(lymNCS) = adf

Now let us combine the red and green values from each array into the glog-ratio M and use the linear modelling tools from limma to find differentially expressed genes (note that it is often suboptimal to only consider M, and that taking into account absolute intensities as well can improve analyses).

> lymM = (assayData(lymNCS)$R -
+   assayData(lymNCS)$G)
> design = model.matrix(~ lymNCS$sampR)
> lf = lmFit(lymM, design[, 2, drop = FALSE])
> lf = eBayes(lf)

Figure 7 on page 6 shows the resulting p-values and the expression profiles of the genes corresponding to the top 5 features.

> par(mfrow = c(1,2))
> hist(lf$p.value, 100, col = "orange")
> pdat = t(lymM[order(lf$p.value)[1:5], ])
> matplot(pdat,
+   lty = 1, type = "b", lwd = 2,
+   col = hsv(seq(0,1,length = 5), 0.7, 0.8),
+   ylab = "M", xlab = "arrays")

5.1 Background subtraction

Many image analysis programmes for microarrays provide local background estimates, which are typically calculated from the fluorescence signal outside, but next to, the features. These are not always useful. Just as with any measurement, these local
Figure 7: Left: histogram of p-values from the moderated t-test between the CLL and DLCL groups on the lymM values. Right: M-values for the 5 genes with the smallest p-values.

background estimates are also subject to random measurement error, and subtracting them from the foreground intensities will lead to increased random noise in the signal. On the other hand, doing so may remove systematic artifactual drifts in the data, for example a spatial gradient. So what is the optimal analysis strategy: should you subtract local background estimates or not? The answer depends on the properties of your particular data. VSN itself estimates and subtracts an overall background estimate (per array and colour, see Section 9), so an additional local background correction is only useful if there actually is local variability across an array, for example a spatial gradient.

Supposing that you have decided to subtract the local background estimates, how is it done? When called with the argument backgroundsubtract=TRUE (note that the default value for this parameter is FALSE), the justvsn method will subtract local background estimates in the Rb and Gb slots of the incoming RGList. To demonstrate this, we construct an RGList object lymRGwbg.

> rndbg = function(x, off, fac)
+   array(off + fac*runif(prod(dim(x))),
+     dim = dim(x))
> lymRGwbg = lymRG
> lymRGwbg$Rb = rndbg(lymRG, 100, 30)
> lymRGwbg$Gb = rndbg(lymRG, 50, 20)

In practice, of course, these values will be read from the image quantitation file by a function such as read.maimages that produces the RGList object. We can call justvsn

> lymESwbg = justvsn(lymRGwbg[, 1:3],
+   backgroundsubtract = TRUE)

Here we only do this for the first 3 arrays, to save compute time.
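The claim that background subtraction increases random noise can be checked with a small simulation (all numbers below are arbitrary): if the background estimate B is measured independently of the foreground F, then var(F − B) = var(F) + var(B), so the background-subtracted signal is always noisier than the foreground alone.

```r
set.seed(2)
true_signal <- 1000
F <- true_signal + rnorm(1e5, sd = 30)  # foreground with measurement noise
B <- 100        + rnorm(1e5, sd = 20)   # independent, noisy background estimate

## subtraction removes the systematic offset of 100 ...
## ... but the variances add: var(F - B) ~ 30^2 + 20^2 = 1300 > var(F) = 900
stopifnot(var(F - B) > var(F))
stopifnot(abs(var(F - B) - (30^2 + 20^2)) / (30^2 + 20^2) < 0.05)
```

Whether removing the systematic component (e.g. a spatial gradient) is worth the extra random noise is exactly the data-dependent trade-off described above.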
5.2 Print-tip groups

By default, VSN computes one normalisation transformation with a common set of parameters for all features of an array (separately for each colour if it is a multi-colour microarray); see Section 9. Sometimes there is a need for stratification by further variables of the array manufacturing process, for example print-tip groups (sectors) or microtitre plates. This can be done with the strata parameter of vsn2.
> lymNCS

NChannelSet (storageMode: lockedEnvironment)
assayData: 9216 features, 8 samples
  element names: G, R
protocolData: none
phenoData
  rowNames: (8 total)
  varLabels: arrayID sampG sampR
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation:

Table 2: The NChannelSet object lymNCS.

                   name      sample    dye
lc7b047.reference  lc7b047   reference Cy3
lc7b047.CLL-13     lc7b047   CLL-13    Cy5
lc7b048.reference  lc7b048   reference Cy3
lc7b048.CLL-13     lc7b048   CLL-13    Cy5
lc7b069.reference  lc7b069   reference Cy3
lc7b069.CLL-52     lc7b069   CLL-52    Cy5
lc7b070.reference  lc7b070   reference Cy3
lc7b070.CLL-39     lc7b070   CLL-39    Cy5
lc7b019.reference  lc7b019   reference Cy3
lc7b019.DLCL-0032  lc7b019   DLCL-0032 Cy5
lc7b056.reference  lc7b056   reference Cy3
lc7b056.DLCL-0024  lc7b056   DLCL-0024 Cy5
lc7b057.reference  lc7b057   reference Cy3
lc7b057.DLCL-0029  lc7b057   DLCL-0029 Cy5
lc7b058.reference  lc7b058   reference Cy3
lc7b058.DLCL-0023  lc7b058   DLCL-0023 Cy5

Table 1: The phenoData of the lymphoma dataset (output of pData(lymphoma)).

The example data that come with the package do not directly provide the information which print-tip each feature was spotted with, but we can easily reconstruct it:

> ngr = ngc = 4L
> nsr = nsc = 24L
> arrayGeometry = data.frame(
+   spotcol = rep(1:nsc,
+     times = nsr*ngr*ngc),
+   spotrow = rep(1:nsr,
+     each = nsc, times = ngr*ngc),
+   pin = rep(1:(ngr*ngc),
+     each = nsr*nsc))

and call

> EconStr = justvsn(lymRG[, 1],
+   strata = arrayGeometry$pin)

To save CPU time, we only call this on the first array. We compare the result to calling justvsn without strata,

> EsenzaStr = justvsn(lymRG[, 1])

A scatterplot comparing the transformed red intensities, using the two models, is shown in Figure 8 on page 8.
> j = 1L
> plot(assayData(EsenzaStr)$R[, j],
+   assayData(EconStr)$R[, j],
+   pch = ".", asp = 1,
+   col = hsv(seq(0, 1, length = ngr*ngc),
+     0.8, 0.6)[arrayGeometry$pin],
+   xlab = "without strata",
+   ylab = "print-tip strata",
+   main = sampleNames(lymNCS)$R[j])

6 Missing values

The parameter estimation algorithm of VSN is able to deal with missing values in the input data. To demonstrate this, we generate an ExpressionSet lym2 in which about 10% of all intensities are randomly missing,

> lym2 = lymphoma
> nfeat = prod(dim(lym2))
Figure 9: Scatterplots of fitted parameters; values on the x-axis correspond to normalisation without missing data (fit1), values on the y-axis to normalisation with 10% missing data (fit2).

> wh = sample(nfeat, nfeat/10)
> exprs(lym2)[wh] = NA
> table(is.na(exprs(lym2)))

and call vsn2 on it.

> fit1 = vsn2(lymphoma, lts.quantile = 1)
> fit2 = vsn2(lym2, lts.quantile = 1)

The resulting fitted parameters are not identical, but very similar, see Figure 9 on page 9.

> par(mfrow = c(1,2))
> for(j in 1:2){
+   p1 = coef(fit1)[,,j]
+   p2 = coef(fit2)[,,j]
+   d = max(abs(p1-p2))
+   stopifnot(d < c(0.05, 0.03)[j])
+   plot(p1, p2, pch = 16, asp = 1,
+     main = paste(letters[j],
+       ": max diff=", signif(d,2), sep = ""),
+     xlab = "no missing data",
+     ylab = "10% of data missing")
+   abline(a = 0, b = 1, col = "blue")
+ }

Note that p1 and p2 would differ more if we used a value other than 1 for the lts.quantile argument in the above calls of vsn2. This is because the outlier removal algorithm of vsn2 will, for this dataset, identify different sets of features as outliers for fit1 and fit2, and consequently the optimisation result will be slightly different; this difference is arguably negligible compared to the noise level in the data.

7 Normalisation with spike-in probes

Normally, VSN uses all features on the array to fit the calibration and transformation parameters, and the algorithm relies, to a certain extent, on the assumption that most of the features' target genes are not differentially expressed (see also Section 13.2). If certain features are known to correspond, or not to correspond, to differentially expressed targets, then we can help the algorithm by fitting the calibration and transformation parameters only to the subset of features for which the "not differentially expressed" assumption is most appropriate, and then applying the calibration and transformation to all features. For example, some experimental designs provide spike-in control spots for which we know that their targets' abundance is the same across all arrays (and/or colours). For demonstration, let us assume that in the kidney data, features 100 to 200 are spike-in controls. Then we can obtain a normalised dataset nkid as follows.

> spikeins = 100:200
> spfit = vsn2(kidney[spikeins, ],
+   lts.quantile = 1)
> nkid = predict(spfit, newdata = kidney)

Note that if we are sufficiently confident that the spikeins subset is really not differentially expressed, and also has no outliers for other, say technical, reasons, then we can set the robustness parameter lts.quantile to 1. This corresponds to no robustness (least sum of squares regression), but makes the most use of the data, and the resulting estimates will be more precise, which may be particularly important if the size of the spikeins set is relatively small.

Note that this explicit subsetting strategy is designed for features for which we have a priori knowledge that their normalised intensities should be unchanged. There is no need for you to devise data-driven rules such as using a first call to VSN to get a preliminary normalisation, identifying the least changing features, and then calling VSN again on that subset. This strategy is already built into the VSN algorithm and is controlled by its lts.quantile parameter. Please see Section 13.2 and reference [3] for details.

Figure 8: Scatterplot of normalised and transformed intensities for the red channel of array 1. Values on the x-axis correspond to normalisation without strata (EsenzaStr), values on the y-axis to normalisation with strata (EconStr). The different colours correspond to the 16 different strata.

8 Normalisation against an existing reference dataset

So far, we have considered the joint normalisation of a set of arrays to each other. What happens if, after analysing a set of arrays in this fashion, we obtain some additional arrays?
Do we re-run the whole normalisation again for the complete, new and bigger set of arrays? This may sometimes be impractical. Suppose we have used a set of training arrays for setting up a classifier that is able to discriminate different biological states of the samples based on their mRNA profile. Now we get new test arrays to which we want to apply the classifier. Clearly, we do not want to re-run the normalisation for the whole, new and bigger dataset, as this would change the training data; neither can we normalise only the test arrays among themselves, without normalising them towards the reference training dataset. What we need is a normalisation procedure that normalises the new test arrays towards the existing reference dataset without changing the latter.

To simulate this situation with the available example data, pretend that the Cy5 channels of the lymphoma dataset can be treated as 8 single-colour arrays, and fit a model to the first 7.

> ref = vsn2(lymphoma[, ismp[1:7]])

Now we call vsn2 on the 8th array, with the output from the previous call as the reference.

> f8 = vsn2(lymphoma[, ismp[8]],
+   reference = ref)
We can compare this to what we get if we fit the model to all 8 arrays,

> fall = vsn2(lymphoma[, ismp])
> coefficients(f8)[, 1, ]
> coefficients(fall)[, 8, ]

and compare the resulting values in the scatterplot shown in Figure 10: they are very similar.

> plot(exprs(f8), exprs(fall)[, 8],
+   pch = ".", asp = 1)
> abline(a = 0, b = 1, col = "red")

Figure 10: Scatterplot of normalised intensities after normalisation by reference (x-axis, f8) and joint normalisation (y-axis, fall). There is good agreement.

More details on this can be found in the vignettes "Verifying and assessing the performance with simulated data" and "Likelihood Calculations for vsn" that come with this package.

9 The calibration parameters

If y_ki is the matrix of uncalibrated data, with k indexing the rows and i the columns, then the calibrated data y'_ki are obtained through scaling by λ_si and shifting by α_si:

    y'_ki = λ_si · y_ki + α_si                                    (2)

Here, s ≡ s(k) is the so-called stratum for feature k. In the simplest case, there is only one stratum, i.e. the index s is always equal to 1, or may be omitted altogether. This amounts to assuming that the data of all features on an array were subject to the same systematic effects, such that an array-wide calibration is sufficient.

A model with multiple strata per array may be useful for spotted arrays. For these, stratification may be according to print-tip [6] or PCR plate [2]. For oligonucleotide arrays, it may be useful to stratify the features by physico-chemical properties, e.g. to assume that features of different sequence composition attract systematically different levels of unspecific background signal.

The transformation to a scale where the variance of the data is approximately independent of the mean is

    h_ki = arsinh(λ0 · y'_ki + α0)                                (3)
         = log( (λ0 y'_ki + α0) + sqrt((λ0 y'_ki + α0)² + 1) ),

with two parameters λ0 and α0. Equations (2) and (3) can be combined, so that the whole transformation is given by

    h_ki = arsinh( e^(b_si) · y_ki + a_si ).                      (4)

Here, a_si = α0 + λ0·α_si and b_si = log(λ0·λ_si) are the combined calibration and transformation parameters for features from stratum s and sample i. Using the parameter b_si as defined here, rather than e^(b_si), appears to make the numerical optimisation more reliable (less ill-conditioned).

We can access the calibration and transformation parameters through

> coef(fit)[1,,]
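The identity between the arsinh form and the log form in Equation (3), and the combination of Equations (2) and (3) into Equation (4), can be verified numerically. The parameter values below are arbitrary illustrative choices, not fitted ones.

```r
## Equation (3): arsinh(z) = log(z + sqrt(z^2 + 1))
z <- seq(-5, 5, by = 0.1)
stopifnot(all(abs(asinh(z) - log(z + sqrt(z^2 + 1))) < 1e-12))

## Equations (2)+(3) combined into (4):
## arsinh(lambda0*(lambda_si*y + alpha_si) + alpha0) == arsinh(e^b_si * y + a_si)
lambda0 <- 0.02; alpha0 <- 0.5; lambda_si <- 1.3; alpha_si <- -40
y <- seq(-100, 1e4, length.out = 1000)

a_si <- alpha0 + lambda0 * alpha_si      # combined offset
b_si <- log(lambda0 * lambda_si)         # combined log-scale

h1 <- asinh(lambda0 * (lambda_si * y + alpha_si) + alpha0)
h2 <- asinh(exp(b_si) * y + a_si)
stopifnot(all(abs(h1 - h2) < 1e-9))
```

This also illustrates why the calibrated-and-transformed values remain defined for negative y, unlike a plain logarithm.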
For a dataset with d samples and s strata, coef(fit) is a numeric array with dimensions (s, d, 2). For the example data that was used in Section 1 to generate fit, d = 2 and s = 1. coef(fit)[s, i, 1] is what was called a_si in Eqn. (4), and coef(fit)[s, i, 2] is b_si.

9.1 The calibration parameters and the additive-multiplicative error model

VSN is based on the additive-multiplicative error model [4, 5], which predicts a quadratic variance-mean relationship of the form [1]

    v(u) = (c1·u + c2)² + c3.                                    (5)

This is a general parameterisation of a parabola with three parameters c1, c2 and c3. Here, u is the expectation value (mean) of the signal, and v the variance. c1 is also called the coefficient of variation, since for large u, sqrt(v)/u ≈ c1. The minimum of v is c3; this is the variance of the additive noise component. It is attained at u = −c2/c1, which is the expectation value of the additive noise component; ideally this would be zero (c2 = 0), but in many applications it is different from zero. Typically, only the behaviour of v(u) for u ≥ −c2/c1 is relevant.

The parameters a and b from Equation (4) and the parameters of the additive-multiplicative error model are related by [1]

    a = c2 / √c3,      e^b = c1 / √c3.                           (6)

(I drop the indices s, k and i here, since for the purpose of this section they are passive.) This relationship is not 1:1, and it has a divergence at c3 → 0; both of these observations have practical consequences, as explained in the following.

1. The fact that Equations (6) do not constitute a 1:1 relationship means that multiple parameter sets of the additive-multiplicative error model can lead to the same transformation. This can be resolved, for example, if the coefficient of variation c1 is obtained by some other means than the vsn2 function.
For example, it can be estimated from the standard deviation of the VSN-transformed data, which is, in the approximation of the delta method, the same as the coefficient of variation [1, 3]. Then,

    c3 = c1² · e^(−2b),      c2 = c1 · a · e^(−b).               (7)

2. The divergence for c3 → 0 can be a more serious problem. In some datasets, c3 is in fact very small. This is the case if the size of the additive noise is negligible compared to the multiplicative noise throughout the dynamic range of the data, even for the smallest intensities. In other words, the additive-multiplicative error model is overparameterized, and a simpler multiplicative-only model would be good enough. VSN is designed to still produce reasonable results in these cases, in the sense that the transformation stabilizes the variance (it turns essentially into the usual logarithm transformation), but the resulting fit coefficients can be unstable. The assessment of the precision of the estimated values of a and b (e.g. by resampling, or by using replicate data) is therefore usually not very relevant; what is relevant is an assessment of the precision of the estimated transformation, i.e. how much the transformed values vary [3].

9.2 More on calibration

Now suppose the kidney example data were not that well measured, and the red channel had a baseline that was shifted by 500 and a scale that differed by a factor of 0.25:

> bkid = kidney
> exprs(bkid)[,1] = 0.25*(500 + exprs(bkid)[,1])

We can again call vsn2 on these data

> bfit = vsn2(bkid)
> plot(exprs(bkid), main = "raw",
+   pch = ".", log = "xy")
> plot(exprs(bfit), main = "vsn",
+   pch = ".")
> coef(bfit)[1,,]
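As a quick numerical sanity check of Equations (6) and (7) from Section 9.1, the mapping from the error-model parameters to (a, b) and back recovers c2 and c3 exactly. The values of c1, c2 and c3 below are arbitrary illustrative choices.

```r
c1 <- 0.1; c2 <- -8; c3 <- 400   # arbitrary error-model parameters (c3 > 0)

## Equation (6): map (c1, c2, c3) to the transformation parameters (a, b)
a <- c2 / sqrt(c3)
b <- log(c1 / sqrt(c3))

## Equation (7): recover c3 and c2 from (a, b), given c1
c3_back <- c1^2 * exp(-2*b)
c2_back <- c1 * a * exp(-b)

stopifnot(abs(c3_back - c3) < 1e-9, abs(c2_back - c2) < 1e-9)
```

The divergence discussed in item 2 is also visible here: as c3 → 0, both a and e^b blow up, even though the induced transformation changes only mildly.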
Figure 11: Scatterplots for badly biased data. Left: raw data on log-log scale; right: after calibration and transformation with vsn.

Notice the change in the parameter b of the red channel: it is now larger by about log(4) ≈ 1.4, and the shift parameter a has also been adjusted. The result is shown in Figure 11.

10 Variance stabilisation without calibration

It is possible to force λ_si = 1 and α_si = 0 for all s and i in Equation (2) by setting vsn2's parameter calib to "none". Hence, only the global variance stabilisation transformation (3) will be applied, but no column- or row-specific calibration. Here, I show an example where this feature is used in conjunction with quantile normalisation.

> lym_q = normalizeQuantiles(exprs(lymphoma))
> lym_qvsn = vsn2(lym_q, calib = "none")
> plot(exprs(lym_qvsn)[, 1:2], pch = ".",
+   main = "lym_qvsn")
> plot(exprs(lym)[, 1], exprs(lym_qvsn)[, 1],
+   main = "lym_qvsn vs lym", pch = ".",
+   ylab = "lym_qvsn[,1]", xlab = "lym[,1]")

The result is shown in Figure 12.

11 Assessing the performance of VSN

VSN is a parameter estimation algorithm that fits the parameters for a certain model. In order to see how good the estimator is, we can look at bias, variance, sample size dependence, and robustness against model misspecification and outliers. This is done in the vignette "Verifying and assessing the performance with simulated data" that comes with this package.

Practically, the more interesting question is how different microarray calibration and data transformation methods compare to each other. Two such comparisons were made in reference [1]: one with a set of two-colour cDNA arrays, one with an Affymetrix genechip dataset. Fold-change estimates from VSN led to higher sensitivity and specificity in identifying differentially expressed genes than a number of other methods.
A much more sophisticated and wider-scoped approach was taken by the Affycomp benchmark study. It uses two benchmark datasets: a Spike-In dataset,
Figure 12: The left panel shows the scatterplot between the red and green intensities of the first array of the lymphoma dataset after quantile normalisation followed by VSN variance stabilisation without calibration. The right panel compares the values from that method, for CH1 of the first array, to those of VSN variance stabilisation with affine calibration (lym was computed in Section 3).

in which a small number of cDNAs was spiked in at known concentrations, over a wide range of concentrations, on top of a complex RNA background sample; and a Dilution dataset, in which RNA samples from heart and brain were combined in a number of dilutions and proportions. The design of the benchmark study, which has been open for anyone to submit their method, was described in reference [8]. A discussion of its results was given in reference [9]. One of the results that emerged was that VSN compares well with the background correction and quantile normalisation method of RMA; both methods place a high emphasis on precision of the expression estimate, at the price of a certain bias (see also Section 12). Another result was that reporter-sequence-specific effects (e.g. the effect of GC content) play a large role in these data, and that substantial improvements can be achieved when they are taken into account (something which VSN does not do).

Of course, the two datasets that were used in Affycomp were somewhat artificial: they had fewer differentially expressed genes and were probably of higher quality than in most real-life applications. And, naturally, in the meantime the existence of this benchmark has led to the development of new processing methods in which a certain amount of overfitting may have occurred.

I would also like to note the interaction between normalisation/preprocessing and data quality.
For data of high quality, one can argue that any decent preprocessing method should produce more or less the same results; differences arise when the data are problematic, and when more or less successful measures may be taken by preprocessing methods to correct these problems.

12 VSN, shrinkage and background correction

Generalised log-ratios can be viewed as a shrinkage estimator: for low intensities in either the numerator or the denominator, they are smaller in absolute value than the standard log-ratios, whereas for large intensities, they become equal. Their advantage is that they do not suffer from the variance divergence of the standard log-ratios at small intensities: they remain well-defined and have limited variance when the data come close to zero or even become negative.

Figure 13: The shrinkage property of the generalised log-ratio. Blue diamonds and error bars correspond to mean and standard deviation of the generalised log-ratio h, as obtained from VSN, and black dots and error bars to the standard log-ratio q (both base 2).

An illustration is shown in Figure 13. Data were generated from the additive-multiplicative error model [3-5]. The horizontal line corresponds to the true log2-ratio 1 (corresponding to a factor of 2). For intensities x2 that are larger than about ten times the additive noise level σ_a, the generalised log-ratio h and the standard log-ratio q coincide. For smaller intensities, we can see a variance-bias trade-off: q has almost no bias, but a huge variance, thus an estimate of the fold change based on a limited set of data can be arbitrarily off. In contrast, h keeps a constant variance, at the price of systematically underestimating the true fold change. This is the main argument for using a variance-stabilising transformation. Note that there is also some bias in the behaviour of q for small x2, particularly at x2 = 0.5. This results from the occurrence of negative values in the data, which are discarded from the sampling when the (log-)ratio is computed.

Figure 14: Graphs of the functions y = log2(x), y = glog2(x, c) = log2(x + sqrt(x^2 + c^2)) - 1, and y = log2(x + xoff), where c = 50 and xoff = 50.

Please consult the references for more on the mathematical background [1-3]. It is possible to give a Bayesian interpretation: our prior assumption is the conservative one of no differential expression. Evidence from a feature with high overall intensity is taken strongly, and the posterior results in an estimate close to the empirical intensity ratio.
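The transformations compared in Figure 14 are easy to evaluate directly. The short base-R sketch below uses the same constants as the figure, c = 50 and xoff = 50, and checks the two properties discussed above: glog2 stays finite for zero or negative arguments, and it approaches log2 for large intensities.

```r
## The functions of Figure 14: log2(x), glog2(x, c), log2(x + xoff),
## with c = 50 and xoff = 50 as in the figure.
glog2 <- function(x, c) log2(x + sqrt(x^2 + c^2)) - 1

x <- c(-100, 0, 50, 1000, 1e5)
glog2(x, c = 50)                      # finite everywhere, even for x <= 0
suppressWarnings(log2(x + 50))        # NaN for x <= -50

## for x much larger than c, glog2(x, c) converges to log2(x)
abs(glog2(1e6, c = 50) - log2(1e6))   # practically zero
```

Because glog2 is strictly increasing and defined on the whole real line, it can be applied to background-subtracted intensities without discarding the negative values, which is the point made in the text above.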
Evidence from features with low intensity is downweighted, and the posterior is still strongly influenced by the prior.

13 Quality assessment

Quality problems can often be associated with physical parameters of the manufacturing or experimental process. Let us look a bit closer at the lymphoma data. Recall that M is the 9216 × 8 matrix of generalised log-ratios and A a matrix of the same size with the average glog2-transformed intensities. The data.frame arrayGeometry (from Section 5.2) contains, for each array feature, the identifier of the print-tip by which it was spotted and the row and column within the print-tip sector.

Figure 15: Boxplot of A values of array CLL-13 stratified by within-sector row. The features in rows 22 and 23 are all very dim.

Figure 15 shows the boxplots of A values of array CLL-13 stratified by row.

> colours = hsv(seq(0, 1, length=nsr), 0.6, 1)
> j = "CLL-13"
> boxplot(A[, j] ~ arrayGeometry$spotRow,
+   col=colours, main=j,
+   ylab="A", xlab="spotRow")

You may want to explore similar boxplots for other stratifying factors, such as column within print-tip sector or print-tip sector, and look at these plots for the other arrays as well. In Figure 15, we see that the features in rows 22 and 23 are all very dim. If we now look at these data in the M-A-plot (Figure 16), we see that these features not only have low A-values, but also fall systematically away from the M = 0 line.

> plot(A[, j], M[, j], pch=16, cex=0.3,
+   col=ifelse(arrayGeometry$spotRow %in% (22:23),
+     "orange", "black"))
> abline(h=0, col="blue")

Hence, in a naive analysis the data from these features would be interpreted as contributing evidence for differential expression, while they are more likely just the result of a quality problem. So what can we do? There are some options:

1. Flag the data of the affected features as unreliable and set them aside from the subsequent analysis.

2. Use a more complex, stratified normalisation method that takes into account the different row behaviours, for example, VSN with strata (see Section 5.2).

3. It has also been proposed to address this type of problem by using a non-linear regression on the A-values, for example the loess normalisation of reference [11], which simply squeezes the M-A-plot to force the centre of the distribution of M to lie at 0 along the whole A-range. An advantage of option 3 is that it works without knowing the real underlying stratifying factor.
However, it assumes that the stratifying factor is strongly confounded with A, and that biases that it causes can be removed through a regression on A. In the current example, if we believe that the real underlying stratifying factor is indeed row within
sector, this assumption means that (i) few of the data points from rows 22 and 23 have high A-values, and that (ii) almost all data points with very low A-values are from these rows; while (i) appears tenable, (ii) is definitely not the case.

13.1 Stratifying factors such as print-tip, PCR plate, reporter-sequence

By default, the VSN method assumes that the measured signal y_ik increases, to sufficient approximation, proportionally to the mRNA abundance c_ik of gene k on the i-th array, or in the i-th colour channel:

y_ik ≈ α_i + λ_i λ_k c_ik.   (8)

For a series of d single-colour arrays, i = 1, ..., d, and the different factors λ_i reflect the different initial amounts of sample mRNA or the different overall reverse transcription, hybridisation and detection efficiencies. The feature affinity λ_k contains factors that affect all measurements with feature k in the same manner, such as sequence-specific labelling efficiency. The λ_k are assumed to be the same across all arrays. There can be a non-zero overall offset α_i. For a two-colour cDNA array, i = 1, 2, and the λ_i take into account the different overall efficiencies of the two dyes.⁸

Equation (8) can be generalised to

y_ik ≈ α_is + λ_is λ_k c_ik,   (9)

that is, the background term α_is and the gain factor λ_is can be different for different groups s of features on an array. The VSN method allows for this option through the strata argument of the function vsn2. We have seen an example above where this could be useful. For Affymetrix GeneChips, one can find systematic dependences of the affinities λ_is and the background terms α_is on the reporter sequence; however, the optimal stratification of reporters based on their sequence is an active area of research.

⁸ It has been reported that for some genes the dye bias is different from gene to gene, such that the proportionality factor does not simply factorise as in (8).
As long as this only occurs sporadically, it will not have much effect on the estimation of the calibration and variance stabilisation parameters. Further, by using an appropriate experimental design, such as a colour-swap or reference design, the effects of gene-specific dye biases on subsequent analyses can be reduced. Nevertheless, there are situations in which either assumption (8) or (9) is violated, and these include:

Saturation. The biochemical reactions and/or the photodetection can be run in such a manner that saturation effects occur. It may be possible to rescue such data by using non-linear transformations. Alternatively, it is recommended that the experimental parameters be chosen to avoid saturation.

Batch effects. The feature affinities λ_k may differ between different manufacturing batches of arrays due, e.g., to different qualities of DNA amplification or printing. VSN cannot be used to simultaneously calibrate and transform data from different batches.

How to reliably diagnose and deal with such violations is beyond the scope of this vignette; see the references for more [2, 6].

13.2 Most genes unchanged assumption

With respect to the VSN model fitting, data from differentially transcribed genes can act as outliers (but they do not necessarily do so in all cases). The maximal number of outliers that do not gravely affect the model fitting is controlled by the parameter lts.quantile. Its default value is 0.9, which allows for 10% outliers. The value of lts.quantile can be reduced down to 0.5, which allows for up to 50% outliers. The maximal value is 1, which results in a least-sum-of-squares estimation that does not allow for any outliers. So why is this parameter lts.quantile user-definable, and why don't we just always use the most robust value of 0.5?
The answer is that the precision of the estimated VSN parameters is better the more data points go into the estimates, and this can be an issue especially for arrays with a small number of features.⁹ So if you are confident that the number of outliers is not that large, using a high value of lts.quantile can be justified. There has been confusion about the role of the most genes unchanged assumption, which presumes that only a minority of genes on the arrays is

⁹ More precisely, the number of features per stratum.
detectably differentially transcribed across the experiments. This assumption is a sufficient condition for there being only a small number of outliers, and these would not gravely affect the VSN model parameter estimation. However, it is not a necessary condition: the parameter estimates and the resulting normalised data may still be useful even if the assumption does not hold, as long as the effects of the data from differentially transcribed genes balance out.

14 Acknowledgements

I acknowledge helpful comments and feedback from Anja von Heydebreck, Richard Bourgon, Martin Vingron, Ulrich Mansmann and Robert Gentleman.

References

[1] W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, and M. Vingron. Variance stabilization applied to microarray data calibration and to quantification of differential expression. Bioinformatics, 18:S96-S104.

[2] W. Huber, A. von Heydebreck, and M. Vingron. Analysis of microarray gene expression data. Handbook of Statistical Genetics, Eds.: D. J. Balding, M. Bishop, C. Cannings. John Wiley & Sons, Inc. (2003). (preprint available from author)

[3] W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, and M. Vingron. Parameter estimation for the calibration and variance stabilization of microarray data. Statistical Applications in Genetics and Molecular Biology, Vol. 2, No. 1, Article 3.

[4] W. Huber, A. von Heydebreck, and M. Vingron. Error models for microarray intensities. Encyclopedia of Genomics, Proteomics and Bioinformatics, John Wiley & Sons, Inc. (2004). (preprint available from author)

[5] D. M. Rocke and B. Durbin. A model for measurement error for gene expression analysis. Journal of Computational Biology, 8.

[6] S. Dudoit, Y. H. Yang, T. P. Speed, and M. J. Callow. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12.

[7] A. A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403.

[8] L. M. Cope, R. A. Irizarry, H. A. Jaffee, Z. Wu, and T. P. Speed. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20.

[9] R. A. Irizarry, Z. Wu, and H. A. Jaffee. Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22.

[10] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4.

[11] Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30:e15.
> toLatex(sessionInfo())

- R Under development (unstable) (r69775), x86_64-pc-linux-gnu
- Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C
- Base packages: base, datasets, grDevices, graphics, methods, parallel, stats, utils
- Other packages: Biobase, BiocGenerics, affy, affydata, hgu95av2cdf, limma, vsn
- Loaded via a namespace (and not attached): AnnotationDbi, BiocInstaller, DBI 0.3.1, IRanges, KernSmooth, RSQLite 1.0.0, Rcpp, S4Vectors, affyio, colorspace 1.2-6, digest 0.6.8, ggplot, grid 3.3.0, gtable 0.1.2, hexbin, labeling 0.3, lattice, munsell 0.4.2, plyr 1.8.3, preprocessCore, scales 0.3.0, stats, tools 3.3.0, zlibbioc

Table 3: The output of sessionInfo on the build system after running this vignette.
EXPLORING SPATIAL PATTERNS IN YOUR DATA
EXPLORING SPATIAL PATTERNS IN YOUR DATA OBJECTIVES Learn how to examine your data using the Geostatistical Analysis tools in ArcMap. Learn how to use descriptive statistics in ArcMap and Geoda to analyze
ALLEN Mouse Brain Atlas
TECHNICAL WHITE PAPER: QUALITY CONTROL STANDARDS FOR HIGH-THROUGHPUT RNA IN SITU HYBRIDIZATION DATA GENERATION Consistent data quality and internal reproducibility are critical concerns for high-throughput
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)
Bioconductor R packages for exploratory analysis and normalization of cdna microarray data
1 This is page 1 Printer: Opaque this Bioconductor R packages for exploratory analysis and normalization of cdna microarray data Sandrine Dudoit Yee Hwa Yang Abstract This chapter describes a collection
MASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST
UNDERSTANDING The independent-samples t test evaluates the difference between the means of two independent or unrelated groups. That is, we evaluate whether the means for two independent groups are significantly
Viewing Ecological data using R graphics
Biostatistics Illustrations in Viewing Ecological data using R graphics A.B. Dufour & N. Pettorelli April 9, 2009 Presentation of the principal graphics dealing with discrete or continuous variables. Course
OPRE 6201 : 2. Simplex Method
OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2
X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)
CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.
DISCRIMINANT FUNCTION ANALYSIS (DA)
DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant
Machine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert May 3, 2016 Abstract FlipFlop implements a fast method for de novo transcript
Introduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
Penalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Tutorial 5: Hypothesis Testing
Tutorial 5: Hypothesis Testing Rob Nicholls [email protected] MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................
