Application Guidance: Single-Cell Data Analysis. Fluidigm Corporation. All rights reserved.


Limited License and Disclaimer for Fluidigm Systems with Fluidigm IFCs Except as expressly set forth herein, no right to copy, modify, distribute, make derivative works of, publicly display, make, have made, offer to sell, sell, use, or import a Fluidigm system or any other product is conveyed or implied with the purchase of a Fluidigm system (including the BioMark System, EP1 System, FC1 Cycler, or any components thereof), and Access Array IFCs, Dynamic Array IFCs and Digital Array IFCs (integrated fluidic circuits/microfluidic chips with or without a carrier), IFC controller, software, reagents, or any other items provided hereunder. This limited license permits only the use by the buyer of the particular product(s), in accordance with the written instructions provided therewith in the User Guide, that the buyer purchases from Fluidigm or its authorized representative(s). Except to the extent expressly approved in writing by Fluidigm or its authorized representative(s), the purchase of any Fluidigm product(s) does not by itself convey or imply the right to use such product(s) in combination with any other product(s). In particular, (i) no right to make, have made, or distribute other instruments, Access Array IFCs, Dynamic Array IFCs and Digital Array IFCs, software, or reagents is conveyed or implied by the purchase of the Fluidigm system, (ii) no right to make, have made, import, distribute, or use the Fluidigm system is conveyed or implied by the purchase of instruments, software, reagents, Digital Array IFCs from Fluidigm or otherwise, and (iii) except as expressly provided in the User Guide, the buyer may not use and no right is conveyed to use the Fluidigm system in combination with instruments, software, reagents, or Access Array IFCs, Dynamic Array IFCs and Digital Array IFCs unless all component parts have been purchased from Fluidigm or its authorized representative(s). 
For example, purchase of a Fluidigm system and/or the IFC controller conveys no right or license to patents covering the Access Array IFCs, Dynamic Array IFCs and Digital Array IFCs or their manufacture, such as 6,408,878, 6,645,432, 6,719,868, 6,793,753, 6,929,030, 7,494,555, 7,476,363, 7,601,270, 7,604,965, 7,666,361, 7,691,333, 7,749,737, 7,815,868, 7,867,454, 7,867,763; and EP Patent No. 1065378. Fluidigm IFCs may not be used with any non-fluidigm reader, and Fluidigm readers may not be used with any chip other than Fluidigm IFCs. Fluidigm IFCs are single use only and may not be reused unless otherwise specifically authorized in writing by Fluidigm. All Fluidigm products are licensed to the buyer for research use only. The products do not have FDA or other similar regulatory body approval. The buyer may not use the Fluidigm system, any component parts thereof, or any other Fluidigm products in any setting requiring FDA or similar regulatory approval or exploit the products in any manner not expressly authorized in writing by Fluidigm in advance. No other licenses are granted, expressed or implied. Please refer to the Fluidigm website at www.fluidigm.com for updated license terms. Fluidigm, the Fluidigm logo, BioMark, EP1, FC1, MSL, NanoFlex, Fluidline, Access Array, Dynamic Array, and Digital Array are trademarks or registered trademarks of Fluidigm Corporation in the U.S. and/or other countries. Fluidigm Product Patent Notice Fluidigm products including IFCs (integrated fluidic circuits/microfluidic chips with or without a carrier) such as Access Array IFCs, Dynamic Array IFCs and Digital Array IFCs, the IFC controller, FC1 Cycler and the Fluidigm system (BioMark System, EP1 System, readers, thermal cycler, etc.) 
and methods for reading and controlling the Access Array IFCs, Dynamic Array IFCs and Digital Array IFCs and/or their use and manufacture may be covered by one or more of the following patents owned by Fluidigm Corporation and/or sold under license from California Institute of Technology and other entities: U.S. Patent Nos. 6,408,878, 6,645,432, 6,719,868, 6,793,753, 6,929,030, 7,195,670, 7,216,671, 7,307,802, 7,323,143, 7,476,363, 7,479,186, 7,494,555, 7,588,672, 7,601,270, 7,604,965, 7,666,361, 7,691,333, 7,704,735, 7,749,737, 7,766,055, 7,815,868, 7,837,946, 7,867,454, 7,867,763, 7,906,072, 8,048,378; EP Patent Nos. 1065378, 1194693, 1195523 and 1345551; and additional issued and pending patents in the U.S. and other countries. Some Fluidigm IFC Controllers and associated IFCs may be licensed under Caliper Life Sciences. V5.3

Every effort has been made to avoid errors in the text, diagrams, illustrations, figures, and screen captures. However, Fluidigm assumes no responsibility for any errors that may appear in this publication. It is Fluidigm's policy to improve products as new techniques and components become available. Therefore, Fluidigm reserves the right to change specifications at any time. Information in this manual is subject to change without notice. Fluidigm assumes no responsibility for any errors or omissions. In no event shall Fluidigm be liable for any damages in connection with or arising from the use of this manual. No right to modify, copy, use, or distribute BioMark software is provided except in conjunction with the instrument delivered hereunder and only by the end user receiving such instrument. This software and the associated instrument are beta test systems. NO WARRANTIES ARE PROVIDED, EXPRESSED OR IMPLIED. ALL WARRANTIES, INCLUDING THE IMPLIED WARRANTIES OF FITNESS FOR PURPOSE, MERCHANTABILITY, AND NON-INFRINGEMENT ARE EXPRESSLY DISCLAIMED. By continuing the installation process, user agrees to these terms. Please refer to the full text of the software license agreement supplied with the installation media for this application.

Contacting Fluidigm
For Technical Support, send email to: TechSupport@fluidigm.com
Phone in United States: 1.866.FLUIDLINE (1.866.358.4354)
Outside the United States: 650.266.6100
On the Internet: www.fluidigm.com/support
Fluidigm Corporation, 7000 Shoreline Court, Suite 100, South San Francisco, CA 94080
P/N 100-5066, Rev. A1

Single-Cell Data Analysis
  Purpose of Document
Section 1
  Nature of Single-Cell Transcription
  Replicates
  Display of Data
  Limit of Detection
  Qualification of Assays Prior to Single-Cell Experiments
  Elimination of Cells or Genes from Subsequent Analysis
  Normalization
  Secondary Analysis
Section 2
  Qualification of Assays
  Primary Processing of Single-cell Data
Appendix 1: Protocol for the Qualification of Assays
  Preamplification
  Prepare 1:2 Dilutions
  qPCR Detection

Single-Cell Data Analysis

Purpose of Document

Single-cell researchers are currently using the Fluidigm BioMark System to measure gene expression levels for tens to hundreds of genes in hundreds to thousands of samples. The purpose of this document is to provide a practical guide to the minimum steps involved in using the BioMark System to obtain single-cell data. The focus is on how to obtain the data, rather than why it is important to study single cells. The steps in the workflow are: collection of single cells, synthesis of cDNA, preamplification, qPCR detection, primary data processing, and secondary data analysis. This guide uses one particular path through this workflow as an example. The choices available at each step in the process will be the topics of subsequent documentation. The path used here is FACS collection of single cells, cDNA synthesis and preamplification, and the use of DELTAgene assays for qPCR detection.

The guide is divided into two sections. The first section provides background information that is important for understanding the steps in the process. The second section is a step-by-step tutorial through the process.

Section 1

Nature of Single-Cell Transcription

Bengtsson et al. (2005) were among the first to use qPCR to quantify transcripts in single cells. For individual cells from the mouse pancreatic islets of Langerhans, they reported a lognormal distribution for transcripts from five genes. This lognormal distribution is illustrated in Fig. 1 of Bengtsson et al. (not replicated in this document), which shows histograms of expression levels of ActB using both log and linear scales. The shapes of the distributions in their Figure 1 are very similar to the shapes in Figure 4 below.

A lognormal distribution is characterized by its geometric mean rather than its arithmetic mean, and this has profound implications for the comparison of single-cell data to population data. Because of the lognormal distribution, the average expression level (arithmetic mean) observed for a population of cells is strongly biased by a few cells with a very high number of transcripts and, thus, the average expression level does not reflect the expression level in a typical cell. Bengtsson et al. conclude, "Accordingly, it may not be valid to extrapolate results of gene expression measurements on cell populations to the single-cell level."

The lognormal distribution means that data from single eukaryotic cells show cell-to-cell variation in mRNA amounts that ranges from 10- to 1000-fold depending on the gene and type of cell. Figure 1 in Bengtsson et al. indicates that the levels of ActB transcript vary approximately 1000-fold among the single cells analyzed. The histogram with the linear scale shows that only four cells have over 1000 transcripts per cell, whereas the bulk of the cells have a more modest number, with the largest bin being from zero to 100 transcripts per cell.
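The bias of the arithmetic mean described above can be illustrated with a small sketch. The transcript counts below are hypothetical values chosen only to mimic a skewed distribution with one very high cell; they are not data from Bengtsson et al. or from this guide.

```python
from statistics import geometric_mean, mean, median

# Hypothetical transcripts-per-cell values (illustrative only): most cells
# have modest counts, and one cell has a very high count.
transcripts_per_cell = [20, 40, 60, 80, 100, 150, 3000]

arithmetic = mean(transcripts_per_cell)           # what a bulk measurement reflects
geometric = geometric_mean(transcripts_per_cell)  # characterizes a lognormal distribution
typical = median(transcripts_per_cell)            # the middle cell

print(f"arithmetic mean: {arithmetic:.1f}")
print(f"geometric mean:  {geometric:.1f}")
print(f"median:          {typical}")
```

In this made-up example the single high cell pulls the arithmetic mean well above both the geometric mean and the median, which is the effect the text attributes to population-level averages.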

Fluidigm ran a similar single-cell experiment on a 96.96 Dynamic Array Integrated Fluidic Circuit (IFC), but analyzed a much higher number of genes. Data for 77 genes in 87 single human K562 cells, showing large fold-differences between individual cells, are presented in Figure 1.

Figure 1 Fluidigm experiment determined the number of genes exhibiting differential expression between individual cells, shown as fold-change (top X-axis labels) and equivalent ΔCq values (bottom X-axis labels).

These results indicate that 10- to >500-fold variation in transcript levels should be expected when comparing individual cells. Data such as these, collected by multiple researchers, have led to the model that eukaryotic transcripts are produced in short but intense bursts interspersed with intervals of inactivity during which transcript levels decay. Chubb et al. (2006) observed this burst-and-decay behavior in living Dictyostelium cells for the dscA gene. For this gene, they measured a mean burst duration of 5.2 min and a mean interval of inactivity of 5.8 min, but there was a great deal of stochastic variation around each of these averages. Raj et al. (2006) also directly observed intrinsically random bursting of mRNA for two genes in CHO cells.

This noise inherent in single-cell gene expression challenges conventional methods for obtaining and analyzing qPCR data. Factors such as replicates, data display, limit of detection, normalization, and univariate versus multivariate analysis need to be re-evaluated (see below). Some might think that this noise precludes the ability to get useful information from single cells. In fact, by acknowledging and addressing the intrinsic noise, single-cell gene expression profiling can be used to gain biological insights that are simply not possible when measuring average expression levels from hundreds or thousands of cells.

Replicates

Another way to assess the variation observed in single cells is to look at the standard deviation of various transcript levels in a population of single cells. For the experiment referred to in Figure 1, here are the standard deviations observed for 77 genes in a population of 87 cells.

Figure 2 Standard deviation of transcript levels across single cells

Only two genes show a standard deviation of less than 1 cycle between single cells. For experiments run using bulk RNA, the standard deviation observed for qPCR technical replicates is typically 0.16-0.25 cycles or less. In cases where biological noise is greater than technical noise by such a large amount, it is best to focus on biological replicates rather than technical replicates. Thus, reaction bandwidth is better utilized by running more single-cell samples and by running assays for more genes than by running technical replicates of the single-cell samples.

One way to restate the need for biological replicates is to say that data need to be collected from a statistically significant number of single cells in order to obtain reliable results. What is a statistically significant number of single cells? This is difficult to answer absolutely. Statistical significance depends not only on the number of cells, but also on the degree of variation within the population analyzed, the number of genes assayed, and the ability of those assays to differentiate the population variation, among other factors. Basic statistics would indicate that a homogeneous population can be well characterized on the basis of 30 samples. Thus, if every subpopulation within a sample of single cells were represented by at least 30 cells analyzed, one would have high confidence that the single-cell experiment would robustly identify all of the subpopulations. This would mean that if one wanted to reliably identify a subpopulation that was 10% of the total population, 300 cells should be examined.

In practice, subpopulations can be identified with fewer than 30 cells depending on the cells and genes being analyzed. Here is a Principal Component Analysis (PCA) from Guo et al. (2010), in which they analyzed 159 single cells from 64-cell stage mouse embryos, assaying 48 genes:

Figure 3 Principal Component Analysis (PCA) from Guo et al. (2010). Image reprinted with permission from Developmental Cell.

They were able to clearly identify the epiblast (EPI) subpopulation with only 17 cells in that subpopulation. They were able to do this because of the type of cells being analyzed, the use of 48 genes, and the fact that these 48 genes gave very distinct signatures between epiblast, primitive endoderm (PE), and trophectoderm (TE) cells.
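The 30-cells-per-subpopulation rule of thumb above reduces to simple arithmetic, sketched here. The function name and the use of exact fractions are illustrative choices, and the result is the number of cells at which the expected count in the subpopulation reaches the minimum, not a guarantee for any particular sample.

```python
import math
from fractions import Fraction

def cells_needed(subpop_fraction, min_cells_per_subpop=30):
    """Total cells to examine so that a subpopulation making up
    `subpop_fraction` of the population is expected to contribute at
    least `min_cells_per_subpop` cells."""
    # Fraction(str(...)) keeps decimal inputs such as 0.1 exact.
    fraction = Fraction(str(subpop_fraction))
    if not 0 < fraction <= 1:
        raise ValueError("subpop_fraction must be in (0, 1]")
    return math.ceil(min_cells_per_subpop / fraction)

print(cells_needed(0.10))  # the 10% example from the text
print(cells_needed(0.05))  # a rarer, 5% subpopulation
```

As the Guo et al. example shows, fewer cells can suffice when the assayed genes give very distinct signatures, so this figure is better read as a conservative planning number.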

Display of Data

When running qPCR experiments on bulk RNA samples, results are typically displayed as fold-change differences between samples for each individual gene. Because of the extensive variation at the single-cell level, looking at fold changes between individual cells is not very informative. What needs to be done first is to assess the population behavior for each gene. This is best done by looking at histograms that bin expression levels and display the number of cells in each bin (see Figure 4). Because of the lognormal distribution described by Bengtsson et al. (2005) and others, it is useful to view single-cell data as expression level above detection limit on a log scale. For qPCR data, it is convenient to do this in log base 2 by defining the term Log2Ex:

Log2Ex = LOD Cq - Cq[Gene]
If the value is negative, Log2Ex = 0

Here LOD is the Limit of Detection. Log2Ex represents transcript level above background expressed in log base 2. Conversion from a log scale to a linear scale can be accomplished by calculating 2^Log2Ex. Below is a histogram comparison of ACTB transcripts in 87 human K562 cells on both a log and linear scale:

Figure 4 Comparison of ACTB transcripts in 87 human K562 cells on both a log scale (A) and a linear scale (B)
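A minimal sketch of the Log2Ex calculation defined above follows. The function names, the example Cq values, and the treatment of a missing Cq (None) as "not detected" are assumptions made for illustration; the LOD Cq of 24 is likewise only an example value.

```python
def log2ex(cq, lod_cq):
    """Log2Ex = LOD Cq - Cq[Gene], set to 0 when the result is negative.

    Assumption: a missing measurement (None) is treated as not detected.
    """
    if cq is None:
        return 0.0
    return max(lod_cq - cq, 0.0)

def idealized_linear(log2ex_value):
    """Convert Log2Ex to a linear scale as 2^Log2Ex.

    The base of 2 assumes 100% PCR efficiency, so this is an idealized value.
    """
    return 2.0 ** log2ex_value

lod_cq = 24.0                          # example LOD Cq, not a recommendation
measured_cq = [14.0, 20.5, 24.0, 26.3, None]
values = [log2ex(cq, lod_cq) for cq in measured_cq]
print(values)                          # Cq at or above the LOD maps to 0
print([idealized_linear(v) for v in values])
```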

On the linear plot, the transcript level is expressed as "Idealized # of Transcripts Above Background". This is because the use of the value of 2 in 2^Log2Ex assumes 100% PCR efficiency in the preamplification and qPCR reactions, which is probably an overestimate. Note also that the background level on the linear plot is set at zero even though 2^Log2Ex = 1 when Log2Ex = 0.

The use of Log2Ex also facilitates plotting the number of cells where the transcript level is at or below the detection limit. Below is a comparison of IER3 transcripts in 87 human K562 cells where the IER3 transcript was not detected in 10 cells:

Figure 5 Comparison of IER3 transcripts in 87 human K562 cells where IER3 transcript was not detected in 10 cells

In order to compare histograms for multiple genes, it is convenient to use violin plots. Below are violin plots (essentially histograms turned on their side) from Guo et al. (2010) comparing 10 genes in 75 single cells derived from 16-cell stage mouse embryos:

Figure 6 Violin plots from Guo et al. (2010) comparing Log2Ex levels of 10 genes in 75 single cells. Image reprinted with permission from Developmental Cell.

Here, seven of the genes have monophasic distributions and three genes (Id2, Nanog, Sox2) have biphasic distributions. The monophasic distributions indicate no detectable variation other than intrinsic noise. The biphasic distributions indicate that these three genes are differentially expressed in subpopulations of these 75 cells. The vertical position of each histogram indicates the relative expression level. For example, ActB has the highest expression level among these 10 genes. It is also possible to see that transcripts can have distributions of varying widths. For example, Pou5f1 has a much narrower distribution (less variation) than Cdx2. This is a consequence of the fact that each gene will have its own characteristic burst size, burst frequency, and decay rate.

If the histogram indicates two or more subpopulations, it is now possible to get meaningful fold change values. For the Id2 gene in the violin diagram, the median Log2Ex value for the higher expressing subpopulation is roughly 7.5 and for the lower expressing subpopulation is roughly 1.8. Thus, the difference in Log2Ex between these two subpopulations is about 5.7, which corresponds to an approximately 50-fold difference (2^5.7) in expression levels.

Limit of Detection

The Log2Ex calculation requires defining a LOD Cq value. This raises the question of what the detection limit of qPCR is. In fact, there are two separate questions: What is the detection limit of the qPCR reaction by itself? And what is the detection limit of the overall process of going from single cell to RNA to cDNA to preamplified cDNA to qPCR reaction?

First, let's consider the detection limit of just the qPCR reaction. Based on digital PCR results using well-performing assays, it is clear that a single target DNA molecule in a reaction chamber will generate a positive amplification plot. That is why it is sometimes stated that the detection limit of PCR is one molecule. A more stringent definition of detection limit, though, would incorporate some indication of the confidence of detecting a target. If a number of identical PCR reactions are performed at an average concentration of one target DNA molecule per reaction chamber, then 37% of the reactions will not show a positive amplification plot because, under the Poisson distribution, that fraction of chambers receives no target molecule. This results in a 63% chance of detection. For stringent detection, at what concentration is there a 99% chance of generating a positive amplification plot? It turns out this occurs at an average concentration of 5 target molecules per reaction chamber, as shown by the following Poisson probability distribution:

Figure 7 Poisson distribution at average of 5 targets per chamber
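The Poisson figures quoted above (37% empty chambers at an average of one target, and a 99% chance of detection at an average of five) can be checked with a short sketch. It assumes, as the text does, that any chamber receiving at least one target molecule yields a positive amplification plot; the function names are illustrative.

```python
import math

def prob_detection(mean_targets):
    """Chance that a chamber holds at least one target: 1 - P(0) = 1 - e^(-mean)."""
    return 1.0 - math.exp(-mean_targets)

def poisson_pmf(k, mean_targets):
    """Poisson probability of exactly k targets in a chamber."""
    return math.exp(-mean_targets) * mean_targets ** k / math.factorial(k)

def mean_for_detection(probability):
    """Average targets per chamber needed for the given chance of detection."""
    return -math.log(1.0 - probability)

print(f"P(empty) at mean 1:  {poisson_pmf(0, 1):.2f}")
print(f"P(detect) at mean 1: {prob_detection(1):.2f}")
print(f"P(detect) at mean 5: {prob_detection(5):.3f}")
print(f"mean needed for 99%: {mean_for_detection(0.99):.2f}")
```

The exact mean for a 99% chance of detection works out to about 4.6 targets per chamber, so 5 is the smallest whole number of targets that meets the criterion.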

Thus, a stringent definition of LOD Cq would be the Cq value that corresponds to 5 targets per reaction chamber, because this would correspond to a 99% chance of detection with one replicate. This is a stringent definition because it minimizes the number of false positives, but it does this at the expense of excluding true positives. In other words, you can be very confident that a positive really is a positive, but you have somewhat impaired sensitivity. In order to explore the effects of sensitivity on results, data can be analyzed using different values for LOD Cq that range from stringent to relaxed. For example, the data used in Section 2 indicate that 22 cycles is a stringent LOD Cq value. Thus, Log2Ex values could be calculated using LOD Cq = 22, 23, 24, or 25, and each of these data sets analyzed to see if differing stringency has any effect on the conclusions reached.

In the single-cell gene expression workflow, the qPCR reactions are preceded by preamplification of cDNA. Therefore, the next step is to explore how preamplification affects limit of detection. For Dynamic Array IFCs, 5 target molecules per reaction chamber corresponds to 625 molecules/µl in the 48.48 IFC and 730 molecules/µl in the 96.96 IFC. How much preamplification is required to produce these concentrations of target molecules, and thus attain a 99% chance of detection with a single replicate? The table below shows the number of molecules generated from a single cDNA molecule at various PCR efficiencies and cycle numbers. At the end of preamplification, the sample is typically diluted to a final volume of 100 µl. Before dispensing into the Sample Inlets of a Dynamic Array IFC, this preamplification sample is further diluted by combining with other reaction components in the ratio: 2.5 µl 2× qPCR Master Mix (MM) + 0.25 µl 20× Sample Loading Reagent + 2.25 µl preamplification sample.
The last two columns show the effect of these dilutions on target concentration:

Pre-amp      Efficiency   Number of    Concentration (molecules/µl)
PCR cycles                molecules    In 100 µl    After dilution with master mix
19           100%         262,144      2,621        1,180
20           90%          197,842      1,978        890
21           85%          220,513      2,205        990

Table 1 Molecules generated from a single cDNA molecule at various PCR efficiencies and cycle numbers
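The Table 1 arithmetic can be reproduced with the sketch below. One point is an assumption rather than something the guide states: the molecule counts in the table match (1 + efficiency)^(cycles - 1), which would be consistent with one cycle being spent converting the single-stranded cDNA into a double-stranded template. The function names are illustrative; the dilution volumes and the 625 and 730 molecules/µl thresholds are taken from the text.

```python
# Concentrations corresponding to 5 targets per reaction chamber (from the text)
THRESHOLD_PER_UL = {"48.48": 625.0, "96.96": 730.0}

def molecules_after_preamp(cycles, efficiency):
    """Copies from one cDNA molecule.

    Assumption: (1 + E)^(cycles - 1), the form that reproduces Table 1.
    """
    return (1.0 + efficiency) ** (cycles - 1)

def concentration_in_chip_mix(cycles, efficiency, final_volume_ul=100.0,
                              sample_ul=2.25, mix_total_ul=5.0):
    """Molecules/µl after dilution to 100 µl and mixing 2.25 µl sample into 5 µl."""
    per_ul = molecules_after_preamp(cycles, efficiency) / final_volume_ul
    return per_ul * sample_ul / mix_total_ul

for cycles, eff in [(19, 1.00), (20, 0.90), (21, 0.85)]:
    conc = concentration_in_chip_mix(cycles, eff)
    ok = all(conc > t for t in THRESHOLD_PER_UL.values())
    print(f"{cycles} cycles at {eff:.0%}: {molecules_after_preamp(cycles, eff):,.0f} "
          f"molecules, {conc:,.0f} molecules/µl, exceeds both thresholds: {ok}")
```

The table reports the final concentrations rounded to the nearest 10 molecules/µl, so the printed values differ from it slightly in the last digit.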

In these cases, the number of cycles was picked so that the final concentration exceeds the concentration that corresponds to 5 target molecules per reaction chamber in either 48.48 or 96.96 IFCs. Thus, if preamplification efficiency is at least 90%, then 20 cycles of preamplification should ensure a greater than 99% probability of detecting one original cDNA molecule with one replicate.

Applied Biosystems does not provide a specification for the PCR efficiency of its TaqMan PreAmp Master Mix. In the protocol for this master mix (P/N 4384557), they do state on p. 21: "Typically, 90% of targets produce C t values within ±1.5." Here are the expected ΔCq (or ΔCt) values, relative to 100% efficiency, after 14 cycles of preamplification (as prescribed in the manual) if the only source of variation is preamplification efficiency:

Efficiency   100%   95%   90%   85%   80%
ΔCq          0.0    0.5   1.0   1.6   2.1

Table 2 Preamplification efficiency

The fact that the ±1.5 value must include sources of variation other than PCR efficiency means it is likely that TaqMan PreAmp Master Mix achieves at least 90% efficiency for 90% of assays. Furthermore, the validation results using standard RNAs reported by Devonshire et al. (2011) show that preamplification can have efficiencies close to 100%.

The foregoing discussion indicates that the single-cell protocol should be fairly robust even if only a single cDNA molecule is generated in the reverse transcriptase reaction. Of course, the overall limit of detection is critically dependent on the efficiency of the reverse transcriptase. Furthermore, this efficiency probably varies depending on the transcript and the location of the assay amplicon within the transcript. Reverse transcriptase efficiency is a factor that deserves closer scrutiny, but it will not be explored here. Also, the overall availability of RNA after cell lysis will have an effect on the limit of detection for single-cell gene expression.
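Returning to Table 2 above, its values can be reproduced with the sketch below. The formula is an inference from the table, not something the guide states: it assumes the shift is the number of extra cycles, at perfect doubling, needed to make up the shortfall between (1 + E)^14 and 2^14 copies. The function name is illustrative.

```python
import math

def delta_cq(efficiency, preamp_cycles=14):
    """Cq shift relative to 100% preamplification efficiency.

    Assumption: shift = cycles * (1 - log2(1 + E)), i.e. log2 of the ratio
    2^cycles / (1 + E)^cycles, with the later qPCR read-out doubling perfectly.
    """
    return preamp_cycles * (1.0 - math.log2(1.0 + efficiency))

for eff in (1.00, 0.95, 0.90, 0.85, 0.80):
    print(f"{eff:.0%}: {delta_cq(eff):.1f}")
```

Under this reading, a shift of 1.5 cycles after 14 cycles of preamplification corresponds to an efficiency somewhat above 85%, in line with the text's conclusion that most assays probably reach at least 90% once other sources of variation are accounted for.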
Qualification of Assays Prior to Single-Cell Experiments

There are two reasons to test assays on cDNA prepared from bulk RNA before embarking on analyzing single cells. First, when using DNA binding dye assays such as DELTAgene assays, the data are used to decide on an acceptable Tm range for the amplicon generated by each assay. For this purpose, it is best to use bulk RNA that is from the same or similar cells as the single cells to be studied so that splice variants will be the same as expected in the single cells. If this is not available, then appropriate tissue-specific or universal RNA or cDNA can be purchased from various vendors. Second, the data are used to estimate a LOD Cq value to be used in data analysis. These two properties, Tm and LOD Cq, are characteristics of the qPCR assay and not of the reverse transcriptase step or the preamplification step. Therefore, this qualification test is performed using dilutions of preamplified cDNA in order to focus just on the qPCR assays. For the purpose of estimating a LOD Cq value, six replicates of each dilution sample are run.

As discussed above, a stringent LOD Cq value would be the Cq corresponding to 5 target molecules per reaction chamber. At this low concentration, there is a great deal of stochastic noise due to the Poisson distribution, which affects both detection and the actual Cq value. Therefore, the goal is to estimate a reasonable LOD Cq value without precisely determining the Cq corresponding to 5 target molecules per reaction chamber. For each assay, a preliminary LOD Cq is determined by taking the average Cq for the most dilute sample that has positive amplification plots for all six replicates. Here is a table showing the probability that all six replicates are positive as a function of concentration:

Average target molecules per reaction chamber      1      2      3      4      5      6      7      8
Probability that all six replicates are positive   0.064  0.418  0.736  0.895  0.960  0.985  0.995  0.998

Table 3 Probability that all six replicates have positive amplification plot

Thus, the preliminary LOD Cq values determined for each assay probably correspond to concentrations ranging from 2 to 10 target molecules per reaction chamber. The overall LOD Cq is then selected as the median of all the preliminary LOD Cq values, rounded up to the next highest whole cycle. Because of the approximate nature of this LOD Cq value, it is probably acceptable to use it for any subsequent assays, even if they were not run in this experiment. Finally, the exact LOD Cq value is somewhat arbitrary and probably will not have a drastic effect on the overall interpretation of a single-cell experiment. As discussed above, this can be tested by first using the stringent LOD Cq value described here and then going back, increasing the value in one-cycle increments, and seeing how this affects the overall results.
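The probabilities in Table 3, and the selection of the overall LOD Cq, can be sketched as follows. The sketch assumes that replicates are independent and that each is positive whenever its chamber holds at least one target. The preliminary LOD Cq values in the example are hypothetical, and reading "rounded up to the next highest whole cycle" as a ceiling is an interpretation.

```python
import math
from statistics import median

def prob_all_positive(mean_targets, replicates=6):
    """Chance that every replicate is positive: (1 - e^(-mean))^replicates."""
    return (1.0 - math.exp(-mean_targets)) ** replicates

def overall_lod_cq(preliminary_lod_cqs):
    """Median of the per-assay preliminary LOD Cq values, rounded up to a whole cycle."""
    return math.ceil(median(preliminary_lod_cqs))

table3 = [round(prob_all_positive(m), 3) for m in range(1, 9)]
print(table3)

# Hypothetical preliminary LOD Cq values for three assays (illustrative only)
print(overall_lod_cq([21.3, 21.8, 22.4]))
```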
Elimination of Cells or Genes from Subsequent Analysis

Using low expression of a single control gene is probably not a reliable method for culling cells from the dataset because the level of any single gene can exhibit a wide degree of variation. Therefore, the method described in Section 2 below uses abnormally low expression of two genes. One way to do this is to include two highly expressed control genes in the set of assays used to interrogate the cells. In addition to being highly expressed, the two control genes should not be expected to be differentially expressed in the cells being studied. After calculating Log2Ex values, look at expression histograms of the two control genes to confirm that their transcript distribution is monophasic. Then, calculate MEDIAN and STDEV for the two control genes across all the single cells. For each control gene, determine a Cutoff value by calculating MEDIAN - 3*STDEV. If the measured values are lower than the Cutoff values for both control genes, then eliminate that cell from further analysis.
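The culling rule above can be sketched as follows, under two assumptions that the guide does not spell out: the cutoff is MEDIAN - 3*STDEV computed on Log2Ex values (so that values below the cutoff mean abnormally low expression), and STDEV is the sample standard deviation. The gene names and Log2Ex values are hypothetical.

```python
from statistics import median, stdev

def cells_to_eliminate(log2ex_by_gene, control_genes, n_sd=3.0):
    """Indices of cells whose Log2Ex is below MEDIAN - n_sd*STDEV for
    every control gene. `log2ex_by_gene` maps gene -> per-cell values."""
    cutoffs = {
        gene: median(log2ex_by_gene[gene]) - n_sd * stdev(log2ex_by_gene[gene])
        for gene in control_genes
    }
    n_cells = len(log2ex_by_gene[control_genes[0]])
    return [
        cell for cell in range(n_cells)
        if all(log2ex_by_gene[g][cell] < cutoffs[g] for g in control_genes)
    ]

# Hypothetical data for 20 cells: cell 0 is low for one control gene only,
# cell 19 is low for both.
data = {
    "controlA": [2.0] + [12.0] * 18 + [2.0],
    "controlB": [10.0] * 19 + [1.0],
}
flagged = cells_to_eliminate(data, ["controlA", "controlB"])
print(flagged)
```

Requiring both control genes to fall below their cutoffs keeps a cell that is low for only one of them, which is the point of using two genes rather than one.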

In terms of eliminating genes from further analysis, genes that are not detected in any of the single cells in the study can clearly be eliminated. Optionally, genes expressed in fewer than 5% or 10% of the single cells can be eliminated. Usually, though, these lowly expressed genes do not adversely affect multivariate analyses such as hierarchical clustering or principal component analysis, so it is probably safer to leave these genes in the data set.

Normalization

It is not necessary to normalize Log2Ex data on a per-cell basis. In fact, many single-cell publications have not used a cell-based normalization. Normalization should be considered a variable that is tried to see if it has any significant effect on the analysis of the expression data. One way that normalization might be beneficial is if it reduces variation due to differing cell size. Normalizing to a single reference gene that is varying 10- to 1000-fold at the single-cell level does not make much sense. Guo et al. (2010) normalized using the average of ActB and Gapdh Log2Ex values. Vandesompele et al. (2002) describe the geNorm method, which is a robust way to use multiple reference genes to determine a normalization factor. One way to obtain a normalization factor that uses data from all the genes in the study is to normalize so that each cell has the same median Log2Ex value for all the genes detected in that cell.

Here is one example where normalization does not seem to have much effect on data analysis. Guo et al. (2010) performed principal component analysis on expression data from 159 single cells derived from 64-cell stage mouse embryos. Prior to this analysis, they normalized their data using the average of ActB and Gapdh Log2Ex values. Here, their principal component analysis has been repeated using unnormalized data and Median Log2Ex normalized data:

Figure 8 Example in which normalization does not have much effect on data analysis from Guo et al. (2010)

The distributions of single cells in these three plots do not seem to be significantly different, indicating that normalization would have little effect on data interpretation in this particular case.
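One possible reading of the Median Log2Ex normalization described above is sketched here. The details are assumptions: each cell's detected genes (Log2Ex > 0) are shifted by a constant so that every cell ends up with the same median, the common target is taken as the median of the per-cell medians, and undetected genes are left at 0. The guide does not specify these choices, and the example values are hypothetical.

```python
from statistics import median

def median_normalize(cells):
    """`cells` is a list of per-cell lists of Log2Ex values (same gene order).

    Shifts each cell's detected values so that all cells share one median.
    Cells with no detected genes are returned unchanged.
    """
    medians = []
    for cell in cells:
        detected = [v for v in cell if v > 0]
        medians.append(median(detected) if detected else None)
    target = median([m for m in medians if m is not None])
    normalized = []
    for cell, cell_median in zip(cells, medians):
        if cell_median is None:
            normalized.append(list(cell))
            continue
        shift = target - cell_median
        normalized.append([v + shift if v > 0 else 0.0 for v in cell])
    return normalized

# Hypothetical Log2Ex values: three cells, four genes
cells = [[10, 8, 6, 0], [12, 10, 8, 0], [11, 9, 7, 5]]
normalized = median_normalize(cells)
print(normalized)
```

Because a shift in Log2Ex is a constant scaling on the linear scale, this treats every detected gene in a cell as scaled by the same factor, which is how such a normalization could absorb a cell-size effect.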

Here is an example where normalization might be useful for reducing variation in the data. Below are the distributions of ACTB transcripts in 87 single K562 cells using unnormalized and Median Log2Ex normalized data:

Figure 9 Unnormalized

Figure 10 Normalized

In this case, normalization reduces the standard deviation from 0.84 to 0.64, indicating some reduction in variation.

Secondary Analysis

Use of the ΔΔCq method (Livak and Schmittgen, 2001) may not be the best method for identifying differences among the single cells being analyzed. Even if normalization issues are addressed by using data from multiple genes (see Normalization above), the ΔΔCq method focuses on genes one at a time. With expression of each gene varying 10- to 1000-fold, it may be difficult to discern reliable patterns in data from any single gene. For lower expressed genes, there is also the fact that a transcript may not be detected in a particular cell purely due to stochastic noise and not due to the inherent identity of that cell. Rather, using some form of multivariate analysis, such as hierarchical clustering or principal component analysis, will probably be more fruitful in identifying subpopulations with similar gene expression signatures.

The purpose of this document is to focus on the minimum steps required to process single-cell data in order to make it ready for secondary analysis, rather than to explore all the different methods of secondary analysis that are available. In order to provide some guidance, what follows is a list of some of the publications that have used the BioMark System to obtain single-cell gene expression data and the methods of secondary analysis that were used in each paper. This is a good place to start in trying to find ways to analyze single-cell data in order to discover biological insights.

Publications: Guo et al. (2010); Flatz et al. (2011); Dalerba et al. (2011); Diehn et al. (2009); Pang et al. (2011); Vincent et al. (2011); Aguilo et al. (2011)
Methods compared: Plus/Minus; Pairwise Correlation; Hierarchical Clustering; Principal Component Analysis (PCA); Linear Discriminant Analysis (LDA); Decision Tree Analysis

Table 4 Publications that used BioMark System to obtain single-cell gene expression data and a comparison of secondary analysis methods

Section 2

For the examples shown below, Heat Map Results were exported from the BioMark software to Microsoft Excel. For those adept at using pivot tables, data can be exported as Table Results and similar processing can be performed.

Qualification of Assays

1 Run first chip using procedure in Appendix 1.
a Annotate samples and detectors in the Sample Setup and Detector Setup windows, respectively.
2 Set Tm range for each assay.
a In the Heat Map View of the BioMark Real-Time PCR Analysis software, select the first column (first assay).
Figure 11 First column selected
b In the tool bar just below the Heat Map, select the list under Full Range, then select Auto Range.

Figure 12 Auto Range selected

c In the tool bar just below the Heat Map, select Threshold, then select Edit. This activates the movable vertical temperature lines in the Melting Curve window. By default, the Minimum Temperature line is at 60 ºC and the Maximum Temperature line is at 95 ºC.
Figure 13 Threshold > Edit selected

d Select the Minimum Temperature line and move it until it intersects the Tm peak about halfway between baseline and peak. Only set the Minimum Temperature line. Leave the Maximum Temperature line at 95 ºC.
Figure 14 Temperature line at 95
e Select the second column (assay) and repeat the process.
After the Tm range has been set for all the assays, the file must be analyzed again to store all the Tm ranges as part of the chip run file. To do so:
f Make sure that the Baseline Correction Method is still set to Linear (Derivative).
g Make sure the Ct Threshold Method is still set to Auto (Global).
h Click the Analyze button.

i After the analysis has been performed, go to the Detector Setup window and export the Detector.plt file with a suitable name for the set of assays being analyzed. The Tm range information is retained as part of the Detector.plt file. When this same set of assays is used to analyze single cells, add the assay information to the chip run by using the Import button to import the Detector.plt file. The Tm ranges selected as part of this qualification run will be automatically applied to the single-cell data.
Figure 15 Export of the Detector.plt file
j After saving the chip run file, go to File > Export to export the Ct (Cq) data to Microsoft Excel as Heat Map Results. The exported file is a comma-delimited text file (.csv file).
Figure 16 Saving as .csv file

3 Determine limit of detection (LOD) Cq value using all assays.
a Open the Heat Map Results file in Excel. Determining the most dilute sample that has positive amplification plots for all six replicates takes advantage of the fact that the Fluidigm Real-Time PCR Analysis software reports a Ct value of 999 for any reaction where a positive amplification plot is not detected.
b For each assay, average the Cq values for the six replicates of each of the 15 dilution samples. If the average value is greater than 30, at least one of the replicates did not have a positive amplification plot. Conditional formatting can be used to highlight instances where the average value is greater than 30, as shown here:
Figure 17 Values greater than 30 highlighted
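The replicate check in step b, together with the median-and-round-up rule that completes this step, can also be sketched outside Excel. The following Python is a minimal illustration and not part of the Fluidigm software. The function names, the data layout (one list of replicate Cq values per dilution, most concentrated first), and the example numbers are assumptions; the example uses three replicates and three dilutions only to stay short.

```python
from math import ceil
from statistics import mean, median

FLAG_ABOVE = 30.0  # an average above this implies at least one 999 replicate

def preliminary_lod(dilution_replicates):
    """Average Cq of the most dilute sample whose replicate average is not flagged.

    dilution_replicates: list of replicate-Cq lists, most concentrated first.
    Returns None if every dilution is flagged.
    """
    averages = [mean(reps) for reps in dilution_replicates]
    passing = [avg for avg in averages if avg <= FLAG_ABOVE]
    return passing[-1] if passing else None

def estimate_lod(assays):
    """Median of the per-assay preliminary LOD values, rounded up to a whole cycle."""
    prelim = [preliminary_lod(d) for d in assays.values()]
    prelim = [p for p in prelim if p is not None]
    return ceil(median(prelim))

# Hypothetical qualification data; 999.0 marks a reaction with no amplification.
assays = {
    "ACTB":   [[18.1, 18.0, 18.2], [19.7, 19.6, 19.8], [21.2, 21.4, 999.0]],
    "GAPDH":  [[19.0, 19.1, 18.9], [20.6, 20.5, 20.7], [22.1, 22.3, 22.2]],
    "GENE_X": [[20.4, 20.5, 20.6], [999.0, 22.0, 22.1], [999.0, 999.0, 23.5]],
}

lod_cq = estimate_lod(assays)
print("Estimated LOD Cq:", lod_cq)
```

A single 999 among six replicates pushes the average well above 30, which is why the average alone is enough to flag a dilution with a missing amplification plot.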

c For each assay, note the most dilute sample that does not have a highlighted value and record that Cq value as the preliminary LOD Cq value for that assay. Then determine the median value for all the preliminary LOD Cq values across all assays:
Figure 18 Median LOD Cq across all the assays
d Round the median value up to the next highest whole cycle to obtain an estimate of the LOD Cq. In this case, the LOD Cq is 22.

Primary Processing of Single-cell Data

1 Run second chip with single-cell samples.
a Flow-sort, process and run single cells on a 96.96 Dynamic Array IFC following ADP 33 - Two-step Single-cell Gene Expression with Fast EvaGreen Supermix for BioMark/BioMark HD. This analysis uses 96 single-cell samples as an example. (It is often useful to include some control samples on the chip, but that will be the topic of subsequent documentation.) Fluidigm ADP 5 - Single-Cell Gene Expression Real-Time PCR Using Dynamic Array IFCs (PN 68000107) or ADP 41 - Single-Cell Gene Expression Using SsoFast EvaGreen Supermix on the BioMark or BioMark HD (PN 100-4109) may also be used.
b In Sample Setup, annotate the sample information.
c In Detector Setup, import the Detector.plt file generated in the assay qualification run. This will bring in the Tm range for each assay.
d The file must be analyzed again to incorporate the sample and assay information as part of the chip run file. To do this:

1 Make sure that the Baseline Correction Method is still set to Linear (Derivative).
2 Make sure the Ct Threshold Method is still set to Auto (Global).
3 Click the Analyze button.
e After saving the chip run file, go to File > Export to export the Ct (Cq) data to Microsoft Excel as Heat Map Results. The exported file is a comma-delimited text file (.csv file).
2 Remove data failed by Real-Time PCR Analysis software.
a Although the Fluidigm Real-Time PCR Analysis software fails reactions with an improper Tm, it does not change the Cq value determined from the amplification plot. Therefore, the first step in Excel is to eliminate Cq values for reactions failed due to Tm.
4 Open Heat Map Results file in Excel.
Figure 19 Heat Map results in Microsoft Excel
5 Copy sample information in A113:B208 to A213:B308.
6 In cell C213, enter formula: =IF(C113="Pass",C13,999).

Figure 20 Microsoft Excel formatting
Figure 21 Microsoft Excel formatting
7 Copy formula in cell C213 to fill matrix C213:CT308.
8 Save file as a .xls or .xlsx file.
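For readers who prefer to script this step, the effect of the formula =IF(C113="Pass",C13,999) can be sketched in Python as follows. This is a minimal illustration under assumptions: the Cq values and the quality calls are taken to be already loaded as two equally shaped lists of rows, and the function name and example values are hypothetical.

```python
NO_CALL = 999.0  # same placeholder the spreadsheet formula writes for failed reactions

def mask_failed(cq_rows, call_rows):
    """Keep a Cq only where the quality call is "Pass"; otherwise use 999."""
    return [
        [cq if call == "Pass" else NO_CALL for cq, call in zip(cqs, calls)]
        for cqs, calls in zip(cq_rows, call_rows)
    ]

# Hypothetical data: two cells (rows) by two assays (columns).
cq = [[15.2, 21.7], [16.0, 18.3]]
calls = [["Pass", "Fail"], ["Pass", "Pass"]]

cleaned = mask_failed(cq, calls)
print(cleaned)
```

Writing 999 instead of deleting the value keeps the matrix rectangular, so the later Log2Ex calculation can treat failed and undetected reactions the same way.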

3 Calculate Log2Ex values. Determine expression of each gene in each cell.
a In cell B311, enter LOD Cq value.
Figure 22 Microsoft Excel formatting B311
b Copy sample information in A213:B308 to A313:B408.
c In cell C313, enter formula: =IF($B$311-C213>0,$B$311-C213,0).
Figure 23 Microsoft Excel formatting C313
d Copy formula in cell C313 to fill matrix C313:CT408.
e Save file.
f At this point, the Log2Ex values can be used for a variety of subsequent analyses. It is often convenient to copy the Log2Ex values to a fresh file using the Paste Values command and annotate with appropriate gene and sample information. This file can usually be imported into secondary analysis software packages to perform multivariate comparisons such as hierarchical clustering or principal component analysis. For these types of multivariate analyses, data is typically mean-centered for each gene and, optionally, auto-scaled per gene as well. These manipulations can usually be done within the secondary analysis software packages.
4 Determine Cells/Gene and Genes/Cell.
a The data can be processed to determine the number of cells where each transcript is detected. To do this, the Log2Ex = 0 values need to be eliminated so that undetected genes are not counted.
b Copy sample information in B313:B408 to A413:A508.
c In cell C413, enter formula: =IF(C313=0,"",C313).

Figure 24 Microsoft Excel formatting C413
d Copy formula in cell C413 to fill matrix C413:CT508.
e In cell C412, enter formula: =COUNT(C413:C508). This will calculate the number of cells where each transcript is detected (Cells/Gene).
Figure 25 Microsoft Excel formatting C412
f Copy formula in cell C412 to fill row C412:CT412.
g In order to calculate the number of genes detected in each cell (Genes/Cell), enter the formula =COUNT(C413:CT413) in cell B413. Copy formula to fill column B413:B508.
Figure 26 Microsoft Excel formatting B413
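The Log2Ex calculation and the two detection counts can be sketched in a few lines of Python. This is an illustration of the spreadsheet logic above, not a replacement for it; the function names and the Cq values are hypothetical, and the LOD Cq of 22 is the example value from the qualification run.

```python
def to_log2ex(cq_rows, lod_cq):
    """Log2Ex = LOD Cq - Cq, with negative results set to 0 (undetected)."""
    return [[max(lod_cq - cq, 0.0) for cq in row] for row in cq_rows]

def cells_per_gene(log2ex_rows):
    """For each gene (column), count the cells with Log2Ex above 0."""
    return [sum(1 for v in col if v > 0) for col in zip(*log2ex_rows)]

def genes_per_cell(log2ex_rows):
    """For each cell (row), count the genes with Log2Ex above 0."""
    return [sum(1 for v in row if v > 0) for row in log2ex_rows]

# Hypothetical Cq matrix: three cells (rows) by three genes (columns).
# 999.0 marks a failed or undetected reaction.
cq = [
    [15.0, 999.0, 20.5],
    [16.5, 23.0, 999.0],
    [14.0, 21.0, 19.0],
]

log2ex = to_log2ex(cq, lod_cq=22.0)
print(log2ex)
print("Cells/Gene:", cells_per_gene(log2ex))
print("Genes/Cell:", genes_per_cell(log2ex))
```

Counting only values above 0 plays the same role as blanking the zeros before applying COUNT in the spreadsheet.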

5 Eliminate cells with low expression from subsequent analysis.
a Use the data in matrix C413:CT508 to calculate the median expression and standard deviation of each gene. To do this, enter the formula =MEDIAN(C413:C508) in cell C409, then copy formula to fill row C409:CT409; and enter the formula =STDEV(C413:C508) in cell C410, then copy formula to fill row C410:CT410.
Figure 27 Microsoft Excel formatting C409
b In cell C411, enter formula =C409-3*C410. Copy to fill row C411:CT411. This is the Cutoff Cq value that is used for culling cells with low expression.
c Going back to the Log2Ex data in matrix C313:CT408, Conditional Formatting (on Home tab) is used to highlight Log2Ex values that are below the Cutoff Cq for two control genes (in this case, ACTB and GAPDH).

Figure 28 Microsoft Excel formatting
Figure 29 Microsoft Excel formatting
Figure 30 Microsoft Excel formatting
d Samples with both control values highlighted can be eliminated from further analysis. In this case, samples B1.A3 and B1.A9 can be eliminated.
Figure 31 Samples to be eliminated
e Save file.
6 Normalize using Median Log2Ex.
a Copy sample information in A413:A508 to A513:A608.
b Copy the data in matrix C413:CT508 and use the Paste Values command to place in matrix C513:CT608.

c In cell B513, enter formula: =IFERROR(MEDIAN(C513:CT513), ). Copy formula to fill column B513:B608.
Figure 32 Microsoft Excel formatting B513
d In cell B609, enter formula: =AVERAGE(B513:B608).
Figure 33 Microsoft Excel formatting B609
e Copy sample information in A513:A608 to A613:A708.
f In cell C613, enter formula =IFERROR(C513-($B513-$B$609),0). Copy formula to fill matrix C613:CT708.
Figure 34 Microsoft Excel formatting C613
g Save file.
h The median normalized values in matrix C613:CT708 can be used just like the Log2Ex values in 2.3. Again, it is convenient to copy the Log2Ex values to a fresh file using the Paste Values command and annotate with appropriate gene and sample information.
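Steps 5 and 6 can be sketched together in Python. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function names and data are hypothetical, culled cells are removed before normalization (the spreadsheet leaves that choice to the analyst), and a cell with no detected genes is simply left unshifted. statistics.stdev is the sample standard deviation, which matches Excel's STDEV.

```python
from statistics import mean, median, stdev

def detected(values):
    """Drop undetected entries (Log2Ex of 0), as the blanked matrix does."""
    return [v for v in values if v > 0]

def gene_cutoffs(rows):
    """Per-gene cutoff: median minus 3 sample standard deviations."""
    cutoffs = []
    for col in zip(*rows):
        d = detected(col)
        cutoffs.append(median(d) - 3 * stdev(d))  # needs at least 2 detected values
    return cutoffs

def low_expression_cells(rows, cutoffs, control_columns):
    """Indices of cells that fall below the cutoff for every control gene."""
    return [i for i, row in enumerate(rows)
            if all(row[j] < cutoffs[j] for j in control_columns)]

def median_normalize(rows):
    """Shift each cell so its median detected Log2Ex equals the average median."""
    medians = [median(detected(r)) if detected(r) else None for r in rows]
    grand = mean([m for m in medians if m is not None])
    normalized = []
    for row, m in zip(rows, medians):
        shift = 0.0 if m is None else m - grand
        # Undetected values stay at 0, like the IFERROR(..., 0) branch.
        normalized.append([v - shift if v > 0 else 0.0 for v in row])
    return normalized

# Hypothetical Log2Ex data: columns are ACTB, GAPDH, GENE_X.
rows = [
    [10.0, 8.0, 3.0],
    [10.4, 8.2, 0.0],
    [9.8, 7.9, 2.0],
    [0.0, 0.0, 1.0],   # both control genes undetected
    [10.2, 8.1, 4.0],
]

cutoffs = gene_cutoffs(rows)
culled = low_expression_cells(rows, cutoffs, control_columns=[0, 1])
kept = [r for i, r in enumerate(rows) if i not in culled]
normalized = median_normalize(kept)
print("Culled cell indices:", culled)
print(normalized)
```

After normalization every remaining cell has the same median detected Log2Ex, which is the intent of subtracting ($B513-$B$609) from each value.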

7 Prepare expression histograms.
a Another analysis that can be done in secondary analysis software is to use the exported Log2Ex (or Median Normalized Log2Ex) data to prepare expression histograms for individual genes. This can also be done in Excel.
b For this example, the Log2Ex data in matrix C313:CT408 have been copied to B2:CS97 in a new sheet using the Paste Values command. Four culled samples (see 2.5) have been deleted, making the matrix B2:CS93. The assay information is in row B1:CS1. The sample information is in column A2:A93.
c In cell B94, enter formula =MIN(B2:B93). Copy formula to fill row B94:CS94. In cell B95, enter formula =MAX(B2:B93). Copy formula to fill row B95:CS95.
Figure 35 Microsoft Excel formatting B94
d In cell A98, enter the whole number value that is just below the minimum value. For this example, the data for ACTB in column B2:B93 is being used. In the column under A98, enter values increasing in increments of 0.5 until the whole number value just above the maximum value is entered.
Figure 36 Microsoft Excel formatting A98

e In the Data tab, select Data Analysis. If the Data Analysis button is not visible, you will need to load the Microsoft Excel Analysis ToolPak. Click the Office button > Excel Options button > Select Add-ins > Select Analysis ToolPak > Click OK.
Figure 37 Microsoft Excel formatting
f Select Histogram.
Figure 38 Histogram selected

g In the Histogram wizard, select Log2Ex values $B$2:$B$93 for Input Range, select min-to-max range $A$98:$A$110 for Bin Range, then select the Output Range button and select cells $B$97:$C$97. Check the box for Chart Output.
Figure 39 Histogram wizard inputs
h Select OK.
Figure 40 Histogram result
i The Histogram chart can now be manipulated and annotated just like any other Excel chart.
j Save file.
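The binning described in step 7 can be sketched in Python as well. This is an illustration only; the function names and the Log2Ex values are hypothetical. It assumes, as the Excel Histogram tool does, that each bin value is an inclusive upper edge, and it adds a final slot for values above the last bin (Excel labels it "More").

```python
from bisect import bisect_left
from math import ceil, floor

def make_bins(values, step=0.5):
    """Bin edges from the whole number at or below the minimum to the
    whole number at or above the maximum, in increments of `step`."""
    lo, hi = floor(min(values)), ceil(max(values))
    n_steps = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n_steps + 1)]

def histogram(values, bins):
    """Count values per bin; a value is counted in the first bin that is
    greater than or equal to it. The last slot collects anything larger."""
    counts = [0] * (len(bins) + 1)
    for v in values:
        counts[bisect_left(bins, v)] += 1
    return counts

# Hypothetical Log2Ex values for one gene across five cells.
values = [7.2, 7.9, 8.0, 8.4, 9.6]

bins = make_bins(values)
counts = histogram(values, bins)
for edge, count in zip(bins, counts):
    print(f"<= {edge}: {count}")
print("More:", counts[-1])
```

Because the first and last edges are whole numbers that bracket the data, the final slot stays empty, matching a bin column that runs from just below the minimum to just above the maximum.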

Appendix 1: Protocol for the Qualification of Assays

For a more detailed protocol, see ADP 41 Single-Cell Gene Expression Using SsoFast EvaGreen SuperMix with Low ROX on the BioMark or BioMark HD System. Please contact Fluidigm Technical Support with any questions at techsupport@fluidigm.com or 866.358.4354.

Preamplification

Mix:
8 µl   2.5 ng/µl Biochain Human Universal cDNA (C4234565-R)
2 µl   500 nM each Preamp Primers (pool of all assays)
10 µl  2X AB TaqMan PreAmp Master Mix (4391128)

Transfer to thermal cycler:
Condition   Temperature   Time
1 Cycle     95ºC          10 minutes
14 Cycles   95ºC          15 seconds
            60ºC          4 minutes
Hold        4ºC

Mix:
2 µl   20 units/µl Exonuclease I (New England BioLabs M0293L)
1 µl   10X Exonuclease I Reaction Buffer
7 µl   H2O

Add 8 µl to the preamplified sample.

Transfer to thermal cycler:
Condition   Temperature   Time
1 Cycle     37ºC          30 minutes
            80ºC          15 minutes
Hold        4ºC

Add 72 µl TE (10 mM Tris, pH 8.0, 1.0 mM EDTA) (TEKnova, PN T0224). Store at -20 ºC.

Prepare 1:2 Dilutions (1 part sample + 2 parts diluent)

Prepare Diluted TE + 0.25% Tween20:
1560 µl  TE
40 µl    10% Tween20
Table 5 Diluted TE and Tween

Dilution 1    4782969   100 µl Diluted preamplification sample from above
Dilution 2    1594323   30 µl Dilution 1 + 60 µl Diluent
Dilution 3    531441    30 µl Dilution 2 + 60 µl Diluent
Dilution 4    177147    30 µl Dilution 3 + 60 µl Diluent
Dilution 5    59049     30 µl Dilution 4 + 60 µl Diluent
Dilution 6    19683     30 µl Dilution 5 + 60 µl Diluent
Dilution 7    6561      30 µl Dilution 6 + 60 µl Diluent
Dilution 8    2187      30 µl Dilution 7 + 60 µl Diluent
Dilution 9    729       30 µl Dilution 8 + 60 µl Diluent
Dilution 10   243       30 µl Dilution 9 + 60 µl Diluent
Dilution 11   81        30 µl Dilution 10 + 60 µl Diluent
Dilution 12   27        30 µl Dilution 11 + 60 µl Diluent
Dilution 13   9         30 µl Dilution 12 + 60 µl Diluent
Dilution 14   3         30 µl Dilution 13 + 60 µl Diluent
Dilution 15   1         30 µl Dilution 14 + 60 µl Diluent
NTC                     60 µl Diluent + 60 µl Diluent
Table 6 Dilution table

These dilutions are made in 1.5 ml tubes with vortexing and centrifugation after each dilution. Samples are then transferred to a 96-well plate for ease of loading into IFCs. Store at -20 ºC.

qPCR Detection

Prime chip.

Mix:
420 µl  2X SsoFast EvaGreen Supermix with Low ROX
42 µl   20X DNA Binding Dye Sample Loading Reagent
18 µl   H2O

Add 20 µl to each well of 16 wells, the first two columns of the 96-well plate.
Add 15 µl of Diluted sample to each well. Vortex gently and centrifuge.
Dispense 5 µl of DELTAgene Assays to Detector Inlets of 96.96 array.

Dispense 6 × 5 µl of each Dilution sample + SsoFast MM to Sample Inlets of 96.96 array.
Load chip.
Run GE Fast 96x96 PCR+Melt v2.pcl

Segment   Type              Temperature (ºC)   Duration (seconds)   BioMark HD Ramp Rate (ºC/s)
1         Thermal Mix       70                 2400                 5.5
                            60                 30                   5.5
2         Hot Start         95                 60                   5.5
3         PCR (30 Cycles)   96                 5                    5.5
                            60                 20                   5.5
4         Melting Curve     60-95                                   1ºC / 3 seconds

Analyze the data using the Linear (Derivative) Baseline Correction Method and the Auto (Global) Ct Threshold Method. (For probe-based assays, use the Auto (Detector) Ct Threshold Method.)

References

Aguilo, F., S. Avagyan, A. Labar, A. Sevilla, D. F. Lee, P. Kumar, I. R. Lemischka, B. Y. Zhou, and H. W. Snoeck (2011) Prdm16 is a physiologic regulator of hematopoietic stem cells, Blood 117:5057-5066.
Bengtsson, M., A. Ståhlberg, P. Rorsman, and M. Kubista (2005) Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels, Genome Research 15:1388-1392.
Chubb, J. R., T. Trcek, S. M. Shenoy, and R. H. Singer (2006) Transcriptional pulsing of a developmental gene, Current Biology 16:1018-1025.
Dalerba, P. et al. (2011) Single-cell dissection of transcriptional heterogeneity in human colon tumors, Nat Biotechnol 29:1120-1127.
Devonshire, A. S., R. Elaswarapu, and C. A. Foy (2011) Applicability of RNA standards for evaluating RT-qPCR assays and platforms, BMC Genomics 12:118-127.
Diehn, M. et al. (2009) Association of reactive oxygen species levels and radioresistance in cancer stem cells, Nature 458:780-783.
Flatz, L. et al. (2011) Single-cell gene-expression profiling reveals qualitatively distinct CD8 T cells elicited by different gene-based vaccines, Proc Natl Acad Sci USA 108:5724-5729.
Guo, G., M. Huss, G. Q. Tong, C. Wang, L. L. Sun, N. E. Clarke, and P. Robson (2010) Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst, Developmental Cell 18:675-685.
Livak, K. J. and T. D. Schmittgen (2001) Analysis of relative gene expression data using real-time quantitative PCR and the 2^(-ΔΔCT) method, Methods 25:402-408.
Pang, Z. P. et al. (2011) Induction of human neuronal cells by defined transcription factors, Nature 476:220-223.
Raj, A., C. S. Peskin, D. Tranchina, D. Y. Vargas, and S. Tyagi (2006) Stochastic mRNA synthesis in mammalian cells, PLoS Biol 4:e309.
Vandesompele, J., K. De Preter, F. Pattyn, B. Poppe, N. Van Roy, A. De Paepe, and F. Speleman (2002) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes, Genome Biology 3:research0034.1-research0034.11.
Vincent, J. J. et al. (2011) Single cell analysis facilitates staging of Blimp1-dependent primordial germ cells derived from mouse embryonic stem cells, PLoS ONE 6:e28960.

World Headquarters
7000 Shoreline Court, Suite 100
South San Francisco, CA 94080 USA
Tel: 650-266-6000
Fax: 650-871-7152

Fluidigm Europe, BV
Parnassustoren
Locatellikade 1, 1076 AZ Amsterdam
Netherlands
Tel: +33 (1) 60 92 42 40
Fax: +31 (0) 20 203 1111

Fluidigm Japan KK
Level 5, Ginza TK Building
1-1-7 Shintomi
Chuo-ku, Tokyo 104-0041 Japan
Office: +81335552351
Fax: +8133552353

Fluidigm Singapore PTE LTD
Block 1026 Tai Seng Avenue #07-3532
Singapore 534413
Office: +6568587316
Fax: +6562825531

Technical Support
Send email to: TechSupport@fluidigm.com
Phone in United States: 1.866.FLUIDLINE (1.866.358.4354)
Outside the United States: 650.266.6100
On the Internet: www.fluidigm.com/support

Visit our website at www.fluidigm.com

PN 100-5066, Rev. A1
Fluidigm Corporation