Deep profiling of multitube flow cytometry data Supplemental information

Kieran O'Neill et al. December 19, 2014

Table S1: Markers in simulated multitube data. The data was split into three tubes, each containing CD3, CD4 and CD8 in addition to FSC and SSC. The remaining nine markers were distributed across the tubes, three per tube.

Marker type                  Tube 1    Tube 2    Tube 3
Common (scatter)             FSC       FSC       FSC
Common (scatter)             SSC       SSC       SSC
Common (fluorescent)         CD3       CD3       CD3
Common (fluorescent)         CD4       CD4       CD4
Common (fluorescent)         CD8       CD8       CD8
Phenotyping (fluorescent)    KI67      CD57      CD27
Phenotyping (fluorescent)    CD28      CCR5      CCR7
Phenotyping (fluorescent)    CD45RO    CD19      CD127
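The distinction between common and phenotyping markers in Table S1 can be illustrated with a minimal sketch. The marker names come from the table; the `TUBES` dictionary and the `split_markers` helper are hypothetical names introduced only for this illustration and are not part of flowBin.

```python
# Simulated three-tube panel, transcribed from Table S1.
TUBES = {
    "Tube 1": ["FSC", "SSC", "CD3", "CD4", "CD8", "KI67", "CD28", "CD45RO"],
    "Tube 2": ["FSC", "SSC", "CD3", "CD4", "CD8", "CD57", "CCR5", "CD19"],
    "Tube 3": ["FSC", "SSC", "CD3", "CD4", "CD8", "CD27", "CCR7", "CD127"],
}


def split_markers(tubes):
    """Return (common markers, phenotyping markers per tube).

    A marker is "common" if it is present in every tube; every other
    marker is treated as a phenotyping marker of the tube it appears in.
    """
    common_set = set.intersection(*(set(m) for m in tubes.values()))
    first_tube = next(iter(tubes.values()))
    # Preserve the panel order of the first tube for readability.
    common = [m for m in first_tube if m in common_set]
    phenotyping = {
        name: [m for m in markers if m not in common_set]
        for name, markers in tubes.items()
    }
    return common, phenotyping


common, phenotyping = split_markers(TUBES)
print(common)
print(phenotyping)
```

The common markers are the ones available for normalisation and binning across tubes, while the nine phenotyping markers are the ones whose expression has to be combined.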

Figure S1: Overview of the flowBin pipeline, applied to one multitube sample. 1) FCM data from the individual aliquot tubes are quantile normalised in terms of the common population markers present in every tube. 2) The tubes are then binned in terms of these population markers, using either K-means or flowFP. 3) The bins from the first tube are mapped to the other tubes (by nearest-neighbour mapping for K-means bins, or directly for flowFP bins). 4) The expression of each bin in terms of each phenotyping marker (the markers that differ across tubes) is measured. This may be done by taking the median fluorescence intensity, the normalised median fluorescence intensity, or the proportion of cells exceeding the 98th percentile of a negative control. The final result is a high-dimensional matrix containing the expression level of each bin for each unique marker.
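Steps 3 and 4 can be sketched as follows. This is a simplified illustration, not the flowBin implementation: it assumes bin centres have already been found on the first tube (for example by K-means), reads "nearest-neighbour mapping" as assigning each event of a tube to its closest centre in population-marker space, and uses the median as the expression measure. The function names and the toy values are hypothetical.

```python
from statistics import median


def nearest_bin(event, centres):
    """Index of the centre closest (squared Euclidean) to one event."""
    return min(
        range(len(centres)),
        key=lambda i: sum((e - c) ** 2 for e, c in zip(event, centres[i])),
    )


def bin_expression(pop_values, pheno_values, centres):
    """Median phenotyping-marker intensity per bin for one tube.

    pop_values:   population-marker values, one tuple per event
    pheno_values: one phenotyping-marker value per event
    centres:      bin centres in population-marker space
    Bins that receive no events are reported as None.
    """
    per_bin = [[] for _ in centres]
    for pop, pheno in zip(pop_values, pheno_values):
        per_bin[nearest_bin(pop, centres)].append(pheno)
    return [median(v) if v else None for v in per_bin]


# Toy data: two bins, five events, one phenotyping marker.
centres = [(0.0, 0.0), (10.0, 10.0)]
tube_pop = [(0.5, -0.2), (9.0, 11.0), (1.0, 0.3), (10.5, 9.5), (0.1, 0.1)]
tube_pheno = [1.0, 7.0, 2.0, 9.0, 3.0]
expr = bin_expression(tube_pop, tube_pheno, centres)
print(expr)
```

Repeating this for every tube and every phenotyping marker, with the same centres, would give one column per marker and one row per bin, which is the shape of the matrix described in the caption.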

[Figure S2 flowchart: bone marrow → aliquot, stain, run flow cytometry → FCS data (129 patients, 7-10 tubes/patient, 20,000 cells, 3-4 markers/tube) → combine tubes using flowBin → cell clusters (128 clusters/patient, 17 markers each) → type clusters using flowType (1:6-combinations) → cell type proportions (616,285 cell types per patient) → Wilcoxon rank sum vs patient NPM1 with Holm correction → 801 cell types associated with NPM1. Markers shown: FSC, SSC, CD45, HLA-DR, CD13, CD34, CD20, CD19, CD10, CD61, CD56, CD33, CD64, CD117, CD14, CD7, CD2, CD4, CD3.]

Figure S2: Pipeline used to determine NPM1-associated immunophenotypes in AML. Steps taken are denoted by arrows, while the data consumed or produced are indicated in boxes. FCM was performed in the clinic historically; all other steps were computational. The end result was a list of 801 cell types that showed a significant difference in abundance between NPM1-mutated and wild-type patients.
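The Holm correction named in the last step of the pipeline can be sketched in a few lines. This is a generic illustration of the Holm step-down adjustment, not the authors' code; `holm_adjust` and the example P-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P-values, returned in the input order.

    The smallest P-value is multiplied by m, the next by m - 1, and so
    on; a running maximum keeps the adjusted values monotone, and each
    value is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up raw P-values, one per cell type
adj = holm_adjust(raw)
print(adj)  # approximately [0.03, 0.06, 0.06]
```

With hundreds of thousands of cell types tested per comparison, the multiplier applied to the smallest P-values is very large, so only strong differences in abundance survive the correction.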

Figure S3: One-, two- and three-dimensional representations of quantile normalisation of population markers. Empirical cumulative distribution function (ECDF) plots are shown for all tubes for forward scatter (FS), the most variant marker. Following normalisation, the ECDFs of all tubes are identical, as expected from quantile normalisation. Two-dimensional scatter plots for representative tubes show the improvement in two-dimensional registration. Lastly, flowFP plots show the improvement in three-dimensional registration, measured by the standard deviation of the number of cells falling within each bin after bins have been fitted to the consensus of all tubes.
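The reason the ECDFs coincide after normalisation can be seen in a minimal sketch of quantile normalisation for a single marker. It assumes, for simplicity, that every tube contains the same number of events; real tubes differ in size and would need interpolation between quantiles. The function name and values are hypothetical.

```python
def quantile_normalise(tubes):
    """Quantile normalise one marker across tubes.

    tubes: list of equal-length lists of marker values, one per tube.
    Each value is replaced by the mean, across tubes, of the values
    holding the same rank. Ties are broken by event order.
    """
    n = len(tubes[0])
    if any(len(t) != n for t in tubes):
        raise ValueError("this sketch assumes equal event counts per tube")
    sorted_tubes = [sorted(t) for t in tubes]
    # Reference distribution: mean of the k-th smallest values.
    reference = [sum(col) / len(tubes) for col in zip(*sorted_tubes)]
    normalised = []
    for t in tubes:
        order = sorted(range(n), key=lambda i: t[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised


fsc = [[5.0, 1.0, 3.0], [4.0, 8.0, 6.0]]  # toy forward-scatter values
norm = quantile_normalise(fsc)
print(norm)
```

After this step every tube holds exactly the same set of values for the marker, only assigned to different events, so their sorted values (and hence their ECDFs) are identical.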

Figure S4: The two options for binning within flowBin, K-means and flowFP, as applied to a 7-tube sample. a. and b. show comparisons between the bin labels themselves. K-means creates roughly spherical bins, which conform to the location of cell populations. flowFP creates grid-like bins, which may not conform to the true underlying shape of cell populations. c. shows the number of cells per bin across all tubes, for every bin. flowFP shows approximately the same variation in bin density across tubes as K-means (mean SD: 24.6 vs 28.5). However, flowFP gives a much more nearly constant number of cells per bin across bins (SD of means: 0.07 vs 255).
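The two summary statistics quoted for panel c can be computed as in the sketch below. The reading used here is an assumption: "mean SD" is taken as the mean over bins of the standard deviation of cell counts across tubes, and "SD of means" as the standard deviation over bins of the mean count across tubes, both with the sample (n - 1) standard deviation. The counts are made up.

```python
from statistics import mean, stdev


def bin_count_summary(counts):
    """Summarise a table of cell counts.

    counts[b][t] is the number of cells in bin b for tube t.
    Returns (mean over bins of the SD across tubes,
             SD over bins of the mean across tubes).
    """
    sd_across_tubes = [stdev(row) for row in counts]
    mean_across_tubes = [mean(row) for row in counts]
    return mean(sd_across_tubes), stdev(mean_across_tubes)


# Toy table: three bins (rows) by three tubes (columns).
counts = [
    [100, 104, 96],
    [100, 98, 102],
    [100, 101, 99],
]
mean_sd, sd_of_means = bin_count_summary(counts)
print(mean_sd, sd_of_means)
```

In this toy table every bin holds 100 cells on average, so the SD of means is zero, which is the behaviour expected of equal-count bins; the mean SD reflects only tube-to-tube variation.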

Figure S5: Comparison between nearest-neighbours merging and flowBin for two tubes computationally sampled from a real data set. a. Raw data (compensated, transformed and filtered for debris), gated for CD3+ cells, showing the true CD4 and CD8 distribution. b. The two sampled tubes, one containing CD4 and the other CD8. The CD4+ population has slightly higher average CD3 than the CD8+ population, but the two have substantially overlapping CD3 distributions. c. Results of merging by nearest neighbours and by flowBin, including the proportion of resulting cells falling within each quadrant. Nearest-neighbours merging created a substantial CD4+CD8+ population not present in the original sample. Both nearest neighbours and flowBin slightly overestimate the CD4-CD8- population. flowBin is more accurate than nearest neighbours at reproducing the CD4+CD8- and CD4-CD8+ populations. d. and e. This analysis was repeated 100 times for each number of bins, with a separate sampling of 5,000 events each time. d. Representative results (those with median RMSD) for selected numbers of bins. e. All results for all numbers of bins and for NN merging. The best result (lowest RMSD) was for 128 bins, after which increasing the number of bins caused the RMSD to tend towards that of NN.
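The quadrant comparison can be sketched as follows. The caption does not spell out what the RMSD is computed over; this sketch assumes it is the root-mean-square deviation between true and merged quadrant proportions. The gate thresholds, function names and events are all hypothetical.

```python
from math import sqrt


def quadrant_proportions(events, cd4_cut, cd8_cut):
    """Proportion of events in each CD4/CD8 quadrant.

    events: list of (cd4, cd8) intensity pairs.
    A marker counts as positive when its value exceeds the cut-off.
    """
    counts = {"CD4+CD8+": 0, "CD4+CD8-": 0, "CD4-CD8+": 0, "CD4-CD8-": 0}
    for cd4, cd8 in events:
        key = ("CD4+" if cd4 > cd4_cut else "CD4-") + (
            "CD8+" if cd8 > cd8_cut else "CD8-"
        )
        counts[key] += 1
    n = len(events)
    return {k: v / n for k, v in counts.items()}


def rmsd(truth, estimate):
    """Root-mean-square deviation over the four quadrant proportions."""
    return sqrt(sum((truth[k] - estimate[k]) ** 2 for k in truth) / len(truth))


true_events = [(5, 0), (5, 0), (0, 5), (0, 0)]
# A merge that wrongly pairs one CD4+ event with a CD8+ value.
merged_events = [(5, 0), (5, 5), (0, 5), (0, 0)]
p_true = quadrant_proportions(true_events, 1, 1)
p_merged = quadrant_proportions(merged_events, 1, 1)
error = rmsd(p_true, p_merged)
print(p_merged, error)
```

The spurious double-positive event in the toy merge is the kind of error the caption attributes to nearest-neighbours merging: it raises the CD4+CD8+ proportion at the expense of a single-positive quadrant.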

Figure S6: nu-SVM separation of normal and abnormal cell populations in AML samples. a. Heatmap of all populations within the AML samples that were predicted to be normal. Most can readily be identified as having the properties of common blood and bone marrow cell populations: myeloid cells expressing CD16 and/or CD64, lymphoid cells (dominated by CD3-expressing T-lymphocytes/precursors), and erythroid cells not expressing any of the markers in the panel, including CD45. b. Heatmap of all populations predicted to be abnormal. In contrast to the cells predicted to be normal, many of these express CD34 and CD117, primitive markers typical of stem cells and of AML.

[Figure S7 diagram: a. training data (per-population marker values for patient 1 and patient 2) and training classes (all populations of patient 1 labelled AML, all populations of patient 2 labelled healthy) → classifier training algorithm → classifier. b. test data (all bins from one patient) → classifier → predicted classes per population (AML, AML, healthy) → take vote → predicted class for the patient (AML).]

Figure S7: Schema for a voting classifier using flowBin output. a. Training. Every bin from every patient is treated as an individual measurement, labelled with the class of the patient. A classifier is then trained on the entire dataset at once (all bins from all patients). b. Prediction. To predict the class of a new patient, the trained classifier makes a prediction for every bin from that patient. The majority vote of these predictions is then taken as the overall prediction for the patient.
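The prediction step of Figure S7 can be sketched as below. The per-bin classifier is passed in as a function; `toy_classifier` is a hypothetical stand-in (a simple threshold on the mean marker value) used only to make the example runnable, and the bin values are made up.

```python
from collections import Counter


def predict_patient(bins, classify_bin):
    """Classify every bin, then take the majority vote for the patient.

    bins:         list of per-bin feature tuples for one patient
    classify_bin: function mapping one bin's features to a class label
    Returns (winning label, vote counts). On a tie, the label that was
    seen first among the tied labels wins.
    """
    votes = Counter(classify_bin(b) for b in bins)
    label, _ = votes.most_common(1)[0]
    return label, dict(votes)


def toy_classifier(bin_features):
    """Hypothetical stand-in for a trained per-bin classifier."""
    return "AML" if sum(bin_features) / len(bin_features) > 0.5 else "healthy"


patient_bins = [(0.9, 0.8), (0.7, 0.9), (0.1, 0.2)]  # made-up bin values
label, votes = predict_patient(patient_bins, toy_classifier)
print(label, votes)
```

Because every bin inherits its patient's label during training, a patient with AML also contributes bins of normal cells labelled AML; the vote over many bins is what makes the patient-level prediction tolerant of such per-bin label noise.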

[Figure S8 diagram: a. training data and training classes → subsample → sample 1 and sample 2 (each with its own data and classes) → classifier training algorithm → classifier 1 and classifier 2. b. test data → each classifier → predicted classes per bin → take vote → final classes per bin → take vote → patient class.]

Figure S8: Schema for a voting classifier for flowBin output incorporating balanced bagging. a. Training. This is similar to the base classifier (Fig. S7), except that multiple classifiers are trained, each on a bootstrap subsample of patients. Each bootstrap sample is set to contain equal numbers of patients from each class. b. Prediction. To predict the class of a new patient, each of the trained classifiers makes a prediction for each bin from that patient. The final prediction for each bin is the majority vote of those predictions. The prediction for the patient is then the majority vote of the per-bin predictions.
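The balanced bootstrap and the two-level vote can be sketched as follows. Several details are assumptions made for illustration: each class is sampled with replacement down to the size of the smallest class, the trained classifiers are represented by plain functions, and `make_threshold_classifier` is a hypothetical stand-in for a trained model.

```python
import random
from collections import Counter


def balanced_bootstrap(patients_by_class, rng):
    """Draw, with replacement, equally many patients from each class.

    The per-class sample size is the size of the smallest class, which
    is an assumption of this sketch.
    """
    n = min(len(p) for p in patients_by_class.values())
    return {
        label: [rng.choice(patients) for _ in range(n)]
        for label, patients in patients_by_class.items()
    }


def majority(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def predict_patient_bagged(bins, classifiers):
    """Vote across classifiers per bin, then across bins per patient."""
    per_bin = [majority([clf(b) for clf in classifiers]) for b in bins]
    return majority(per_bin), per_bin


def make_threshold_classifier(threshold):
    """Hypothetical stand-in for one trained per-bin classifier."""
    return lambda b: "AML" if sum(b) / len(b) > threshold else "healthy"


rng = random.Random(0)
cohort = {"AML": ["a1", "a2", "a3", "a4"], "healthy": ["h1", "h2"]}
sample = balanced_bootstrap(cohort, rng)

classifiers = [make_threshold_classifier(t) for t in (0.3, 0.5, 0.7)]
patient_class, per_bin = predict_patient_bagged(
    [(0.6,), (0.4,), (0.8,)], classifiers
)
print(sample, patient_class, per_bin)
```

Balancing each bootstrap sample keeps a classifier from simply favouring the larger class, and training several classifiers on different samples averages out the effect of any one draw.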

[Figure S9 graph: RchyOptimyx hierarchy running from All Cells and CD34+ down to cell types defined over CD34, CD61, CD14, CD10, CD20 and CD3, with a -log10(p-value) scale from 1 to 9.]

Figure S9: An example of RchyOptimyx analysis of one cluster of cell types. As 801 cell types are too many to visualise meaningfully with RchyOptimyx, we clustered the cell types and visualised each cluster in turn. In this example, the addition of CD10- or CD20- makes little difference to the P-value of the cell type CD34+CD61-CD14-. As this was a general trend and in line with reported AML biology, we chose to exclude cell types defined over these markers from further analysis.

[Figure S10 plots: panels a. to d., each showing the proportion of all cells (0 to 1) in wt and mt patients. Cell types plotted, with P-values: CD34- (P>0.05); CD34-CD13+ (P=0.00221); CD34-CD33+ (P=0.00225); CD13+CD34-CD33+ (P=0.00138); CD34-CD2- (P=0.0265); CD34-CD2-CD4+ (P=0.000149); CD13+CD34-CD2-CD4+ (P=1.94e-06); CD34+CD61-CD14- (P=0.015); CD34+CD61-CD14-CD2+ (P=0.000114); CD34+CD61-CD2+CD4- (P=0.00548); HLA+CD34+CD33-CD64- (P=0.0235); HLA+CD34+CD4-CD64- (P=0.0119).]

Figure S10: Selected classes of cell types showing significant differences in abundance between NPM1-mt and NPM1-wt. P-values are given after Holm correction. a. Gating for the presence of the myeloid lineage markers CD13 and CD33 within the CD34- compartment yields much stronger differences in abundance between NPM1-wt and NPM1-mt than CD34- alone. b. Gating for CD2- within the CD34- compartment yields a slightly better separation than CD34- alone, but gating down further to CD4- and CD13+ yields a cell type that, while present in most NPM1-mt, is absent or below 20% abundance in nearly all NPM1-wt. c. Gating for CD61- and CD14- within the CD34+ compartment leads to a cell type that is common in NPM1-wt but almost entirely absent in NPM1-mt. d. Gating for HLA-DR+ and CD64- within the CD34+ compartment leads to a cell type that occurs in a subset of NPM1-wt but is entirely absent in NPM1-mt.