SPADE is an algorithm available in Cytobank that takes multi-parameter data, performs clustering, and represents the clustered data as a two-dimensional minimum spanning tree of connected clusters. Running SPADE requires only a small number of configuration steps. This article gives an overview of those steps and the concepts to consider when setting up the analysis. Click the links below to jump to any section in this article.
- Target Number of Nodes
- Downsampling Target
- Select a Population
- Select Clustering Channels
- Fold Change Configurations
- Scale Settings and SPADE
- Include and Exclude Samples from a SPADE Run
- Adjust the Number of Events in the Analysis
The target number of nodes for a SPADE run determines how many clusters will be present in the results. Choosing the right number presents a sort of "Goldilocks problem". Setting the target number of clusters lower simplifies the tree but increases the chance that a rare or subtle population is merged into a cluster with a cell population that is actually different. This effect can be referred to as underclustering. The alternative is to set the target number of clusters higher. A larger number of clusters increases the chance of correctly isolating a rare or subtle population, but also increases the amount of noise in the results, since homogeneous abundant populations will be split across many clusters. This effect can be referred to as overclustering. Erring toward overclustering is likely preferable in most cases because it can be mitigated during analysis by manually consolidating similar nodes within bubbles, whereas the consequences of underclustering can be hard or impossible to detect. The aim is therefore a middle road that provides enough clusters for rare or subtle populations without excessively overclustering the dataset. A useful value for the target number of clusters will depend on preexisting knowledge of the complexity of the dataset, the number of channels in the data files, the needs of the researcher, and empirical evaluation of results from multiple runs on the same dataset.
Downsampling in the context of SPADE refers to density-dependent downsampling. This routine operates on the data passed to SPADE before clustering. Density-dependent downsampling detects regions of high density within a dataset and removes events from them in order to normalize the density across the dataset. The overall effect is that the structure and distribution of the data remain consistent while regions of high abundance (and therefore redundancy) are thinned out. This redistribution makes it more likely that rare cell types form their own clusters instead of being outnumbered by abundant cell types.
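The idea of density-dependent downsampling can be illustrated with a small sketch. This is not Cytobank's implementation: how SPADE measures density and which thresholds it applies are not described in this article, so the neighborhood `radius`, the `target_density`, and both function names below are hypothetical choices made only for illustration. The sketch keeps each event with a probability that shrinks as its local density grows.

```python
import math
import random

def local_densities(events, radius):
    """For each event, count the events within `radius` of it (itself included)."""
    return [
        sum(1 for other in events if math.dist(event, other) <= radius)
        for event in events
    ]

def density_dependent_downsample(events, radius, target_density, seed=0):
    """Keep each event with probability min(1, target_density / local_density).

    Assumption: this keep rule is an illustrative choice, not Cytobank's code.
    """
    rng = random.Random(seed)
    kept = []
    for event, density in zip(events, local_densities(events, radius)):
        keep_probability = min(1.0, target_density / density)
        if rng.random() < keep_probability:
            kept.append(event)
    return kept

# Toy two-channel data: an abundant population (500 events) and a rare one (20 events).
data_rng = random.Random(42)
abundant = [(data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 0.3)) for _ in range(500)]
rare = [(data_rng.gauss(5.0, 0.3), data_rng.gauss(5.0, 0.3)) for _ in range(20)]
events = abundant + rare

kept = density_dependent_downsample(events, radius=0.5, target_density=20)
kept_abundant = [e for e in kept if e[0] <= 2.5]
kept_rare = [e for e in kept if e[0] > 2.5]
print("abundant kept:", len(kept_abundant), "rare kept:", len(kept_rare))
```

In this toy example the rare population never exceeds the target density, so all of its events survive, while the dense core of the abundant population is thinned. The rare events therefore make up a larger share of the data that reaches clustering.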
Downsampling is done on a per-file basis according to a percentage value or absolute number set by the researcher. When a percentage is used, each file is downsampled until the percentage of events remaining in the file equals the target. For example, a file with 1000 events and a downsampling target of 10% would have 900 events removed, leaving 100 events after downsampling. With an absolute number target, the file is downsampled until that number of events remains. Note that the event counts referred to in this paragraph are the counts after events have been filtered according to the gates for the chosen population, not the raw counts in the file. Event/gate filtering happens before downsampling.
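The arithmetic of the two target types can be written out as a small helper. The function name `events_remaining` and its parameters are hypothetical, and two details are assumptions not stated in this article: that fractional results are rounded, and that a file with fewer events than an absolute target simply keeps all of its events.

```python
def events_remaining(n_events, percent_target=None, absolute_target=None):
    """Return how many events remain in one file after downsampling.

    `n_events` is the event count after gate filtering for the chosen population.
    Exactly one of `percent_target` (0-100) or `absolute_target` must be given.
    """
    if (percent_target is None) == (absolute_target is None):
        raise ValueError("give exactly one of percent_target or absolute_target")
    if percent_target is not None:
        # Assumption: fractional event counts are rounded to the nearest integer.
        return round(n_events * percent_target / 100)
    # Assumption: a file smaller than the absolute target keeps all its events.
    return min(n_events, absolute_target)

remaining = events_remaining(1000, percent_target=10)
print(remaining, "remaining,", 1000 - remaining, "removed")
```

Because the percentage applies per file, files with very different event counts contribute very different numbers of events, whereas an absolute target evens out the contribution of each file.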
Choose a population on which to run SPADE from the Population box on the SPADE setup page. This box is filled with the populations created in the Gating Interface of this Experiment. Only one population can be chosen for a given SPADE run, and only the events within the selected population, across all samples in the experiment, will be passed on to the SPADE algorithm for analysis. Note that the events pulled from the selected files will adhere to any gate tailoring done in the experiment. If populations are missing from the list, make sure the most recent changes made in the gating interface have been applied to the experiment.
Which population to choose for the SPADE run depends on the analysis goal. Researchers interested in analyzing all available phenotypes should run SPADE on a high-level population that has had basic cleanup gating, such as CD45+ viable singlets. This strategy is very typical. Researchers looking to probe a smaller compartment (e.g., T cells only) would instead run SPADE on a population further downstream in the gating hierarchy, such as CD3+ cells.
Which channels are chosen for SPADE analysis depends on the goals of the analysis. In most cases, the goal of a SPADE analysis is to automatically categorize events of interest into phenotype clusters for final categorization and analysis. If this is the case, then SPADE will need any channels that can be used to identify phenotypically distinct cell subsets. This is generally CD markers and non-CD phenotyping markers such as HLA-DR, IgM, CCR7, etc.
Signaling markers or other functional or dynamic markers can be chosen for clustering in SPADE as well. All the same principles of running and analyzing apply, but the nuances of this approach and the intended analysis strategy should be understood before running SPADE on these markers. Most of the time, signaling markers will not be selected for clustering and will instead be analyzed after the SPADE tree is created.
A common mistake in choosing channels for a SPADE analysis is including linearly scaled channels among an otherwise non-linearly scaled collection of channels. Examples of channels that are often linearly scaled are forward scatter in fluorescence cytometry and cell_length or event_length in mass cytometry (CyTOF). Other channels can be linear as well, so scaling in general should be checked before a SPADE analysis. To learn more about the effect of scale settings on SPADE results, consult the scales section within this article.
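A toy calculation suggests why a linearly scaled channel can be a problem. The sketch below assumes a Euclidean distance between events purely for illustration (this article does not state which distance SPADE uses), and the channel names and values are invented. Because the linear channel spans tens of thousands of units while the transformed marker channels span only a few, it dominates the distance between events.

```python
import math

def distance(x, y, channels):
    """Euclidean distance between two events over the given channels."""
    return math.sqrt(sum((x[c] - y[c]) ** 2 for c in channels))

# Hypothetical events: marker values on a transformed scale, FSC-A on a linear scale.
t_cell_small = {"CD3": 5.0, "CD19": 0.2, "FSC-A": 60000.0}
b_cell_small = {"CD3": 0.3, "CD19": 4.8, "FSC-A": 61000.0}
t_cell_large = {"CD3": 5.1, "CD19": 0.1, "FSC-A": 90000.0}

markers = ["CD3", "CD19"]
with_linear = markers + ["FSC-A"]

# With FSC-A included, the two T cells look far apart and the B cell looks close.
print(distance(t_cell_small, b_cell_small, with_linear))
print(distance(t_cell_small, t_cell_large, with_linear))

# With only the transformed markers, the two T cells are nearest neighbors.
print(distance(t_cell_small, b_cell_small, markers))
print(distance(t_cell_small, t_cell_large, markers))
```

With the linear channel included, events end up grouped mostly by scatter value rather than by phenotype, which is the opposite of what the marker channels alone would suggest.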
Another common mistake seen in choosing channels for SPADE is choosing channels that don't contribute to the interpretable categorization of events, but will still affect the analysis results. A good example of this is the time channel. Choosing the time channel in a SPADE analysis will often create nonsensical results. Time itself does not help identify a cell population and thus is not useful for categorization by SPADE. Note that time is often useful to gate out areas of the data that were not captured with high fidelity, but this manual gating step should be done before the SPADE analysis (read more about time gates).
Note: the channels available to cluster on for a SPADE analysis must be common to all files in the experiment. If multiple files have different sets of channels, only those channels common to all files will be available for selection in SPADE. Despite not being able to cluster on channels unique to certain files, the data for these channels will still be present for statistics and visualization after the SPADE run is complete.
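The rule that only shared channels can be selected amounts to a set intersection over the channel lists of the files. The following sketch uses invented file names and channel lists, and `common_channels` is a hypothetical helper rather than part of Cytobank.

```python
def common_channels(files):
    """Return the channels present in every file, in the order of the first file."""
    if not files:
        return []
    shared = set.intersection(*(set(channels) for channels in files.values()))
    first_file_channels = next(iter(files.values()))
    return [channel for channel in first_file_channels if channel in shared]

# Hypothetical experiment with three files and partly different panels.
files = {
    "sample_A.fcs": ["CD45", "CD3", "CD4", "CD8", "pSTAT5"],
    "sample_B.fcs": ["CD45", "CD3", "CD4", "CD8", "CD19"],
    "sample_C.fcs": ["CD45", "CD3", "CD4", "CD8"],
}

selectable = common_channels(files)
print(selectable)
```

Here pSTAT5 and CD19 would not be offered for clustering because each is missing from at least one file, although their data would remain available afterwards for the files that contain them.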
Fold change analysis between samples using SPADE is a powerful workflow. Read about it in a dedicated article: Fold Change with SPADE.
It is essential to set up the scales correctly before starting a run. For scale settings and SPADE, the general rule is that "SPADE sees what you see" based on the scale settings in the experiment. The exception to this rule is that SPADE ignores the scale min and max: if data are piled up on the edge of a plot, SPADE does not see them that way, but sees them in their natural continuum, uninterrupted by the scale min and max. The one scale setting affecting SPADE is the scale argument of arcsinh scales. In general, if data are scaled appropriately for normal analysis involving manual gating, then the scale settings are fine for SPADE analysis. Please refer to the scaling support article and the blog post on how to scale cytometry data effectively for more details.
Note that log-based scales may cause SPADE to fail, because the logarithm of zero or of a negative number is undefined and both kinds of values are common in cytometry data. For log-like scaling that handles negative numbers and zero, please use arcsinh scales. Learn more about scale settings.
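The difference between the two transforms is easy to see on a few raw values. This is a minimal sketch using Python's standard `math` module; the function names are hypothetical, the transform is written as arcsinh(value / scale argument), and the scale argument of 5 is only an assumed example value, not a recommendation from this article.

```python
import math

def arcsinh_scale(value, scale_argument):
    """Arcsinh transform: defined for negative values, zero, and positive values."""
    return math.asinh(value / scale_argument)

def log_scale(value):
    """Log10 transform: raises ValueError for zero and negative values."""
    return math.log10(value)

raw_values = [-50.0, 0.0, 5.0, 1000.0]
for value in raw_values:
    try:
        logged = log_scale(value)
    except ValueError:
        logged = None  # the logarithm is undefined for this value
    print(value, logged, arcsinh_scale(value, scale_argument=5))
```

For large positive values the arcsinh transform grows like a logarithm, while near zero it is approximately linear, which is why it can stand in for a log scale without failing on zero or negative events.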
By default, all sample files within an experiment are included in the SPADE run. Samples cannot currently be dynamically selected for inclusion in the analysis. Use the Selective Clone functionality to create different experiments that are based on the original experiment but with different sets of files. Alternatively, delete samples from the experiment.
Currently the overall number of events in the analysis and the events contributed per file cannot be adjusted. Consider using time gates to create populations with fewer events.