Clustering is a statistical process in which events or observations from raw data are analyzed across a number of parameters and then assigned to groups called "clusters". For example, clustering might take 100,000 events and assign them to 50 clusters. After clustering, each of those 100,000 events belongs to a cluster and carries an identifier indicating which cluster that is. In this example, the identifiers are numbers between 1 and 50. Consider this example SPADE tree with 50 nodes:
(a spade tree with 50 nodes)
A subset of this same clustered data, viewed in spreadsheet form, looks like this:
(The clustered data from the SPADE tree - a cluster_id column is added to the original data to indicate which cluster each event belongs to. This principle is the same regardless of which clustering algorithm is used.)
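The idea of adding a cluster_id column can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker names, the event values, and the cluster centers are all hypothetical, and a simple nearest-center rule stands in for a real clustering algorithm (it is not how SPADE works).

```python
import math

# Hypothetical toy data: each event is a row of marker intensities.
events = [
    {"CD3": 0.9, "CD19": 0.1},
    {"CD3": 0.8, "CD19": 0.2},
    {"CD3": 0.1, "CD19": 0.95},
    {"CD3": 0.2, "CD19": 0.85},
]

# Hypothetical cluster centers, keyed by cluster identifier (starting at 1).
centers = {
    1: {"CD3": 0.85, "CD19": 0.15},
    2: {"CD3": 0.15, "CD19": 0.9},
}


def nearest_cluster(event, centers):
    """Return the identifier of the center closest to the event (Euclidean distance)."""
    def distance(center):
        return math.sqrt(sum((event[m] - center[m]) ** 2 for m in center))
    return min(centers, key=lambda cid: distance(centers[cid]))


# Add a cluster_id column to every row; the original rows are left unchanged.
clustered = [{**event, "cluster_id": nearest_cluster(event, centers)} for event in events]

for row in clustered:
    print(row)
```

Whatever algorithm produces the assignment, the result has the same shape: the original columns plus one extra column holding an integer cluster identifier.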
In Cytobank, if FCS files are exported from a SPADE tree (or imported from a clustering done elsewhere), the resulting files will have a cluster channel that can be used for downstream analysis. Plotting the cluster channel (with scales set to linear) displays the discrete cluster identifier values. These are example plots based on the SPADE tree above:
(cluster channel versus a marker) (cluster channel versus itself)
Annotation variables can be used in the same way as cluster identifiers to define groups of events or observations based on known variables. For example, in data containing one row per sample and one column per RNA transcript, you might know that some samples were given treatment 1 and some samples were given treatment 2. You can create an extra channel with this information and use it like the cluster channel described above. These annotation variables need to be coded as integers in order to be used in Cytobank. An example of how these annotation variables might be coded is shown here:
(Columns have been added for annotation variables representing gender, disease type, tumor stage, and sample type that indicate which group each sample belongs to for each of these variables. These channels can be used as cluster channels in the cluster gating workflow described below.)
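As a rough sketch of the integer coding described above, the following Python maps text labels to integer codes and adds them as a new column. The sample names, the column name treatment_code, and the choice to start codes at 1 are assumptions made for illustration, not a Cytobank requirement.

```python
# Hypothetical sample table: one row per sample with a text annotation.
samples = [
    {"sample": "S1", "treatment": "treatment 1"},
    {"sample": "S2", "treatment": "treatment 2"},
    {"sample": "S3", "treatment": "treatment 1"},
]


def integer_codes(values):
    """Map each distinct label to an integer code, starting at 1, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)), start=1)}


# Build the label-to-integer lookup, then add the coded column to each row.
codes = integer_codes(row["treatment"] for row in samples)
annotated = [{**row, "treatment_code": codes[row["treatment"]]} for row in samples]

print(codes)
for row in annotated:
    print(row)
```

Keeping the lookup table (codes) alongside the data makes it possible to translate the integers back into labels when reading plots later.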
Drawing gates around clusters (including automatically)
Cluster gating means drawing a gate around the events or observations that have a certain value in the cluster channel, in order to isolate the events belonging to a particular cluster or annotation group. This can be useful for downstream analysis (see the applications section below). Drawing gates around many clusters by hand can be challenging and tedious. Cytobank Support has a tool that can be enabled in the gating interface to generate cluster gates for all clusters. Create a support ticket to request access to this functionality!
(gates applied to all clusters automatically - only showing subset of clusters)
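Conceptually, a cluster gate is a narrow range on the cluster channel that contains exactly one integer identifier. The sketch below illustrates that idea outside of Cytobank; it is not the Cytobank tool itself, and the function names, the half_width parameter, and the event data are hypothetical.

```python
def make_cluster_gates(cluster_ids, half_width=0.5):
    """Build one (low, high) interval per cluster, centered on its integer identifier."""
    return {cid: (cid - half_width, cid + half_width) for cid in sorted(set(cluster_ids))}


def apply_gate(events, gate, channel="cluster_id"):
    """Return the events whose channel value falls inside the half-open interval [low, high)."""
    low, high = gate
    return [e for e in events if low <= e[channel] < high]


# Hypothetical events that already carry a cluster_id value.
events = [
    {"event": 0, "cluster_id": 1},
    {"event": 1, "cluster_id": 3},
    {"event": 2, "cluster_id": 1},
    {"event": 3, "cluster_id": 2},
]

# One gate per cluster, then one gated population per gate.
gates = make_cluster_gates(e["cluster_id"] for e in events)
populations = {cid: apply_gate(events, gate) for cid, gate in gates.items()}

for cid, members in populations.items():
    print(cid, gates[cid], [e["event"] for e in members])
```

Because the identifiers are integers, intervals of width 1 centered on each identifier never overlap, so every event lands in exactly one gate.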
A current limitation of this approach is that a single gate covering a collection of clusters can't readily be made. For example, you might want one gate that represents clusters 1, 17, 19, 20, and 45. Currently there is no simple way to do that other than drawing a tricky polygon gate that encircles the desired clusters while excluding the others.
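If the data can be processed outside of Cytobank, selecting a collection of clusters is a simple membership test. The sketch below is one possible workaround under that assumption, not a Cytobank feature: it also adds a hypothetical integer channel named group_id (1 for the clusters of interest, 0 otherwise), which could then be treated like any other annotation channel. The event data is made up for illustration.

```python
# Clusters of interest, taken from the example in the text.
wanted = {1, 17, 19, 20, 45}

# Hypothetical events that already carry a cluster_id value.
cluster_values = [1, 2, 17, 18, 19, 20, 44, 45, 50]
events = [{"event": i, "cluster_id": cid} for i, cid in enumerate(cluster_values)]

# Keep only the events that belong to one of the wanted clusters.
selected = [e for e in events if e["cluster_id"] in wanted]

# Alternatively, add an integer channel marking membership in the collection.
grouped = [{**e, "group_id": 1 if e["cluster_id"] in wanted else 0} for e in events]

print([e["cluster_id"] for e in selected])
print([e["group_id"] for e in grouped])
```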
Applications of cluster gating
Visually compare clustering algorithm results
Use cluster gating in concert with colored overlay dot plots and viSNE to create a figure that shows the results of viSNE and the results of a clustering algorithm at the same time. This is a way to visualize the results of one or more clustering algorithms on a viSNE map:
(Cluster results for a 10 node SPADE tree are colored on a viSNE plot. Both algorithms were run on the same data with the same channel selections. Areas where the colors disagree show where the two algorithms tend to categorize events differently.)
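A numerical complement to the visual comparison is to cross-tabulate two sets of cluster assignments for the same events. The sketch below assumes two hypothetical lists of cluster identifiers, one per algorithm, in the same event order; the values are invented for illustration.

```python
from collections import Counter

# Hypothetical cluster identifiers for the same six events from two algorithms.
first_ids = [1, 1, 2, 2, 3, 3]
second_ids = [1, 1, 1, 2, 2, 2]

# Count how many events fall into each (first, second) pair of clusters.
table = Counter(zip(first_ids, second_ids))

for (first, second), count in sorted(table.items()):
    print(f"first={first} second={second} events={count}")
```

A cluster from one algorithm that is split across several clusters of the other (here, cluster 2 of the first list) corresponds to a region where the colors on the overlay plot would be mixed.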
Visually compare annotation variables with high dimensional analysis results
Use cluster gating in concert with colored overlay dot plots and viSNE to visualize how annotation variables are correlated with samples that are found to be similar based on viSNE or clustering algorithms.
(An annotation variable representing cell line is colored on a viSNE plot. viSNE was run using multiple RNA and protein biomarkers measured in these cell lines to group similar samples. Using this coloring scheme, we can see that the biomarker expression signature across all markers is correlated with cell line.)