- Selecting Populations and Samples
- Event Count Sampling
- Selecting Clustering Channels
- Selecting Compensation
- Scale Settings
- SOM Creation
- Choosing a Target Number of Clusters
- Choosing a Target Number of Metaclusters
- Choosing a Metaclustering Method
- Configuring Iterations
- Setting the Seed
- Effect of FlowSOM Settings on Algorithm Run Time
- Configuring PDF Output
- Populations for pie charts
- Channels to plot
- Cluster sizing
- Output file type
- Toggle metacluster background in plots
Selecting Populations and Samples
The first step in running a FlowSOM analysis is choosing one or more populations from which the events will be sourced, and which samples (i.e. files) will be used for the analysis. The populations available in this list are taken from manual gating done in the experiment on Cytobank, including any tailored gates. If populations are missing from the list or event counts seem incorrect, make sure the most recent version of changes in the gating interface has been applied to the experiment.
The correct population for a FlowSOM analysis will vary by experiment. Users who want a more objective look at the data can choose a more general population (such as CD45+ or Singlets). Users who are interested in a single subset can select a more specific population (such as CD4+) to narrow the focus of the analysis. Regardless, we recommend manually gating to at least Live Singlets to allow for a cleaner representation of the data.
After a population is selected, a list of samples is visible under the FCS Files section. Click on the filenames to choose the samples that should be included in the FlowSOM analysis. Note that samples designated as experimental controls will not appear for selection.
The number of events that will be included (sampled) from each population/file is displayed next to the total available events and will be dependent on the next configuration step: event sampling.
Event Count Sampling
With FlowSOM in Cytobank, you can include up to 4 million events on Premium and Enterprise Cytobank with basic compute. FlowSOM can run up to 90 million events on Enterprise servers with a compute upgrade. Learn more about different Cytobank offerings.
You can adjust the total number of events that will be included in your FlowSOM run in the 'Event Sampling' box on the setup page. If your selected populations from files include more events than shown in the ‘Desired Events Per File’, a subset of events in the populations will be selected randomly for inclusion in the analysis. In this case, there are three options for controlling the way that the events will be chosen:
1) Equal sampling - Each population/file will be sampled equally. The number of events sampled will be equal to the user-entered ‘Desired Events Per File’.
2) Proportional sampling - Each population/file combination present will be sampled according to its relative abundance as a function of total available events. The percentages will match between the available events and the sampled events.
3) Use all events - All events from each file will be included if the total number of events is lower than the specified event limit on the server.
How many events to sample depends on how many files are present in the analysis, the analysis goals, and the time constraints of the researcher. If you have many samples/files to compare and/or are studying rare populations, you will need a higher overall event count so that each file contributes enough events to show the populations you are interested in. A higher event count will, however, increase the time it takes to run your analysis.
Whether to use proportional or equal sampling will depend on the availability of events within the dataset, the relative abundance of each file, and the goals of the analysis.
Equal sampling is a good choice for most applications or if you aren’t sure. Using equal sampling makes visualization of differences in the structure and relative abundance of populations across files easy. It’s also required by CITRUS when comparing population abundances, and so if this FlowSOM will be linked with a viSNE that will be used to visualize CITRUS results, it makes sense to use equal sampling here. It’s also perfectly acceptable to use equal sampling if your intention is to compare functional marker expression in a given population(s) across samples. The one case in which equal sampling may not be ideal is when you have one or more files that have very low event counts compared to most of the other files, because these files would then limit the number of events taken from all the files. In this case, you can either exclude these low event count files or use proportional sampling.
Proportional sampling should be used in certain special circumstances. For example, use proportional sampling when the individual files vary widely in the number of available events and you don’t want to exclude ones with low event counts. It is acceptable to use proportional sampling and then compare population-specific functional marker expression with analysis algorithms like CITRUS, and doing so can help include samples in these analyses even when the samples have a wide range of event counts.
Use all events is appropriate when you have fewer events in your experiment than the server event cap and are interested in using all of them in the analysis. There is no random down-sampling that happens in this case, as all events are used. This option does not enforce Equal or Proportional sampling.
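The difference between equal and proportional sampling can be sketched with a little arithmetic. The example below is only an illustration and not Cytobank's implementation: the function names, the file names, and the `total_budget` parameter are hypothetical, and it assumes that under equal sampling the smallest file caps what is taken from every file, as described above.

```python
def equal_sampling(available, desired_per_file):
    """Take the same number of events from every file.

    Assumption: the per-file count is capped by the smallest file, so a
    file with very few events limits what is taken from all files.
    """
    per_file = min(desired_per_file, min(available.values()))
    return {name: per_file for name in available}


def proportional_sampling(available, total_budget):
    """Sample each file according to its share of all available events.

    Assumption: `total_budget` is a hypothetical overall event target.
    Integer division rounds each file's count down.
    """
    total = sum(available.values())
    if total <= total_budget:
        return dict(available)  # nothing to down-sample
    return {name: count * total_budget // total
            for name, count in available.items()}


# Hypothetical event counts for one gated population in three files.
available = {
    "patient_A.fcs": 200_000,
    "patient_B.fcs": 50_000,
    "patient_C.fcs": 10_000,
}

equal = equal_sampling(available, desired_per_file=25_000)
proportional = proportional_sampling(available, total_budget=26_000)

print(equal)         # every file contributes 10,000 events
print(proportional)  # 20,000 / 5,000 / 1,000 events, matching the 200:50:10 ratio
```

In this sketch the low-count file (patient_C) drags equal sampling down to 10,000 events per file, while proportional sampling keeps all three files in the same ratio as their available events.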
Selecting Clustering Channels for FlowSOM
The channels or markers selected for FlowSOM are what the algorithm uses to identify events or observations that are similar and put them in clusters and metaclusters. Channels that you do not select as clustering channels will still be passed through to the resulting FlowSOM analysis experiment for downstream analysis (e.g. readouts). The following major points should be considered when selecting channels or markers for a FlowSOM analysis:
1) Select the channels that may contribute to separating cell subsets or groups of samples
For single cell data, at a minimum this should include any phenotyping channels you would typically require to gate cell subsets of interest. This will allow you to use FlowSOM in a pipeline that replaces manual gating and to compare your manually gated populations easily. Advanced workflows may also include functional marker channels, but this may make interpretation of results more difficult.
For bulk or pooled data, selected channels should include any markers that you think may contribute to separating groups of samples in your data. Typically, this will include all markers in your file that were not used for data pre-processing (e.g. normalization), annotation, or stratification.
2) Exclude channels that are used for data pre-processing
Data pre-processing includes pre-gating steps that are done in Cytobank and steps such as data normalization or QC that are done upstream. Channels like this that have already been used in the analysis pipeline should generally be excluded in FlowSOM. For single cell data, examples of these channels include scatter, DNA content, viability, bead, and time channels. If your pre-gated population is something like CD45+ cells, channels like CD45 used to select this population should also be excluded. For bulk or pooled data, channels that should be excluded because they are used in pre-processing often measure things like housekeeping genes or exogenous control genes.
3) Exclude annotation channels
Annotation channels are channels that contain information on variables that you want to correlate with algorithm results and are often added when looking at bulk data. However, as they often aren't continuous variables, they are inappropriate to use in FlowSOM (e.g. pre-treatment and post-treatment coded as '1' and '2' in a 'Treatment Status' channel).
4) Consider whether to mix channels with different scales and transformations
Scaling in general should be checked before an analysis to make sure that data are being transformed correctly by the scale settings. Mixing channels that have vastly different amplitudes or ranges after scaling can impact the results of FlowSOM and any downstream algorithms used in the analysis and should be done with care. In general, channels with very high values (such as typical linear channels in cytometry data) or a larger dynamic range will have more influence on your results than channels with lower values. To limit potential problems from this effect and maximize its exploratory utility, FlowSOM centers all channels before it runs by subtracting the mean of each channel from every observation. However, downstream algorithms in your pipeline may not do this, and even with this correction, mixing vastly different scales may prevent FlowSOM from resolving groups of events or observations. Channels with little to no range in signal across the dataset will not influence FlowSOM results and can be removed from the analysis as well, if desired. Learn more about scale settings.
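To see why a channel with a large range can dominate, it helps to look at how much each channel contributes to the distance between two events. The sketch below assumes a squared Euclidean distance and uses made-up channel names and values; it is an illustration of the effect, not a reproduction of FlowSOM's internals.

```python
def channel_share_of_distance(event_a, event_b):
    """Return each channel's fraction of the squared Euclidean distance
    between two events (assumes the events differ in at least one channel)."""
    squared = {ch: (event_a[ch] - event_b[ch]) ** 2 for ch in event_a}
    total = sum(squared.values())
    return {ch: sq / total for ch, sq in squared.items()}


# Hypothetical events: one linear scatter channel and two arcsinh-scaled markers.
event_a = {"FSC-A": 60_000.0, "CD3": 0.5, "CD19": 4.0}
event_b = {"FSC-A": 62_000.0, "CD3": 4.5, "CD19": 0.5}

shares = channel_share_of_distance(event_a, event_b)
for channel, share in shares.items():
    print(f"{channel}: {share:.6f}")
```

Here the two events have opposite CD3 and CD19 phenotypes, yet almost all of the distance comes from a modest difference in the linear scatter channel.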
Channels not appearing for selection during FlowSOM setup?
Sometimes there may be channels missing on the FlowSOM setup page but otherwise present in the experiment. This happens when there are panel discrepancies between files in the experiment. The clustering channels available for the FlowSOM analysis must be common to all files in the experiment. If two files have different panels, only those channels common to both will be available for the run. Although you cannot choose channels unique to certain files as clustering channels, you will still be able to view and analyze these channels within files after the run is complete.
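The rule that only shared channels are selectable amounts to a set intersection across file panels. The following sketch uses hypothetical file names and panels to show which channels would remain available; the function name is illustrative only.

```python
def common_channels(panels):
    """Return the channels present in every file, in the order of the first file."""
    names = list(panels)
    shared = set(panels[names[0]])
    for name in names[1:]:
        shared &= set(panels[name])
    return [ch for ch in panels[names[0]] if ch in shared]


# Hypothetical panels: the two files differ in their last channel.
panels = {
    "sample_1.fcs": ["CD3", "CD4", "CD8", "CD19"],
    "sample_2.fcs": ["CD3", "CD4", "CD8", "CD56"],
}

selectable = common_channels(panels)
print(selectable)  # CD19 and CD56 are each missing from one file, so they drop out
```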
The channel differences might just be from differences in the reagent name between files. If this is the case, simply modify the channel information in Cytobank and combine the files into one panel (learn more about panels and channels). Sometimes, however, the files have differences in channel names that can't be edited in Cytobank and the files cannot be combined into a single panel. In this case, consider selective cloning to make an experiment with a subset of the files that have all the same channel information. Another way to approach the problem is to temporarily hide unwanted files by setting them as compensation controls. Files set as compensation controls are not considered when determining the available channels in FlowSOM, so the missing channels will appear again on the FlowSOM setup page. If necessary, contact Cytobank Support.
Selecting Compensation for FlowSOM
Compensation should be applied to fluorescent data as it would be for any other analysis of these data by selecting the appropriate compensation from the menu. For files uploaded by DROP or FCS files that have no internal compensation matrix, leave the default option (file-internal compensation) and no compensation will be applied. The compensation setting will be defaulted (and locked) to match the one associated with the chosen population.
How Scale Settings Affect FlowSOM Results
Scaling for FlowSOM is based on the scale settings in your experiment. Scale transformations and scale arguments will affect FlowSOM results, whereas scale min and max will not (i.e. FlowSOM doesn’t care if data are piled on the edge of a 2D plot). In general, if data are scaled appropriately for normal analysis involving manual gating, then the scale settings are fine for FlowSOM analysis. Note that log-based scales may cause FlowSOM failures due to attempted calculation of the logarithm of a negative number or zero. For log-like scaling that handles negative numbers and zero, please use arcsinh scales. Learn more about scale settings.
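The reason arcsinh is safer than a log scale can be shown with a few raw values. This is a minimal sketch using Python's standard `math` module; the cofactor of 150 and the raw values are arbitrary choices for illustration, not recommended settings.

```python
import math


def arcsinh_scale(value, cofactor):
    """Arcsinh transform: defined for negative values and zero."""
    return math.asinh(value / cofactor)


def log_scale(value):
    """Log10 transform: raises ValueError for zero or negative input."""
    return math.log10(value)


raw_values = [-50.0, 0.0, 150.0, 10_000.0]
cofactor = 150.0  # hypothetical scale argument

arcsinh_values = [arcsinh_scale(v, cofactor) for v in raw_values]

# Collect the raw values that a log scale cannot handle.
log_failures = []
for v in raw_values:
    try:
        log_scale(v)
    except ValueError:
        log_failures.append(v)

print(arcsinh_values)
print(log_failures)  # the negative value and zero fail on the log scale
```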
FlowSOM supports channel normalization (the "Normalize Scales" option). When the Normalize Scales option is turned on, each channel is normalized such that it has a mean value of zero and a standard deviation of 1. This is done by first concatenating all files within the FlowSOM run, then for each event value per channel, subtracting the mean and dividing by the standard deviation of the channel. Normalizing scales can be a useful strategy when channels have different dynamic ranges, as is often the case in fluorescence flow cytometry. However, you should try runs with and without normalization enabled, as sometimes normalization can have a negative impact. You can inspect the impact of normalization rapidly with a FlowSOM-on-viSNE workflow.
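The normalization described above can be sketched for a single channel as follows. The file names and values are made up, and the use of the population standard deviation is an assumption; the source does not say which estimator is used.

```python
import statistics


def normalize_channel(values_by_file):
    """Z-score one channel using the mean and SD of all files pooled together."""
    pooled = [v for values in values_by_file.values() for v in values]
    mean = statistics.fmean(pooled)
    sd = statistics.pstdev(pooled)  # assumption: population standard deviation
    if sd == 0:
        # A channel with no range carries no information; return zeros.
        return {name: [0.0] * len(values) for name, values in values_by_file.items()}
    return {name: [(v - mean) / sd for v in values]
            for name, values in values_by_file.items()}


# Hypothetical scaled CD3 values from two files.
cd3_by_file = {
    "sample_1.fcs": [1.0, 2.0, 3.0],
    "sample_2.fcs": [4.0, 5.0],
}

normalized = normalize_channel(cd3_by_file)
pooled_after = [v for values in normalized.values() for v in values]

# After normalization the pooled channel has mean 0 and standard deviation 1.
print(round(statistics.fmean(pooled_after), 6), round(statistics.pstdev(pooled_after), 6))
```

Note that the mean and standard deviation come from the concatenated data, so individual files will generally not have a mean of exactly zero on their own.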
Self-Organizing Map (SOM) Creation
As Sofie Van Gassen, the author of the algorithm, states: "A SOM is a specific type of artificial neural network, used for clustering. It consists of a grid of nodes, in which each node represents a point in the multidimensional input space. When clustering, a new point is classified with the node that is its nearest neighbor. The grid is trained in such a way that the nodes closely connected to each other resemble each other more than nodes that are only connected through a long path. As such, the grid contains topological information and a single training point can influence multiple nodes."
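The classification step in the quote, where a point is assigned to the node that is its nearest neighbor, can be illustrated in a few lines. The node positions and events below are invented two-channel values, and Euclidean distance is assumed; training of the grid is not shown.

```python
import math


def nearest_node(event, nodes):
    """Return the index of the SOM node closest to the event (Euclidean distance)."""
    return min(range(len(nodes)), key=lambda i: math.dist(event, nodes[i]))


# Hypothetical trained node positions in a two-channel space.
nodes = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]

# Hypothetical events to classify.
events = [(0.5, 0.2), (4.0, 1.0), (1.0, 6.0)]

assignments = [nearest_node(event, nodes) for event in events]
print(assignments)  # one node index per event
```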
Functionally, a SOM is the persistent structure that your populations map to on the minimum spanning tree (MST). If the 'Create a new SOM' option is selected, a new layout will be generated with the FlowSOM run. FlowSOM will generate a new MST structure each time it is run with this option selected. If the 'Use an existing SOM from another run' option is selected, the clusters generated from the current FlowSOM run will be mapped to the tree structure of the previous FlowSOM. The previous SOM can be selected from a drop-down menu that contains only the SOMs a user has access to that use the same channels as the analysis currently being set up. The following settings cannot be changed when re-using a SOM (to prevent data integrity issues): Clustering Channels, Number Clusters, Iterations, Normalize Scales. Normalize Scales is set to the same configuration that was used in the experiment from which the SOM is being imported. Be careful not to change the scale type or cofactor when re-using SOMs.
The ability to make a persistent map can be beneficial in situations such as ongoing trials or studies, where a user wants to see FlowSOM results halfway through and then use that same SOM for the final analysis. If new populations are present in the subsequent FlowSOM run, new branches will be added to the MST to account for them.
Choosing a Target Number of Clusters
The number of clusters for a FlowSOM run determines how many clusters will be present in the results. The correct number of clusters to select presents a sort of "Goldilocks problem". Setting the target number of clusters lower simplifies the tree but increases the chance that a rare or subtle population will be merged into a cluster with a cell population that is actually different. This effect can be referred to as underclustering. Setting the target number of clusters higher increases the chance of correctly isolating a rare or subtle population, but also increases the amount of noise in the results, since homogeneous abundant populations will be split across many clusters (overclustering). Leaning toward overclustering is preferable in most cases because it can be mitigated during analysis by adjusting the number of metaclusters, whereas the consequences of underclustering can be hard or impossible to detect. Thus, a middle road needs to be taken that provides enough clusters for rare or subtle populations without excessively overclustering the dataset. A useful value for the target number of clusters will depend on preexisting knowledge of the complexity of the dataset, the number of channels in the data files, the needs of the researcher, and empirical evaluation of results from multiple runs on the same dataset.
Choosing a Target Number of Metaclusters
The number of metaclusters for a FlowSOM run determines how many metaclusters the clusters will be organized into. The correct value for metaclusters will depend on the heterogeneity of the underlying data, the number of clusters present, the needs of the researcher, and an empirical evaluation of the results from multiple runs on the same dataset. Functionally, it's beneficial to change this number until the smallest population of interest is identified by a single metacluster in the results. This can be aided by setting the number of traditional clusters higher than anticipated. Using this method, larger populations with some heterogeneity may be split into sub-populations, but they can be annotated as such.
Choosing a Metaclustering Method
Hierarchical consensus clustering works by subsampling the points several times and building a hierarchical clustering for each subsample. A final clustering is then made based on how often the same points are clustered together. Because it tests the stability of the clustering, this method generally gets better results than applying the basic hierarchical clustering algorithm.
Hierarchical clustering uses a traditional hierarchical algorithm with no subsampling.
k-means clustering uses a traditional k-means algorithm with no subsampling. Both k-means and normal hierarchical clustering are included as options for reference to other computational methods but often perform less well than the consensus hierarchical clustering method.
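To make the idea of metaclustering concrete, the sketch below groups a handful of cluster centers with a basic agglomerative (hierarchical) procedure. It is a toy illustration only: average linkage, Euclidean distance, the function name, and the example centers are all assumptions, and the consensus variant with repeated subsampling is not shown.

```python
import math


def hierarchical_metaclusters(centers, n_metaclusters):
    """Merge cluster centers bottom-up until n_metaclusters groups remain.

    Assumption: average linkage with Euclidean distance.
    Returns one metacluster label per input center.
    """
    groups = [[i] for i in range(len(centers))]

    def average_linkage(a, b):
        total = sum(math.dist(centers[i], centers[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(groups) > n_metaclusters:
        # Find the pair of groups with the smallest average distance.
        _, i, j = min(
            (average_linkage(groups[i], groups[j]), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]

    labels = [0] * len(centers)
    for label, members in enumerate(groups):
        for member in members:
            labels[member] = label
    return labels


# Hypothetical cluster centers in a two-channel space.
centers = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]

metacluster_labels = hierarchical_metaclusters(centers, n_metaclusters=3)
print(metacluster_labels)  # nearby centers share a metacluster label
```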
Configuring Iterations
Iterations, or rlen in the algorithm's documentation, control how many times the training set is presented to the algorithm for generation of the SOM. The algorithm uses a default iteration number of 10, which is listed as a good number to start with by the authors of the algorithm. If time isn't a factor, the number can be increased to potentially improve the quality of the results. However, the authors found that it was generally not advantageous to increase the iteration value, as datasets often already contain many cells and thus have an acceptable level of redundancy.
Setting the Seed
The FlowSOM algorithm is stochastic, meaning that it uses random steps to help it do its job. Since we run FlowSOM on a computer, we can “set the seed” of the pseudorandom number generator in the computer to allow us to repeat the same FlowSOM run on the same data with the same settings and get the exact same results. This is most useful for validating reproducibility where required. Additionally, you may be able to test small changes to a FlowSOM run and use a set seed to help keep the clusters in relatively similar locations. If you don’t do anything to set the FlowSOM seed, Cytobank will set one for you; after your FlowSOM is finished, you can see the seed that was selected by viewing the settings that were used for the FlowSOM run. Then, use this same seed for any future run on the same data where you want to reproduce the results.
There are a few important ground rules to keep in mind for setting the seed:
1) You should not use the same seed for every different analysis you run. If you do, this limits the ability of the algorithm to be random, which is part of why the algorithm works well.
2) If you pick your own value for the seed, pick an arbitrary number. Something you get from a dollar bill or receipt in your wallet often works well.
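The effect of setting a seed can be demonstrated with any pseudorandom number generator. The sketch below uses Python's standard `random` module as a stand-in; it does not reproduce the generator FlowSOM or Cytobank use, and the function name and seed values are arbitrary.

```python
import random


def shuffled_event_order(n_events, seed):
    """Return a pseudorandom ordering of event indices for a given seed."""
    rng = random.Random(seed)  # independent generator initialized with the seed
    order = list(range(n_events))
    rng.shuffle(order)
    return order


run_1 = shuffled_event_order(10, seed=8675309)
run_2 = shuffled_event_order(10, seed=8675309)  # same seed, same data
run_3 = shuffled_event_order(10, seed=20240117)  # a different arbitrary seed

print(run_1 == run_2)  # same seed gives the identical ordering
print(run_1, run_3)    # a different seed will almost certainly give a different ordering
```

The same principle applies to a FlowSOM run: identical data, settings, and seed lead to identical random steps, which is what makes the results repeatable.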
Effect of FlowSOM Settings on Algorithm Run Time
FlowSOM is efficient, but its run time is affected primarily by the number of events in the analysis. Other settings change the run time as well: increasing the number of channels used for clustering, the number of clusters, the number of metaclusters, or the number of iterations will result in a longer analysis.
Configuring the PDF output
The PDF output of FlowSOM is very flexible and can be controlled by the following options:
Populations for pie charts - This box controls which events are displayed on the figure that maps the contents of each cluster as a pie chart. If the events in the cluster are not in any of the populations selected, they will appear as an "Unknown" white population. To avoid this, it is possible to select the population on which you are running FlowSOM (e.g. CD45+) and any events not in the specific populations will be labeled as present in the general population.
Channels to plot - This box controls which channels appear in the 'channel_colored' minimum spanning tree (MST) figure that shows the median signal intensity for each cluster node. Generally, we recommend including all clustering and readout channels while omitting scatter, Time, and other similar channels, but if a focus on specific markers is desired, it is possible to select those here. Reducing the number of channels to output will also decrease the total output size, making for faster downloads.
Cluster sizing - Clusters will be displayed as Relative by default which results in the clusters that represent many events being larger than the clusters with few events. This effect is visible on the channel_colored_MSTs, the population_pie_charts, and the star_plots. To change these clusters to all be the same size, select the Fixed option, and to feature copies of the plots sized both ways, select Relative and Fixed. You can configure the "Max cluster size" for the Relative option, which specifies the size (in pixels) of the largest cluster. Likewise, you can configure the cluster size for all clusters under the Fixed option. If you are experiencing overlapping clusters, try reducing the cluster size.
Output File Type - This option controls whether the figures generated will be in a PDF or PNG format. PDF files generated by FlowSOM will be significantly smaller than PNG files; however, PNG files are typically better for creating figures.
Toggle metacluster background in plots - The metaclusters generated by FlowSOM will appear as translucent, colored rings around each cluster. Each cluster with the same colored ring is part of the same metacluster. These rings appear by default on only the population_pie_chart figure; however, they can also be included on the channel_colored_MSTs and the legend plots.