viSNE is an algorithm that provides a two dimensional categorized map representation of complex multi-parameter data. There are a small number of configuration steps that must be taken before running a viSNE analysis. This article provides an overview of these configuration steps and concepts to consider during setup. Click the links below to jump to any section in this article.
- Selecting Populations and Samples
- Event Count Sampling
- Selecting Channels
- Selecting Compensation
- Scale Settings and viSNE
- Advanced Settings
- Effect of viSNE Settings on Algorithm Run Time
The first step in running a viSNE analysis is choosing one or more populations from which the events will be sourced, and which samples (i.e. files) will be used for the analysis. The populations available in this list are taken from manual gating done in the experiment on Cytobank, including any tailored gates. If populations are missing from the list or event counts seem incorrect, make sure the most recent version of changes in the gating interface has been applied to the experiment.
When a population is selected, a list of samples will appear. Choose the samples that should be included in the viSNE analysis. Note that samples designated as experimental controls will not appear for selection.
(animation showing the selection of populations and samples)
To run viSNE on single cell data, you should typically select only a single population and then select all files that you want to compare. At a minimum, we recommend pre-gating to live singlets. However, you may want to pre-gate further (e.g. to remove a dump channel or to get to CD4+ T cells) in order to focus on the cells your panel is designed to interrogate or to be able to combine files from more samples in your viSNE map (while still falling within the event limit). From the selected pre-gated population, viSNE will create a visualization of all of the child cell types within the data and the heterogeneity in them. You may also use multiple pre-gated populations if you want to include multiple lineages in your viSNE map while excluding others.
To run viSNE on bulk or pooled data, you will typically select the ‘Ungated’ population to include all of the samples in your file in the viSNE. Alternatively, if you have used the gating tools to stratify your data ahead of time, you might select a population that corresponds to a stratified group.
Once you have selected populations and files to include in your viSNE run, the number of events that will be included (sampled) from each population/file is displayed next to the total available events and will be dependent on the next configuration step: event sampling.
(example of multi-population selection for viSNE with each population/file combination showing the number of available events. Most viSNE runs will select a single population only)
With viSNE in Cytobank, you can include up to 2M events on Enterprise Cytobank and 1.3M events on Premium Cytobank. Learn more about different Cytobank offerings.
You can adjust the total number of events that will be included in your viSNE run in the box next to ‘Desired Total Events’ on the setup page. If your selected populations and files include more events than the ‘Desired Total Events’, a subset of events in the populations and files will be selected randomly for inclusion in the map. In this case, there are two options for controlling the way that the events will be chosen:
1) Proportional sampling - Each population/file combination present will be sampled according to its relative abundance as a function of total available events. The percentages will match between the available events and the sampled events.
(proportional sampling example - 22,000 events have been selected for viSNE analysis)
2) Equal sampling - Each population/file will be sampled equally. The number of events sampled will be equal to the user-entered ‘Desired Total Events’ divided by the number of populations/files.
(equal sampling example - 22,000 events have been selected for viSNE analysis)
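The two sampling modes can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not Cytobank's actual implementation: the function names and file names are hypothetical, and the cap on equal sampling by the smallest file is an assumption about how low event count files limit the per-file number.

```python
def proportional_sample(available, desired_total):
    """Sample each population/file in proportion to its share of all events."""
    total = sum(available.values())
    if total <= desired_total:
        # Nothing to subsample: every available event fits in the run.
        return dict(available)
    # Integer division keeps the sampled total at or below desired_total.
    return {name: desired_total * n // total for name, n in available.items()}


def equal_sample(available, desired_total):
    """Sample the same number of events from each population/file."""
    total = sum(available.values())
    if total <= desired_total:
        return dict(available)
    per_file = desired_total // len(available)
    # Assumption: a file with fewer events than per_file caps every file.
    per_file = min(per_file, min(available.values()))
    return {name: per_file for name in available}


# Hypothetical event counts for three files.
available = {"file_A": 60000, "file_B": 30000, "file_C": 10000}
print(proportional_sample(available, 22000))
print(equal_sample(available, 22000))
```

With these hypothetical counts, proportional sampling takes 13,200, 6,600, and 2,200 events (keeping the 60/30/10 percent split), while equal sampling takes 7,333 events from each file.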
How many events to sample depends on how many files are in the analysis, the analysis goals, and the time constraints of the researcher. If you have more samples/files to compare and/or are studying rare populations, you will need a higher overall event count so that enough events from each file are included to see the populations you are interested in on the viSNE map. A higher event count will increase the time it takes to run your analysis and may necessitate an increase in the number of iterations used in the analysis (see discussion on iterations below). For more information, see our blog post that summarizes how many events you need from each sample to detect rare populations.
Equal sampling is a good choice for most applications or if you aren’t sure. Equal sampling makes it easy to visualize differences in the structure and relative abundance of populations across files. It’s also required by CITRUS when comparing population abundances, so if this viSNE will be used to visualize those results, it makes sense to use equal sampling here. It’s also perfectly acceptable to use equal sampling if your intention is to compare functional marker expression in a given population(s) across samples. The one case in which equal sampling may not be ideal is when you have one or more files that have very low event counts compared to most of the other files, because these files would then limit the number of events taken from all the files. In this case, you can either exclude these low event count files or use proportional sampling.
Proportional sampling should be used in certain special circumstances. For example, use proportional sampling when the individual files vary widely in the number of available events and you don’t want to exclude ones with low event counts, or when you are selecting multiple populations for inclusion in one viSNE map. It is acceptable to use proportional sampling and then compare population-specific functional marker expression with analysis algorithms like CITRUS, and doing so can help include samples in these analyses even when the samples have a wide range of event counts.
The channels (or markers) selected for viSNE are what the algorithm uses to identify events or observations that are similar and put them close together in the resulting map of the data. The following major points should be considered when selecting channels or markers for a viSNE analysis:
1) Select the channels that may contribute to separating cell subsets or groups of samples
For single cell data, at a minimum this should include any phenotyping channels you would typically require to gate cell subsets of interest. This will allow you to use viSNE in a pipeline that replaces manual gating and to compare your manually gated populations easily. Advanced workflows may also include functional marker channels, but this may make interpretation of results more difficult.
For bulk or pooled data, selected channels should include any markers that you think may contribute to separating groups of samples in your data. Typically, this will include all markers in your file that were not used for data pre-processing (e.g. normalization), annotation, or stratification.
2) Exclude channels that are used for data pre-processing
Data pre-processing includes pre-gating steps that are done in Cytobank and steps such as data normalization or QC that are done upstream. Channels like this that have already been used in the analysis pipeline should generally be excluded in viSNE. For single cell data, examples of these channels include scatter, DNA content, viability, bead, and time channels. If your pre-gated population is something like CD45+ cells, channels like CD45 used to select this population should also be excluded. For bulk or pooled data, channels that should be excluded because they are used in pre-processing often measure things like housekeeping genes or exogenous control genes.
3) Exclude annotation channels
Annotation channels are channels that contain information on variables that you want to correlate with algorithm results or use for data stratification or visualization. In single cell data, these might include cluster channels resulting from a clustering algorithm like SPADE or CITRUS, or sample-level variables like clinical outcome, treatment, or demographic variables. In bulk or pooled data, these will code for variables like clinical outcome, treatment, or demographic variables.
4) Consider whether to mix channels with different scales and transformations
Scaling in general should be checked before an analysis to make sure that data are being transformed correctly by the scale settings. Mixing channels that have vastly different amplitudes or ranges after scaling can impact the results of viSNE and any downstream algorithms used in the analysis and should be done with care. In general, channels with very high values (such as typical linear channels in cytometry data) or a larger dynamic range will have more influence on your results than channels with lower values. To limit potential problems from this effect and maximize its exploratory utility, viSNE centers all channels before it runs by subtracting the mean of each channel from every observation. However, downstream algorithms in your pipeline may not do this, and even with this correction, mixing vastly different scales may prevent viSNE from resolving groups of events or observations. Channels with little to no range in signal across the dataset will not influence viSNE results and can be removed from the analysis as well, if desired. Learn more about scale settings.
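The centering step described above can be illustrated with a minimal Python sketch. The function names and the two-channel toy data are hypothetical, and this is not Cytobank's code. It shows that subtracting the channel mean removes the offset of each channel but leaves its range unchanged, which is why a channel with a much larger range can still dominate.

```python
from statistics import mean


def center_channels(events):
    """Subtract each channel's mean from every event (rows = events)."""
    channel_means = [mean(column) for column in zip(*events)]
    return [[value - m for value, m in zip(row, channel_means)]
            for row in events]


def channel_ranges(events):
    """Return max - min for each channel (column)."""
    return [max(column) - min(column) for column in zip(*events)]


# Two events, two channels: the second channel has 100x the amplitude.
events = [[1.0, 100.0], [3.0, 300.0]]
centered = center_channels(events)
print(centered)                  # each channel now has a mean of zero
print(channel_ranges(centered))  # ranges are unchanged by centering
```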
Sometimes there may be channels missing on the viSNE setup page but otherwise present in the experiment. This happens when there are panel discrepancies between files in the experiment. The channels available for the viSNE analysis must be common to all files in the experiment. If two files have different panels, only those channels common to both will be available for the run. Although you cannot choose channels that are unique to certain files, you will still be able to view and analyze these channels within those files after the run is complete.
The channel differences might just be from differences in the reagent name between files. If this is the case, simply modify the channel information in Cytobank and combine the files into one panel (learn more about panels and channels). Sometimes, however, the files have differences in channel names that can't be edited in Cytobank and the files cannot be combined into a single panel. In this case, consider selective cloning to make an experiment with a subset of the files that have all the same channel information. Another way to approach the problem is to temporarily hide unwanted files by setting them as compensation controls. Files set as compensation controls are not considered when determining the available channels in viSNE, so the missing channels will appear again on the viSNE setup page. If necessary, contact Cytobank Support.
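The "common to all files" rule amounts to a set intersection across panels. The sketch below is a hypothetical illustration (the function, file names, and channel names are invented), but it shows how a single naming mismatch between files can remove a channel from the list.

```python
def common_channels(panels):
    """Return channels present in every file, in the first file's order.

    panels: dict mapping file name -> list of channel names.
    """
    panel_lists = list(panels.values())
    if not panel_lists:
        return []
    shared = set(panel_lists[0]).intersection(*panel_lists[1:])
    return [channel for channel in panel_lists[0] if channel in shared]


# Hypothetical panels: "CD8" and "CD8a" differ only in reagent naming.
panels = {
    "sample_1.fcs": ["CD3", "CD4", "CD8"],
    "sample_2.fcs": ["CD3", "CD4", "CD8a"],
}
print(common_channels(panels))
```

Here only CD3 and CD4 remain selectable, because the CD8 channel is named differently in the two files.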
Compensation should be applied to fluorescent data as it would be for any other analysis of these data by selecting the appropriate compensation from the menu. For files uploaded by DROP or FCS files that have no internal compensation matrix, leave the default option (file-internal compensation) and no compensation will be applied.
Scaling for viSNE is based on the scale settings in your experiment. Scale transformations and scale arguments will affect viSNE results, whereas scale min and max will not (i.e. viSNE doesn’t care if data are piled on the edge of a 2D plot). In general, if data are scaled appropriately for normal analysis involving manual gating, then the scale settings are fine for viSNE analysis. Note that log-based scales may cause viSNE failures because the logarithm of zero or a negative number is undefined. For log-like scaling that handles negative numbers and zero, please use arcsinh scales. Learn more about scale settings.
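The difference between log and arcsinh scaling for zero and negative values can be seen in a short Python sketch. The function names are hypothetical and the cofactor of 5 is only an illustrative scale argument, not a recommended or default Cytobank setting.

```python
import math


def arcsinh_scale(value, cofactor=5.0):
    """Arcsinh transform; defined for negative values and zero."""
    return math.asinh(value / cofactor)


def log_scale(value):
    """Base-10 log transform; raises ValueError for value <= 0."""
    return math.log10(value)


for raw in (-10.0, 0.0, 10.0, 1000.0):
    try:
        logged = log_scale(raw)
    except ValueError:
        logged = None  # log is undefined for zero and negative numbers
    print(raw, arcsinh_scale(raw), logged)
```

For large positive values both transforms grow logarithmically, which is why arcsinh is described as log-like; near and below zero only arcsinh stays defined.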
The ability to set advanced settings for viSNE, including number of iterations, perplexity, and theta, is not available for user accounts with trial subscriptions except in special cases on Enterprise Cytobanks.
The Barnes-Hut implementation of t-SNE, which underlies viSNE in Cytobank, has a number of settings that can be changed to affect the results of a viSNE analysis. For new users, we advise using Cytobank’s default settings until a practical need to change them emerges. See below for discussion of factors that may require changing each setting.
The viSNE algorithm is stochastic, meaning that it uses random steps to help it do its job. Since we run viSNE on a computer, we can “set the seed” of the pseudorandom number generator in the computer to allow us to repeat the same viSNE run on the same data and get the exact same results. This may be useful for validating workflows that include viSNE or repeating the same viSNE in a different Cytobank experiment. Additionally, you may be able to test small changes to a viSNE run and use a set seed to help keep the islands in relatively similar locations. If you don’t do anything to set the viSNE seed, Cytobank will set one for you; after your viSNE is finished, you can see the seed that was selected by viewing the settings that were used for the viSNE run. Then, use this same seed for any future run on the same data where you want to reproduce the results. There are a few important ground rules to keep in mind for setting the seed:
1) You should not use the same seed for every different analysis you run. If you do, this limits the ability of the algorithm to be random, which is part of why the algorithm works well.
2) If you pick your own value for the seed, pick an arbitrary number - something you get from a dollar bill or receipt in your wallet often works well.
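What "setting the seed" means can be shown with Python's standard pseudorandom number generator. This is a generic illustration, not the generator or the initialization that Cytobank uses; the function name and the seed values are arbitrary.

```python
import random


def random_starting_positions(n_points, seed):
    """Draw small random 2D starting positions from a seeded generator."""
    rng = random.Random(seed)  # independent generator with a fixed seed
    return [(rng.gauss(0, 1e-4), rng.gauss(0, 1e-4)) for _ in range(n_points)]


first = random_starting_positions(5, seed=8675309)
repeat = random_starting_positions(5, seed=8675309)
other = random_starting_positions(5, seed=20240117)

print(first == repeat)  # same seed, same "random" numbers
print(first == other)   # different seed, different numbers
```

The same seed reproduces the same sequence of pseudorandom draws, which is what makes a stochastic run repeatable on the same data.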
viSNE works by repeatedly adjusting the placement of events or observations in a two dimensional space in order to best reflect the similarity of these events or observations in the high dimensional space of the dataset. Each step of this process is an iteration of the algorithm. Unlike some other machine learning algorithms that stop iterating when convergence is reached, viSNE’s underlying math doesn’t allow this, and the number of iterations must be set by the user. The default number of iterations in Cytobank is 1000.
For certain situations, this may not be enough iterations for the algorithm to converge on a useful result. The number of iterations needed depends on the data type, sample type, panel, and number of events included in the viSNE run. For fluorescent cytometry data, a good starting point is roughly 1000 iterations for every 100k events. Mass cytometry data typically requires fewer iterations than fluorescent data.
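The rule of thumb above can be written as a small helper. This is a hypothetical sketch for fluorescent data only. It treats the Cytobank default of 1000 iterations as a floor, which is an assumption, and the result is a starting point to be checked against the resulting map rather than a guarantee of convergence.

```python
def suggested_iterations(n_events, per_100k=1000, default_iterations=1000):
    """Starting-point iteration count: ~1000 iterations per 100k events."""
    scaled = n_events * per_100k // 100_000
    # Assumption: never suggest fewer than the default number of iterations.
    return max(default_iterations, scaled)


for n_events in (50_000, 250_000, 1_000_000):
    print(n_events, suggested_iterations(n_events))
```

For example, 250,000 events gives a starting point of 2,500 iterations, while 50,000 events stays at the default of 1,000.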
When the number of iterations used is too low, the groups of cells or populations on the viSNE map will not be resolved:
(example of viSNE maps that did not have enough iterations. Each viSNE map represents a separate dataset and viSNE run from either mass or flow cytometry. Each dataset is colored by a major phenotyping marker such as CD3, CD4, or CD8. Each map shows a relative lack of resolution and spatial localization of events and would be difficult to use for gating or clustering to identify cellular populations. In a viSNE map that has had enough iterations, events with a similar phenotype should be nicely grouped instead of stretched through different parts of the map)
The hallmarks of lack of convergence of a viSNE map, such as those seen above, are sometimes colloquially termed "balling", "filamenting", or "stretching" of events throughout the map that are expected to be grouped together spatially. Other visual cues indicating lack of convergence include viSNE islands that are more pointed instead of having smoother rounded edges. An additional trait that is correlated with viSNE maps that have not had enough iterations is a lower value range of the tSNE channels themselves. This can be seen above with tSNE channel ranges around 20-30 units. In the case of the map with the greatest tSNE range (bottom right), the results are also the most acceptable of the four maps displayed, even though they are still not ideal. Note: the relevance of range and the units correlated with a good viSNE map may change depending on other settings.
In contrast to having too few iterations, using more iterations than required to separate groups of similar events or observations on a viSNE map needlessly increases the run time of the algorithm.
Lowering the number of iterations can be considered in order to get results more quickly in situations where the algorithm is able to achieve adequate resolution with fewer iterations. Note that the relationship between KL divergence in a higher iteration run and successful resolution of a viSNE map is not linear. For example, just because KL divergence converges at iteration 700 in a successful 1000 iteration viSNE run does not mean that running this viSNE again with 700 iterations will result in successful resolution of the map.
Perplexity can be thought of as a rough guess for the number of close neighbors (or similar points) any given event or observation will have. The algorithm uses it as part of calculating the high-dimensional similarity of two points before they are projected into low-dimensional space. The default perplexity setting in Cytobank is 30 and works well for most datasets with 100 or more observations or events.
The maximum perplexity allowed for any viSNE run is based on the number of events included in the run; Cytobank will warn you if your perplexity is set too high for viSNE to run. The perplexity cannot be greater than (number of events - 1) / 3. For example, with 30 observations in a viSNE run, the perplexity cannot be greater than (30 - 1) / 3 ≈ 9.67, so the highest whole-number setting is 9.
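The perplexity limit is simple arithmetic and can be checked with a short Python sketch. The function names are hypothetical, and restricting perplexity to whole numbers is an assumption made for the example.

```python
def max_perplexity(n_events):
    """Largest whole-number perplexity satisfying p <= (n_events - 1) / 3."""
    return (n_events - 1) // 3


def perplexity_is_allowed(perplexity, n_events):
    """True if the perplexity does not exceed (n_events - 1) / 3."""
    return 3 * perplexity <= n_events - 1


print(max_perplexity(30))              # 9, matching the example above
print(perplexity_is_allowed(30, 100))  # True: the default fits 100 events
print(perplexity_is_allowed(30, 60))   # False: too few events for 30
```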
Increasing the value of perplexity generally acts to accentuate spatial separation of events on the viSNE map. Decreasing this value has the opposite effect. The degree to which these differences practically matter in the resulting viSNE map depends on the analysis requirements. If there is not enough structure to capture distinct groups in the dataset (e.g., when the pre-gated population is CD4+ T cells), islands won’t form on the viSNE map even when perplexity is increased to the maximum.
(the effect of changing perplexity on different viSNE runs for the same dataset with all other settings being equal. Perplexity of 30 is usually a good default and identifies population centers with better resolution than a perplexity of 5. Increasing the perplexity separates these populations on the map more dramatically, but with diminishing practical value. KLD is KL divergence. The amount of time the run took to complete is also noted.)
Changing the value of theta is only recommended in the very rare case where your viSNE runs are failing or canceling due to memory limitations with large numbers of events, channels, iterations, or perplexity. In this case, we recommend increasing theta to 0.8 - 1.
Theta controls how similar the Barnes-Hut implementation of tSNE is to the original tSNE algorithm (a lower value means it is more similar). The Barnes-Hut implementation of tSNE was created to allow the algorithm to be used on larger datasets (with more than a few thousand events total) with faster run times, so decreasing theta is generally not recommended. In certain cases, it may be useful to increase theta to decrease algorithm run time, but this may result in groups of events or observations being separated on the map that don’t have meaningful differences in marker expression.
(the influence of changing theta on viSNE results - all other settings default)
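To give some intuition for why a lower theta is closer to the original algorithm, the sketch below shows the textbook Barnes-Hut approximation test: a group of distant points (a "cell") is treated as a single summary point when the cell is small relative to its distance. The exact form of the test used in Cytobank is not stated in this article, so the function, its parameters, and the width/distance ratio are assumptions for illustration only.

```python
import math


def can_summarize_cell(point, cell_center, cell_width, theta):
    """Textbook Barnes-Hut test: approximate the cell if width/distance < theta."""
    distance = math.dist(point, cell_center)
    if distance == 0:
        return False  # the point sits on the cell center; do not approximate
    return cell_width / distance < theta


point = (0.0, 0.0)
cell_center = (4.0, 0.0)  # hypothetical cell 4 units away, 1 unit wide

for theta in (0.0, 0.2, 0.5, 1.0):
    print(theta, can_summarize_cell(point, cell_center, 1.0, theta))
```

With theta set to 0 no cell is ever summarized, so every pair of points is computed exactly. Larger values allow more cells to be approximated, which saves time and memory at the cost of accuracy.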
The amount of time that viSNE takes to execute can vary substantially due to a number of factors. Read our analysis on how viSNE settings affect algorithm run time.