viSNE is an algorithm that provides a two-dimensional categorized map representation of complex multi-parameter data. A small number of configuration steps must be completed before running a viSNE analysis. This article provides an overview of these configuration steps and the concepts to consider during setup, in the following sections:
- Selecting Populations and Samples
- Event Count Sampling
- Selecting Channels
- Scale Settings and viSNE
- Number of Iterations, Perplexity, and Theta
- Effect of viSNE Settings on Algorithm Run Time
The first step in running a viSNE analysis is choosing one or more populations from which event data will be sourced, and which samples (i.e., files) the data will be sourced from per population. The populations available in this list are taken from manual gating done in the experiment on Cytobank, including any tailored gates. If populations are missing from the list or event counts seem incorrect, make sure the most recent version of changes in the gating interface has been applied to the experiment.
When a population is selected, a list of samples will appear. Choose the samples that should be included in the viSNE analysis. Note that samples designated as experimental controls will not appear for selection.
(animation showing the selection of populations and samples)
When population(s) and sample(s) are selected, each population/sample combination will display how many events are available to sample. Most viSNE runs will choose only a single population, such as live singlets or CD45+ cells. From this selection, viSNE will create a map that includes all the child cell types within the data and a visualization of the heterogeneity therein. Thus, population(s) and sample(s) should be chosen based on the scope of the questions being asked. A dump gate might be used to exclude lineages that are not of interest to the researcher. Multiple populations can be chosen when the researcher wants to exclude certain lineages while keeping several others. The actual number of events that get sampled from each population is displayed next to the total available events and depends on the next configuration step for the analysis: event sampling.
(example of multi-population selection for viSNE with each population/file combination showing the number of available events. Most viSNE runs will select a single population only)
Current event count caps are 2M events on Enterprise Cytobank and 1.3M events on Premium Cytobank. Learn more about different Cytobank offerings.
Besides the total number of events to sample, there are two additional settings currently supported for event sampling.
1) Proportional sampling - Each population/file combination present will be sampled according to its relative abundance as a function of total available events. The percentages will match between the available events and the sampled events.
(proportional sampling example - 22,000 events have been selected for viSNE analysis)
2) Equal sampling - Each population/file combination present will be sampled equally. The number of events sampled will be equal to the user-entered sampling count divided by the number of population/file entries.
(equal sampling example - 22,000 events have been selected for viSNE analysis)
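The arithmetic behind the two sampling modes described above can be sketched as follows. This is a minimal illustration only: the file names and available event counts are hypothetical, and the rounding behavior and the handling of entries with fewer available events than requested are assumptions, not a description of how Cytobank implements sampling.

```python
def proportional_sampling(available, total_to_sample):
    """Split total_to_sample across entries in proportion to their available events."""
    total_available = sum(available.values())
    return {
        name: total_to_sample * count // total_available  # integer share, rounded down
        for name, count in available.items()
    }


def equal_sampling(available, total_to_sample):
    """Give every population/file entry the same share of total_to_sample."""
    per_entry = total_to_sample // len(available)
    return {name: per_entry for name in available}


# Hypothetical available event counts per population/file combination.
available = {"file_A": 60_000, "file_B": 30_000, "file_C": 10_000}

proportional = proportional_sampling(available, 22_000)
equal = equal_sampling(available, 22_000)

print(proportional)
print(equal)
```

With proportional sampling, each entry keeps the same percentage of the sampled events as it has of the available events. With equal sampling, every entry receives the requested total divided by the number of entries.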
How many events to sample is guided partly by how many files are present in the analysis and partly by the analysis needs and time constraints of the researcher. Having more files in the analysis generally forces the overall event count higher in order to pull a reasonably representative number of events from each file. In situations where a reasonable number of events can be pulled from each file, the researcher may lower the overall number of events to get a quick result. Conversely, the overall event count may be raised to probe a dataset more deeply for rare cell subsets or to categorize the dataset more completely with the algorithm. viSNE runs with more events take more time and may necessitate an increase in the number of iterations used in the analysis (see discussion on iterations below).
In general, equal sampling is beneficial for downstream visual analysis because it makes differences in the structure and relative abundances of populations between viSNE maps more easily comparable. Differences in starting event count between files confound simple comparisons between their resulting viSNE maps. Equal sampling should very likely be used when all input files have reasonably similar numbers of events, a single population is being chosen, and the overall number of events being sampled results in per-file sampling below the available events in each file.
The need for proportional sampling may arise in circumstances that are generally the opposite of those outlined directly above. If a minority of files in the analysis have very small numbers of events, proportional sampling can allow the maximum number of events to be taken from each file. In viSNE runs that sample from multiple manually gated populations of differing abundance, proportional sampling accomplishes the similar task of retaining proper representation of events from each population, rather than limiting the major population based on the minor population as would happen with equal sampling.
The channels selected for viSNE are what the algorithm uses to create a categorized map of the data. The following major points should be considered when selecting channels for a viSNE analysis:
1) Select the channels that are needed to separate cellular subsets of interest
In normal attempts at phenotyping, would you be able to separate out all T cell populations if you only had access to data for the CD3 channel? Of course not, and no algorithm will be able to either. The set of channels provided to the algorithm is what it will use to categorize the data into subsets. Think about what markers you would use to define populations if you were gating the data and not using an algorithm. This is an excellent starting point.
The markers generally selected for an analysis are CD markers and other non-CD phenotyping markers. Intracellular signaling markers can be used as well, but the nuances of including them in the analysis should be understood first.
In general, think of analysis as happening in two parts:
- Find populations
- Analyze these populations
The channels selected for the algorithm serve the first part, finding populations. This can also be considered a replacement for manual gating. The analysis of the populations that are found is a separate step that leverages the results of the algorithm.
2) Deselect channels that are known to not be useful
An algorithm should be run on a base population of gated events. Channels that have already been used to form the base population that the algorithm is run on are likely not going to be useful. For example, if the algorithm is being run on a subset of CD45+ events, CD45 won't necessarily make a very useful channel to include. All the events are already positive for the marker and nominal differences in CD45 levels in the CD45+ subset aren't of interest to use for discovering heterogeneity in the data. Other examples of channels to exclude from the analysis on this principle are scatter, DNA content, viability, and beads. The time channel should never be used without a sophisticated understanding of why it should be used. Channels with little to no range in signal across the dataset will not influence categorization and can be removed from the analysis as well, if desired.
3) Don't mix linear channels and scaled channels
Examples of linear (unscaled) channels include forward and side scatter in fluorescence cytometry and cell_length or event_length in mass cytometry (CyTOF). Other channels can be linear as well. Scaling should be checked before an analysis to avoid mixing scale types and to make sure that data are being transformed correctly by the scale settings. Scale types shouldn't be mixed because doing so can throw off results: the amplitude of linear values is generally very high, whereas the amplitude of scaled values becomes low as a result of the scaling. Learn more about scale settings.
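The amplitude problem with mixed scale types can be illustrated with a small numeric sketch. The two events, the raw values, and the arcsinh cofactor of 150 below are all hypothetical, and the Euclidean distance is used here only as a simple stand-in for a similarity calculation.

```python
import math


def arcsinh_scale(value, cofactor):
    """Arcsinh transform with a scale argument (cofactor)."""
    return math.asinh(value / cofactor)


def distance(a, b):
    """Euclidean distance between two events."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical raw values: (forward scatter, CD3).
t_cell = (52_000.0, 4_000.0)
non_t_cell = (50_000.0, 10.0)

# Mixed scale types: scatter left linear, CD3 arcsinh-scaled.
mixed_a = (t_cell[0], arcsinh_scale(t_cell[1], 150))
mixed_b = (non_t_cell[0], arcsinh_scale(non_t_cell[1], 150))

fsc_diff = abs(mixed_a[0] - mixed_b[0])  # difference on the linear channel
cd3_diff = abs(mixed_a[1] - mixed_b[1])  # difference on the scaled channel

print(fsc_diff, cd3_diff, distance(mixed_a, mixed_b))
```

In this example, a small relative difference in scatter produces a distance contribution hundreds of times larger than a clear positive versus negative difference in CD3, so the linear channel effectively decides how similar the two events appear.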
Sometimes there may be channels missing on the viSNE setup page but otherwise present in the experiment. This happens when there are panel discrepancies between files in the experiment. The channels available for the viSNE analysis must be common to all files in the experiment. If two files have different panels, only those channels common to both will be available for the run. Although channels unique to certain files cannot be chosen for the run, you will still be able to view and analyze these channels within those files after the run is complete.
The channel differences might just be from differences in the reagent name between files. If this is the case, simply modify the channel information in Cytobank and combine the files into one panel (learn more about panels and channels). Sometimes, however, the files have differences in channel names that can't be edited in Cytobank and the files cannot be combined into a single panel. In this case, consider selective cloning to make an experiment with a subset of the files that all have the same channel information. Another way to approach the problem is to temporarily hide unwanted files by setting them as compensation controls. Files set as compensation controls are not considered when determining the available channels in viSNE, which allows the missing channels to appear again on the viSNE setup page. If necessary, contact Cytobank Support.
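The "common to all files" rule is a set intersection across the panels of the files in the experiment. A minimal sketch, with hypothetical file names and panels:

```python
def common_channels(panels):
    """Return the channels present in every file's panel, sorted by name."""
    channel_sets = [set(channels) for channels in panels.values()]
    return sorted(set.intersection(*channel_sets))


# Hypothetical panels for two files in one experiment.
panels = {
    "file_1.fcs": ["CD3", "CD4", "CD8", "CD19"],
    "file_2.fcs": ["CD3", "CD4", "CD8", "CD56"],
}

selectable = common_channels(panels)
print(selectable)
```

Here CD19 and CD56 would not be selectable for the run because each appears in only one of the two files.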
For scale settings and viSNE, the general rule is that "viSNE sees what you see" based on the scale settings in your experiment. The exception to this rule, however, is that viSNE doesn't pay attention to scale min or max. So if data are piled on the edge of the plot, viSNE doesn't see them this way. It sees the data in the natural continuum uninterrupted by scale min and max. The one scale setting affecting viSNE is the scale argument of arcsinh scales. In general, if data are scaled appropriately for normal analysis involving manual gating, then the scale settings are fine for viSNE analysis.
Note that having log based scales may cause viSNE failures due to attempted calculation of the logarithm of a negative number or zero, which are commonly present in data. For log-like scaling that handles negative numbers and zero, please use arcsinh scales. Learn more about scale settings.
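The difference between the two scale types on negative and zero values can be seen in a short sketch. The raw values and the cofactor of 5 are hypothetical, and the helper names are illustrative only.

```python
import math


def try_log10(value):
    """Return log10(value), or None where the logarithm is undefined."""
    try:
        return math.log10(value)
    except ValueError:
        return None


def arcsinh_scale(value, cofactor=5.0):
    """Arcsinh transform; defined for negative values and zero."""
    return math.asinh(value / cofactor)


# Hypothetical raw values, including a negative value and zero.
raw_values = [-12.0, 0.0, 3.0, 250.0]

log_results = [try_log10(v) for v in raw_values]
arcsinh_results = [arcsinh_scale(v) for v in raw_values]

print(log_results)
print(arcsinh_results)
```

The logarithm has no result for the first two values, whereas arcsinh returns a finite number for every value and behaves similarly to a logarithm for large positive values.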
The ability to set advanced settings for viSNE, including number of iterations, perplexity, and theta, is not available for user accounts with trial subscriptions except in special cases on Enterprise Cytobanks.
The Barnes-Hut implementation of t-SNE, which is what powers viSNE in Cytobank, has a number of settings that can be changed to affect the results of a viSNE analysis. In most cases, we advise unfamiliar users not to change these settings until a practical need to change them emerges. What practical considerations underlie the decision to adjust these settings? See below for discussion in the context of each setting.
viSNE proceeds by a step-wise optimization of the placement of events in a two dimensional space in order to best reflect the similarity of events in the high dimensional space of the dataset. Each step of this process is an iteration of the algorithm. Each iteration progressively minimizes the difference between the high-dimensional similarity between the cellular events and the low-dimensional similarity between events displayed on the viSNE map. The effect is that events that are similar to each other according to their high dimensional attributes are grouped in close proximity on the viSNE map. The default number of iterations in Cytobank is 1000.
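The "difference" that t-SNE minimizes between the high-dimensional and low-dimensional similarities is the Kullback-Leibler (KL) divergence. The following is a minimal sketch of that measure using small made-up distributions rather than real viSNE output; the names p, q_early, and q_late are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL divergence of q from p: sum of p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical high-dimensional similarities for three pairs of events.
p = [0.5, 0.3, 0.2]

# Hypothetical low-dimensional similarities early and late in a run.
q_early = [0.34, 0.33, 0.33]
q_late = [0.48, 0.31, 0.21]

kl_early = kl_divergence(p, q_early)
kl_late = kl_divergence(p, q_late)

print(kl_early, kl_late)
```

A lower value means the map similarities match the high-dimensional similarities more closely, and a value of zero means they match exactly.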
Since a finite number of iterations are used in this optimization process, it's possible that not enough iterations will happen for the algorithm to converge on a useful result. This situation becomes more likely as the number of events and channels present in the viSNE analysis are raised and will generally manifest as a lack of resolution among cellular populations on the viSNE map:
(example of viSNE maps that did not have enough iterations. Each viSNE map represents a separate dataset and viSNE run from either mass or flow cytometry. Each dataset is colored by a major phenotyping marker such as CD3, CD4, or CD8. Each map shows a relative lack of resolution and spatial localization of events and would be difficult to use for gating or clustering to identify cellular populations. In a viSNE map that has had enough iterations, events with a similar phenotype should be nicely grouped instead of stretched through different parts of the map)
The hallmarks of lack of convergence of a viSNE map, such as those seen above, are sometimes colloquially termed "balling", "filamenting", or "stretching" of events throughout the map that are expected to be grouped together spatially. Other visual cues indicating lack of convergence include viSNE islands that are more pointed instead of having smoother rounded edges. An additional trait that is correlated with viSNE maps that have not had enough iterations is a lower value range of the tSNE channels themselves. This can be seen above with tSNE channel ranges around 20-30 units. In the case of the map with the greatest tSNE range (bottom right), the results are also the most acceptable of the four maps displayed, even though they are still not ideal. Note: the relevance of range and the units correlated with a good viSNE map may change depending on other settings.
The opposite of not having enough iterations is using more iterations than are necessary. There is no negative consequence to this situation except for the potential to needlessly increase the run time of the algorithm. Lowering the number of iterations can be considered in order to get results more quickly in situations where the algorithm is able to achieve adequate resolution of the cellular populations with fewer iterations. It should be noted, however, that this strategy of minimizing the number of iterations may not work as expected because the viSNE algorithm changes its behavior depending on the total number of iterations that has been set. For example, consider a 1000 iteration viSNE run that has converged adequately at iteration 700 as evidenced by KL divergence values. Running this viSNE again with 700 iterations instead of 1000 might not achieve the expected result.
Perplexity can be thought of as a rough guess for the number of close neighbors any given cellular event will have. The algorithm uses it as part of calculating the high-dimensional similarity of two points before they are projected into low-dimensional space. The default perplexity setting in Cytobank is 30.
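The "number of close neighbors" interpretation follows from the standard definition of perplexity as 2 raised to the Shannon entropy of a probability distribution. The sketch below uses made-up neighbor probabilities to show that a distribution spread evenly over 30 neighbors has a perplexity of 30, while one concentrated on a single neighbor has a much lower value.

```python
import math


def perplexity(probabilities):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Hypothetical neighbor probabilities for one event.
uniform_30 = [1 / 30] * 30                 # 30 equally close neighbors
concentrated = [0.9] + [0.1 / 29] * 29     # one dominant neighbor

print(perplexity(uniform_30))
print(perplexity(concentrated))
```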
Increasing the value of perplexity generally acts to accentuate spatial separation of events on the viSNE map. Decreasing this value has the opposite effect. The degree to which these differences practically matter in the resulting viSNE map will be up to the researcher. It is likely that staying on or above the default value will result in viSNE maps that are most amenable to downstream categorization by manual gating or clustering of the viSNE map, since accentuated differences can help resolve cellular populations in this context.
(the effect of changing perplexity on different viSNE runs for the same dataset with all other settings being equal. Perplexity of 30 is usually a good default and identifies population centers with better resolution than a perplexity of 5. Increasing the perplexity separates these populations on the map more dramatically, but with diminishing practical value. KLD is KL divergence. The amount of time the run took to complete is also noted.)
The theta parameter can be increased or decreased in order to tune the balance of speed and accuracy in the viSNE run as compared to the original tSNE algorithm. A higher theta results in a faster run with coarser approximations of the cellular populations on the viSNE map (as compared to the map that would result from the original tSNE). In contrast, a lower theta results in a slower run with more accurate approximations of the cellular populations on the viSNE map. Theta = 0 corresponds to the exact version of the original tSNE algorithm, which can be impossible to run on datasets beyond the low thousands of events for performance reasons. In the figure below, the populations shown on the map at theta = 0.01 can be considered the ‘gold standard’ of what the original tSNE algorithm would produce. It can be seen that as theta increases from 0 to 1, the population centers remain fairly consistent, but it becomes more clear that they are approximations by theta = 1. At theta = 1.5 (above the Cytobank limits for theta), the populations on the viSNE map no longer look like the ‘gold standard’ theta = 0.01 populations. The default value for theta in Cytobank is 0.5.
(the influence of changing theta on viSNE results - all other settings default)
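As a rough illustration of how theta trades speed for accuracy, Barnes-Hut approximations in general decide whether a distant group of points can be treated as a single summary point by comparing the size of the group to its distance. The sketch below assumes that generic form of the test. The function name and the numbers are hypothetical, and the exact criterion inside the Cytobank implementation is not described in this article.

```python
def can_summarize(cell_width, distance_to_cell, theta):
    """Generic Barnes-Hut style test: summarize a group if it looks small from afar."""
    return cell_width / distance_to_cell < theta


# Hypothetical group of events: 1 unit wide, 4 units away on the map.
cell_width = 1.0
distance_to_cell = 4.0

decisions = {
    theta: can_summarize(cell_width, distance_to_cell, theta)
    for theta in (0.0, 0.1, 0.5, 1.0)
}

print(decisions)
```

With theta = 0 the test never passes, so every event is handled individually, which matches the exact calculation. Larger theta values let more groups be summarized, which is faster but coarser.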
The amount of time that viSNE takes to execute can vary substantially due to a number of factors. Read our analysis on how viSNE settings affect algorithm run time.