
How to configure and run a dimensionality reduction analysis

Background

Dimensionality reduction (DR) algorithms provide a two-dimensional map representation of complex multi-parameter data, facilitating a more comprehensive visualization of the data. There are a small number of configuration steps that must be taken before running any DR analysis. In this article, you will find an overview of those configurations and concepts to consider during setup. They are divided into basic settings that must be configured prior to any analysis, advanced settings that can be changed to fine-tune results, and additional data transformations. Click the links below to jump to any section in this article. 

Basic Settings 

Advanced Settings 

Transformations 

Effect of Advanced Settings on Algorithm Run Time 

Basic Settings

You must set up the essential settings prior to running a DR algorithm. 

Selecting Populations and Samples

The first step in running a DR analysis is choosing one population (or more if you are running a viSNE analysis) from which the events will be sourced and which samples (i.e., files) will be used for the analysis. The populations available in this list are taken from the ones included in the Cytobank experiment, as well as any tailored gates. If populations are missing from the list or event counts seem incorrect, make sure the most recent version of changes in the gating editor has been applied to the experiment. 

On a viSNE analysis, a list of samples will appear when a population is selected. On a tSNE-CUDA, UMAP or opt-SNE analysis, the list of samples will appear under the FCS file list. Choose the samples that should be included in the dimensionality reduction analysis. Note that samples designated as experimental controls will not appear for selection. 

 

(Animation showing the selection of populations and samples for viSNE)

 

(Animation showing the selection of populations and samples for tSNE-CUDA, UMAP or opt-SNE)

To run a DR analysis you should select a population and then select all files that you want to compare. At a minimum, we recommend pre-gating to live singlets. However, you may want to pre-gate further (e.g., to remove a dump channel or to get to CD4+ T cells) in order to focus on the cells your panel is designed to interrogate, or to be able to combine files from more samples in your resulting DR map (while still falling within the event limit). From the selected pre-gated population, the DR algorithms will create a visualization of all the child cell types within the data and the heterogeneity among them. viSNE allows you to select multiple pre-gated populations. This may be useful if you want to exclude specific lineages; note, however, that each starting population will comprise only a portion of the viSNE map. You may want to generate an overlay dot plot to visualize your results. 

Once you have selected populations and files to include in your analysis, the number of events that will be included (sampled) from each population/file is displayed next to the number of available events for that population, along with the total number of events per sample (when running tSNE-CUDA, UMAP or opt-SNE). These numbers depend on the next configuration step: event sampling. 


(On the left: example of multi-population selection for viSNE, with each population/file combination showing the number of available events. On the right: example of population and file selection for tSNE-CUDA, UMAP or opt-SNE, showing the number of available events) 

Event Sampling

With tSNE-CUDA or UMAP in the Cytobank platform, you can include up to 10M events; with opt-SNE, up to 3M events; and with viSNE, up to 2M events on Enterprise Cytobank or 1.3M events on Premium Cytobank. Learn more about different Cytobank offerings. The maximum number of events will depend directly on the number of channels selected. In the table below you can confirm the maximum number of channels for a given event range. 


(Table showing the channel limits for a given event range on the four available DR algorithms; * The maximum number of channels for viSNE has not been defined. Check this analysis on how dimensionality reduction settings affect algorithm run time)

You can adjust the total number of events that will be included in your DR algorithm run in the box next to Desired Total Events on the Event Sampling setting. If your selected populations and files include more events than the Desired Total Events, a subset of them will be randomly selected per file and population to be included in the DR resulting map. In this case, there are two options for controlling the way that the events will be chosen: 

1) Proportional sampling - Each population/file combination present will be sampled according to its relative abundance as a function of total available events. The percentages will match between the available events and the sampled events. 

 

 blobid2.png

(Proportional sampling example - 22,000 events have been selected for viSNE analysis on the top and for tSNE-CUDA, UMAP or opt-SNE on the bottom)

2) Equal sampling - Each population/file will be sampled equally. The total number of events sampled will be up to the user-entered Desired Events per File (or Events per Population) multiplied by the number of files. If any selected population/file has fewer than the specified number of events, all populations/files will be equally sampled down to the smallest population/file.


(Equal sampling example - 22,000 events have been selected for viSNE analysis on the top and for tSNE-CUDA, UMAP or opt-SNE on the bottom) 
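The two sampling rules can be sketched in a few lines of Python. This is an illustrative sketch of the arithmetic described above, not the platform's actual implementation; the file names and the rounding behavior are assumptions for illustration.

```python
def proportional_sampling(available, desired_total):
    """Sample each population/file combination according to its share of
    the total available events (sketch of the proportional rule above;
    rounding to whole events is an assumption)."""
    total = sum(available.values())
    return {name: round(desired_total * n / total)
            for name, n in available.items()}

def equal_sampling(available, desired_per_file):
    """Sample every population/file equally; if any file has fewer events
    than requested, all files are sampled down to the smallest file."""
    per_file = min(desired_per_file, min(available.values()))
    return {name: per_file for name in available}
```

For example, with two hypothetical files holding 8,000 and 2,000 available events and 1,000 desired total events, proportional sampling takes 800 and 200 events respectively, preserving the 80/20 split; equal sampling with a small third file would instead pull every file down to that file's event count.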

Discussion on Choosing Proportional or Equal Sampling

How many events to sample is guided by how many files are present in the analysis, the analysis goals, and the time constraints of the researcher. If you have more samples/files to compare and/or are studying rare populations, you will need a higher overall event count to be able to include enough events from each file to see the populations you are interested in on the DR algorithm map. 

Whether to use proportional or equal sampling will depend on the availability of events within the dataset, the relative abundance of each file, and the goals of the analysis. 

Equal sampling is a good choice for most applications or if you aren’t sure. Using equal sampling makes the visualization of differences in the structure and relative abundance of populations across files easy. It’s also required by CITRUS when comparing population abundances, so if this DR map will be used to visualize those results, it makes sense to use equal sampling here. It’s also perfectly acceptable to use equal sampling if your intention is to compare functional marker expression in one or more given populations across samples. The one case in which equal sampling may not be ideal is when you have one or more files that have very low event counts compared to most of the other files, because these files would then limit the number of events taken from all the files. In this case, you can either exclude these low event count files or use proportional sampling. 

Proportional sampling should be used in certain special circumstances. For example, use proportional sampling when the individual files vary widely in the number of available events and you don’t want to exclude the ones with low event counts, or when you are selecting multiple populations for inclusion in one viSNE map. It is acceptable to use proportional sampling and then compare population-specific functional marker expression using algorithms like CITRUS, and doing so can help include samples in these analyses even when the samples have a wide range of event counts. 

Selecting Channels

The channels (or markers) selected for a DR analysis are what the algorithms use to identify events or observations that are similar and put them close together in the resulting map of the data. The following major points should be considered when selecting channels or markers for a DR analysis: 

1) Select the channels that may contribute to separating cell subsets or groups of samples 

At a minimum this should include any phenotyping channels you would typically require for gating the cell subsets of interest or channels that may contribute to separate groups of samples in your bulk data. This will allow you to use the DR step in a pipeline that replaces manual gating and to compare your manually gated populations easily. Advanced workflows may also include functional marker channels, but this may make interpretation of results more challenging. 

2) Exclude channels that are used for data pre-processing 

Data pre-processing includes pre-gating steps that are done in the Cytobank platform and steps such as data normalization or QC that are done upstream. Channels that have already been used in the analysis pipeline should generally be excluded in the DR algorithm. Examples of these channels include: 

  • Scatter 
  • DNA content 
  • Viability 
  • Bead 
  • Time 
  • Housekeeping genes 
  • Exogenous control genes 

If your pre-gated population is something like CD45+ cells, channels like CD45 used to select this population should also be excluded. 

3) Exclude annotation channels 

Annotation channels contain information on variables that you want to correlate with algorithm results or use for data stratification or visualization. This might include cluster channels resulting from a clustering algorithm like SPADE or FlowSOM, or sample-level variables like clinical outcome, treatment, or demographic variables. 

4) Consider whether to mix channels with different scales and transformations 

Scaling in general should be checked before an analysis to make sure that data are being transformed correctly by the scale settings. Mixing channels that have vastly different ranges after scaling can impact the results of a DR algorithm and any downstream algorithms used in the analysis, and should be done with care. In general, channels with very high values (such as typical linear channels in cytometry data) or a larger dynamic range will have more influence on your results than channels with lower values. Channels with little to no range in signal across the dataset may not influence viSNE results and can be removed from the analysis as well, if desired. If you are running tSNE-CUDA, UMAP or opt-SNE, consider normalizing the scales (see discussion below) when including those channels. Learn more about scale settings. 

Channels not appearing for selection during DR setup? 

Sometimes there may be channels missing on the DR setup page but otherwise present in the files. This happens when there are panel discrepancies between files selected for the analysis. The channels available for the DR analysis must be common to all files selected for this analysis. If two files have different panels, only those channels common to both will be available for the run. If you have more than one panel in your experiment, selecting only files that share the same panel will allow you to select all channels from that panel. If the panel discrepancies are due to the differences in the reagent name between files that otherwise share the same panel, simply modify the long channel (i.e., reagent name) in the Cytobank platform and combine the files into one panel (learn more about panels and channels). 

Sometimes, however, the files have differences in the short channel names that can't be edited in Cytobank software and the files cannot be combined into a single panel. If you have flow cytometry data and have access to Kaluza Analysis software, it might be easier to edit channels in Kaluza and export new FCS files using the Kaluza Cytobank Plugin. For more information visit www.kaluzasoftware.com. Or in the case of Mass Cytometry data with more than 64 channels, you can consider using FCS file editing R tools like Premessa to edit the names. Please contact Cytobank Support if you need further assistance. 

 

Advanced Settings 

For new users, we advise using Cytobank’s default settings until a practical need to change them emerges. See below for a discussion of the factors that may require changing each setting on each of the DR algorithms. Since some of the advanced settings are present in more than one DR algorithm, we suggest you jump to the relevant section for your analysis. 

tSNE-CUDA 

tSNE-CUDA is a GPU-accelerated implementation of t-SNE that allows for fast DR. In the Cytobank platform implementation, the advanced settings can be adjusted automatically depending on the basic settings that you selected. Together, these two features (speed and no need to adjust advanced settings) mean tSNE-CUDA will quickly generate nicely resolved tSNE maps. If you want to configure the advanced settings, tSNE-CUDA offers the option to adjust iterations, perplexity, learning rate and early exaggeration. 


(tSNE-CUDA advanced settings) 

Iterations 

tSNE-CUDA (like the rest of the tSNE family) works by repeatedly adjusting the placement of events or observations in a two-dimensional space in order to best reflect the similarity of these events or observations in the high-dimensional space of the dataset. Each step of this process is an iteration of the algorithm. Unlike some other machine learning algorithms that stop iterating when convergence is reached, tSNE-CUDA will perform the number of iterations set by the user. The number of iterations needed depends on the data type, sample type, panel, and number of events included in the tSNE-CUDA run. If the Automatic box is checked, the number of iterations will be set by dividing the number of events by 1,500 (with a minimum of 750). When the number of iterations used is too low, the groups of cells or populations on the tSNE-CUDA map will not be resolved: 


(Example of tSNE-CUDA maps with different numbers of iterations. Each map represents a separate run colored by CD4 and CD8. The first two maps show a relative lack of resolution and spatial localization of events and would be difficult to use for gating or clustering to identify cellular populations. In a tSNE-CUDA map that has had enough iterations (the last two maps), events with a similar phenotype are nicely grouped instead of stretched through different parts of the map) 

The hallmarks of lack of convergence of a tSNE-CUDA map, such as those seen above, are sometimes colloquially termed "filamenting" or "stretching" of events throughout the map that are expected to be grouped together spatially. Other visual cues indicating lack of convergence include tSNE-CUDA islands that are more pointed instead of having smoother rounded edges. In contrast to having too few iterations, using more iterations than required to separate groups of similar events or observations on a tSNE-CUDA map needlessly increases the run time of the algorithm. 

To manually adjust the number of iterations, just click on the number and type the number of iterations you prefer. 
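The automatic rule described above (events divided by 1,500, with a floor of 750) can be sketched as follows. This is an illustrative sketch of the stated rule, not Cytobank's actual code; integer division is an assumption about the rounding.

```python
def auto_iterations(n_events, divisor=1500, floor=750):
    """Automatic iteration count for tSNE-CUDA as described above:
    the number of events divided by 1,500, but never below 750.
    Integer division is an assumption; the platform may round differently."""
    return max(floor, n_events // divisor)
```

Under this sketch, a 3M-event run gets 2,000 iterations, while any run below roughly 1.1M events falls back to the 750 floor.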

Perplexity 

Perplexity can be thought of as a rough guess for the number of close neighbors (or similar points) any given event or observation will have. The algorithm uses it as part of calculating the high-dimensional similarity of two points before they are projected into low-dimensional space. The default perplexity setting in the Cytobank platform is 30 and works well for most single-cell datasets. Larger datasets may require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. 


(Effect of changing perplexity on different tSNE-CUDA runs for the same dataset with all other settings being equal. A perplexity of 30 is usually a good default and identifies population centers with better resolution than a perplexity of 5. Increasing the perplexity separates more populations on the map) 

 

Learning rate  

Machine learning algorithms will learn from the data at a specific speed represented by the learning rate. If the Automatic box is checked, the learning rate will be set by dividing the number of events by the early exaggeration (see below), with a minimum of 200. Although we recommend leaving the default automatic learning rate, you can set it manually just by typing the desired number. Other tSNE implementations will use a default learning rate of 200; increasing this value may help obtain a better resolved map for some datasets. If the learning rate is set either too low or too high, the specific territories for the different cell types won’t be properly separated. The learning rate is also closely related to the number of iterations. If you decrease the learning rate you may consider increasing the number of iterations to obtain a tSNE-CUDA map with good cell type resolution. 

 blobid7.png

(Examples of a low (10, 5K), automatic (16666), and high (200K, 10M) learning rate. Each tSNE-CUDA map represents a different run for the same dataset with all other settings being equal. An automatic learning rate is usually recommended. In this example dataset, when the learning rate was set too low, specific territories were separated for the different cell types but not isolated in discrete islands. When the learning rate was set too high, all events were positioned at the same distance from each other, so cell types were neither separated into specific territories nor isolated into discrete islands) 
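The automatic learning rate rule (events divided by the early exaggeration factor, with a floor of 200) can be sketched as below. The default early exaggeration of 12 is an assumption borrowed from common t-SNE implementations, not a value stated in this article.

```python
def auto_learning_rate(n_events, early_exaggeration=12.0, floor=200.0):
    """Automatic learning rate as described above: the number of events
    divided by the early exaggeration factor, never below 200.
    The early exaggeration default of 12 is an assumption, not a value
    taken from this article."""
    return max(floor, n_events / early_exaggeration)
```

As a sanity check, a 200,000-event run with an early exaggeration of 12 gives a learning rate of about 16,667, which is consistent with the automatic value shown in the example maps above.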

Early exaggeration 

During the early exaggeration phase, tSNE-CUDA weights the global structure over the local structure. The exaggeration factor increases the movement of the events in the space, allowing tSNE-CUDA to quickly learn the main pattern of the data and therefore letting the events find their nearest neighbors more easily. As a result, the early exaggeration period controls how tight natural clusters from the original high-dimensional space are in the embedded space and how much space will be between them. When the early exaggeration phase ends, tSNE-CUDA will focus on learning the local structure. Increasing the early exaggeration factor may emphasize the global structure. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. 

UMAP 

UMAP, at its core, works very similarly to tSNE; both algorithms work towards reducing the dimensions while retaining the information of the high-dimensional data. To simplify, UMAP learns about the structure of your data in the high-dimensional space by constructing a graph representation, then it optimizes a low-dimensional space preserving the essential topological structure of the high-dimensional representation. Consequently, the UMAP algorithm may allow a better preservation of the data's global structure compared to the tSNE algorithms. There are three advanced settings that you can configure on a UMAP run. 


(UMAP advanced settings) 

Number of neighbors 

The most important parameter is the number of neighbors, which is the number of nearest neighbors used to construct the initial high-dimensional graph. In this way, it controls how UMAP balances local versus global structure in the resulting map. Low values of the number of neighbors will generate maps that better preserve the local structure, while high values will produce maps representing the global structure, potentially losing fine details of the structure. The default value is 15, but values can range from 2 to 100. 


(Examples of a low, default (15), and high number of neighbors. Each UMAP map represents a different run for the same dataset with all other settings being equal) 

Minimum distance 

The minimum distance controls the distances between the events on the low-dimensional map. Low values will generate more compact embeddings, while high values will prevent UMAP from packing events, resulting in more dispersed embeddings that instead favor preservation of the broad topological structure. The default value is 0.01, but values can range from 0 to 0.99. 

(Examples of a low, default (0.01), and high minimum distance. Each UMAP map represents a different run for the same dataset with all other settings being equal)

Collapse outliers 

UMAP offers the possibility to collapse outlier events. Dimension values are considered significant outliers based on a Z-score higher than 3. By default, this option will be selected and the outliers will be automatically collapsed to be equal to the minimum or maximum value. You should try unchecking the box if you observe that most of your data appear squeezed into a small region. 
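One plausible reading of this behavior, sketched in plain Python: values whose Z-score exceeds 3 are clipped to the minimum or maximum of the remaining values. This is an interpretation for illustration, not Cytobank's exact rule.

```python
from statistics import mean, pstdev

def collapse_outliers(values, z_thresh=3.0):
    """Clip values whose Z-score exceeds the threshold to the min/max of
    the non-outlier values. A sketch of one plausible interpretation of
    the collapse-outliers option described above, not the platform's
    exact implementation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)  # no spread, nothing to collapse
    inliers = [v for v in values if abs(v - mu) / sigma <= z_thresh]
    lo, hi = min(inliers), max(inliers)
    return [min(max(v, lo), hi) for v in values]
```

The practical effect is the one described in the text: without collapsing, a handful of extreme values can stretch the axis range so that the bulk of the data appears squeezed into a small region of the map.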

opt-SNE 

opt-SNE can greatly decrease the number of iterations required to obtain visualizations of large cytometry datasets with superior quality and eliminates the need for trial-and-error runs intended to empirically find the most favorable selection of tSNE parameters, potentially saving many hours of computation time per research project. opt-SNE offers the option to adjust perplexity, learning rate, early exaggeration, seed, and maximum number of iterations. 


(opt-SNE advanced settings) 

Perplexity 

Perplexity can be thought of as a rough guess for the number of close neighbors (or similar points) any given event or observation will have. The algorithm uses it as part of calculating the high-dimensional similarity of two points before they are projected into low-dimensional space. The default perplexity setting in Cytobank is 30 and works well for most datasets. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. 


(Effect of changing perplexity on different opt-SNE runs for the same dataset with all other settings being equal. A perplexity of 30 is usually a good default and identifies population centers with better resolution than a perplexity of 5. Increasing the perplexity separates the populations on the map more) 

 

Learning rate 

Machine learning algorithms will learn from the data at a specific speed represented by the learning rate. If the Automatic box is checked, the learning rate will be set by dividing the number of events by the early exaggeration (see below), with a minimum of 200. Although we recommend leaving the default automatic learning rate, you can set it manually just by typing the desired number. Other tSNE implementations will use a default learning rate of 200; increasing this value may help obtain a better resolved map for some datasets. If the learning rate is set too low or too high, the specific territories for the different cell types won’t be properly separated. 


(Examples of a low (10, 800), automatic (16666) and high (1M, 10M) learning rate. Each opt-SNE map represents a different run for the same dataset with all other settings being equal. An automatic learning rate is usually recommended) 

Early exaggeration 

During the early exaggeration phase, opt-SNE weights the global structure over the local structure. The exaggeration factor increases the movement of the events in the space, allowing opt-SNE to quickly learn the main pattern of the data and therefore letting events find their nearest neighbors more easily. As a result, the early exaggeration period controls how tight natural clusters from the original high-dimensional space are in the embedded space, and how much space will be between them. When the early exaggeration phase ends, opt-SNE will focus on learning the local structure. Increasing the early exaggeration factor may emphasize the global structure. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. 

 Seed 

The opt-SNE algorithm is stochastic, meaning that it uses random steps to help it do its job. Since opt-SNE is run on a computer, the seed can be set using a pseudorandom number generator in the computer, allowing you to repeat the same opt-SNE run obtaining the exact same map as long as you use the exact same data and settings. This may be useful for validating workflows that include opt-SNE or repeating the same opt-SNE in a different Cytobank experiment (if you start with the exact same data and the exact same settings). Additionally, you may be able to test small changes to an opt-SNE run and use a set seed to help keep the islands in relatively similar locations. If you don’t do anything to set the opt-SNE seed, Cytobank software will set one for you; after your opt-SNE has finished, you can see the seed that was selected by viewing the settings for the opt-SNE run. Then, use this same seed for any future run on the same data where you want to reproduce the results.  

There are two important ground rules to keep in mind for setting the seed:  

  1. You should not use the same seed for every different analysis you run. If you do, this limits the ability of the algorithm to be random, which is part of why the algorithm works well.
  2. If you pick your own value for the seed, pick an arbitrary number - something you get from a dollar bill or receipt in your wallet often works well.

Max iterations 

opt-SNE (like the rest of the tSNE family) works by repeatedly adjusting the placement of events or observations in a two-dimensional space in order to best reflect the similarity of these events or observations in the high-dimensional space of the dataset. Each step of this process is an iteration of the algorithm. Unlike some other machine learning algorithms, opt-SNE will automatically stop when convergence is reached. Thus, the number shown in this setting is the maximum number of iterations that the algorithm is allowed to perform (if needed). If the algorithm stopped at the max iteration, you may want to increase it to make sure the algorithm has enough iterations to optimize the map. 

viSNE 

The ability to set advanced settings for viSNE, including number of iterations, perplexity, and theta, is not available for user accounts with trial subscriptions except in special cases on Enterprise Cytobanks. The Barnes-Hut implementation of t-SNE, which underlies viSNE in Cytobank, has a number of settings that can be changed to affect the results of a viSNE analysis. 


(Advanced settings that can be configured in a viSNE run) 

Seed 

The viSNE algorithm is stochastic, meaning that it uses random steps to help it do its job. Since viSNE is run on a computer, the seed can be set using a computer-generated pseudorandom number, allowing you to repeat the same viSNE run on the exact same data and get the exact same results. This may be useful for validating workflows that include viSNE or repeating the same viSNE in a different Cytobank experiment. Additionally, you may be able to test small changes to a viSNE run and use a set seed to help keep the islands in relatively similar locations. If you don’t do anything to set the viSNE seed, Cytobank will set one for you; after your viSNE is finished, you can see the seed that was selected by viewing the settings that were used for the viSNE run. Then, use this same seed for any future run on the same data where you want to reproduce the results. There are a few important ground rules to keep in mind for setting the seed: 

  1. You should not use the same seed for every different analysis you run. If you do, this limits the ability of the algorithm to be random, which is part of why the algorithm works well.
  2. If you pick your own value for the seed, pick an arbitrary number - something you get from a dollar bill or receipt in your wallet often works well.

Number of Iterations 

viSNE works by repeatedly adjusting the placement of events or observations in a two-dimensional space in order to best reflect the similarity of these events or observations in the high-dimensional space of the dataset. Each step of this process is an iteration of the algorithm. Unlike some other machine learning algorithms that stop iterating when convergence is reached, viSNE’s underlying math doesn’t allow this, and the number of iterations must be set by the user. The default number of iterations in Cytobank is 1000. 

For certain situations, this may not be enough iterations for the algorithm to converge on a useful result. The number of iterations needed depends on the data type, sample type, panel, and number of events included in the viSNE run. For fluorescent cytometry data, a good starting point is roughly 1000 iterations for every 100k events. Mass cytometry data typically requires fewer iterations than fluorescent data. 
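The rule of thumb above can be written as a small helper. This is only a sketch of the stated heuristic; rounding up to the next 100k block, and using the platform default of 1,000 as a floor, are assumptions for illustration.

```python
import math

def suggested_visne_iterations(n_events, per_100k=1000, floor=1000):
    """Rule-of-thumb starting point for fluorescent cytometry data:
    roughly 1,000 iterations per 100k events, with the Cytobank default
    of 1,000 iterations as a floor. A heuristic sketch, not an exact
    formula; mass cytometry data typically needs fewer iterations."""
    return max(floor, math.ceil(n_events / 100_000) * per_100k)
```

For example, a 250,000-event fluorescent run would start from 3,000 iterations under this sketch, while small runs stay at the 1,000-iteration default.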

When the number of iterations used is too low, the groups of cells or populations on the viSNE map will not be resolved: 


 

(Example of viSNE maps that did not have enough iterations. Each viSNE map represents a separate dataset and viSNE run from either mass or flow cytometry. Each dataset is colored by a major phenotyping marker such as CD3, CD4, or CD8. Each map shows a relative lack of resolution and spatial localization of events and would be difficult to use for gating or clustering to identify cellular populations. In a viSNE map that has had enough iterations, events with a similar phenotype should be nicely grouped instead of stretched through different parts of the map) 

The hallmarks of lack of convergence of a viSNE map, such as those seen above, are sometimes colloquially termed "balling," "filamenting," or "stretching" of events throughout the map that are expected to be grouped together spatially. Other visual cues indicating lack of convergence include viSNE islands that are more pointed instead of having smoother rounded edges. An additional trait that is correlated with viSNE maps that have not had enough iterations is a lower value range of the tSNE channels themselves. This can be seen above with tSNE channel ranges around 20-30 units. In the case of the map with the greatest tSNE range (bottom right), the results are also the most acceptable of the four maps displayed, even though they are still not ideal. Note: the relevance of range and the units correlated with a good viSNE map may change depending on other settings. 

In contrast to having too few iterations, using more iterations than required to separate groups of similar events or observations on a viSNE map needlessly increases the run time of the algorithm. 

Lowering the number of iterations can be considered in order to get results more quickly in situations where the algorithm is able to achieve adequate resolution with fewer iterations. Note that the relationship between KL divergence in a higher iteration run and successful resolution of a viSNE map is not linear. For example, just because KL divergence converges at iteration 700 in a successful 1000 iteration viSNE run does not mean that running this viSNE again with 700 iterations will result in successful resolution of the map. 

Perplexity 

Perplexity can be thought of as a rough guess for the number of close neighbors (or similar points) any given event or observation will have. The algorithm uses it as part of calculating the high-dimensional similarity of two points before they are projected into low-dimensional space. The default perplexity setting in Cytobank is 30 and works well for most datasets with 100 or more observations or events. 

The maximum perplexity allowed for any viSNE run is based on the number of events included in the run; Cytobank will warn you if your perplexity is set too high for viSNE to run. The perplexity cannot be greater than the number of events minus 1, divided by 3. For example, with 30 events in a viSNE run, the perplexity cannot be greater than (30 - 1) / 3 ≈ 9.67, so it should be set to 9 at most. 
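That bound can be written as a one-line check; rounding down follows from perplexity being a whole-number setting here, which is an assumption for illustration.

```python
def max_visne_perplexity(n_events):
    """Largest allowed perplexity for a viSNE run, per the rule above:
    no greater than (events - 1) / 3, rounded down to a whole number."""
    return (n_events - 1) // 3
```

With 30 events this gives 9, matching the worked example above; with 100 events the bound is 33, so the default perplexity of 30 is already usable.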

Increasing the value of perplexity generally acts to accentuate spatial separation of events on the viSNE map. Decreasing this value has the opposite effect. The degree to which these differences practically matter in the resulting viSNE map depends on the analysis requirements. If there is not enough structure to capture distinct groups in the dataset (e.g., when the pre-gated population is CD4+ T cells), islands won’t form on the viSNE map even when perplexity is increased to the maximum. 

 blobid16.png

(The effect of changing perplexity on different viSNE runs of the same dataset, all other settings being equal. A perplexity of 30 is usually a good default and identifies population centers with better resolution than a perplexity of 5. Increasing the perplexity separates these populations more dramatically on the map, but with diminishing practical value. KLD is KL divergence. The time each run took to complete is also noted.) 

Theta 

Changing the value of theta is recommended only in the very rare case where your viSNE runs are failing or canceling due to memory limitations with large numbers of events, channels, iterations, or perplexity. In this case, we recommend increasing theta to 0.8 - 1. 

Theta controls how closely the Barnes-Hut implementation of tSNE approximates the original tSNE algorithm (a lower value means it is more similar). The Barnes-Hut implementation of tSNE was created to allow the algorithm to be used on larger datasets (more than a few thousand events total) with faster run times, so decreasing theta below the default is generally not recommended. In certain cases, it may be useful to increase theta to shorten the algorithm run time, but this may separate groups of events or observations on the map that have no meaningful differences in marker expression. 
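The trade-off theta controls can be illustrated with the standard Barnes-Hut acceptance test: a distant group of points is summarized as a single point whenever its size relative to its distance falls below theta. This is a conceptual sketch of the general Barnes-Hut criterion, not Cytobank's implementation:

```python
def summarize_cell(cell_diameter: float, distance: float, theta: float) -> bool:
    """Barnes-Hut criterion: approximate a far-away group of points by its
    center of mass when diameter / distance < theta. A larger theta accepts
    coarser approximations, so the run is faster but less exact."""
    return distance > 0 and cell_diameter / distance < theta

# theta = 0 never approximates (exact tSNE); theta near 1 approximates often
print(summarize_cell(5.0, 20.0, theta=0.5))  # True: 0.25 < 0.5, use summary point
print(summarize_cell(5.0, 20.0, theta=0.2))  # False: computed exactly instead
```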

 blobid17.png

(The influence of changing theta on viSNE results - all other settings default) 

Transformations 

Data scaling and Normalize scales 

It is essential to set up the scales correctly before the algorithm run. Scaling for DR algorithms is based on the scale settings in your experiment. Scale transformations and scale arguments will affect the results, whereas scale min and max will not (i.e. viSNE doesn’t care if data are piled on the edge of a 2D plot). In general, if data are scaled appropriately for normal analysis involving manual gating, then the scale settings are fine for a DR analysis. Note that having log-based scales may cause algorithm failures due to attempted calculation of the logarithm of a negative number or zero. For log-like scaling that handles negative numbers and zero, please use arcsinh scales.   
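As a hedged illustration of why arcsinh is recommended over log-based scales, the transform below is log-like for large magnitudes but remains defined at zero and for negative values. The cofactor of 150 is a typical choice for fluorescence data, used here as an assumption, not a Cytobank default:

```python
import math

def arcsinh_scale(value: float, cofactor: float = 150.0) -> float:
    """Arcsinh transform: approximately linear near zero, log-like for
    large magnitudes, and valid for zero and negative inputs,
    where a log transform would fail."""
    return math.asinh(value / cofactor)

print(arcsinh_scale(0.0))     # 0.0 -- log10(0) would raise an error
print(arcsinh_scale(-300.0))  # a finite negative value, no domain error
```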

tSNE-CUDA, UMAP and opt-SNE offer channel normalization. When the Normalize scales option is selected, each channel is normalized such that it has a mean value of zero and a standard deviation of 1. This is done by first concatenating all files within the analysis, then for each event value per channel, subtracting the mean and dividing by the standard deviation of the channel. Normalizing scales can be a useful strategy when channels have different dynamic ranges, as is often the case in fluorescence flow cytometry.  
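The Normalize scales operation described above amounts to a per-channel z-score across the concatenated events. A minimal sketch in plain Python (the function name is illustrative):

```python
def normalize_channel(values: list[float]) -> list[float]:
    """Z-score one channel across all concatenated events: subtract the
    channel mean, then divide by the channel standard deviation, so the
    channel ends up with mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Channels with very different dynamic ranges become directly comparable
normalized = normalize_channel([10.0, 20.0, 30.0, 40.0])
```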

Compensation 

Compensation should be applied to fluorescent data before running a DR algorithm, as it would be for any other analysis of these data, by selecting the appropriate compensation. The Cytobank platform uses the experiment-wide compensation to govern how compensation is applied to Gates, Illustrations, and Advanced analyses. For files uploaded by DROP, or FCS files that have no internal compensation matrix, you can leave the default option (file-internal compensation) and no compensation will be applied. 

For a DR run, the experiment-wide compensation will be used. If you wish to use another compensation, select it in the Compensation editor. Be mindful when doing so, since changing the experiment-wide compensation also affects previously created gates and illustrations. Please see How the experiment-wide compensation works in Cytobank for further information. 

Effect of Advanced Settings on Algorithm Run Time 

The amount of time each DR algorithm takes to execute can vary substantially due to a number of factors. Read our analysis of how dimensionality reduction settings affect algorithm run time. 

  

For Research Use Only. Not for use in diagnostic procedures. 

© 2021 Beckman Coulter, Inc. All rights reserved. Beckman Coulter, the stylized logo and the Beckman Coulter product and service names mentioned herein are trademarks or registered trademarks of Beckman Coulter, Inc. in the United States and other countries. All other trademarks are the property of their respective owners. 

 


