- Introduction
- Train a new model with the automatic gating algorithm
- Select populations
- Select FCS files
- Event sampling
- Setting a seed and advanced options
- Data scaling and compensation
Introduction
Manual gating is often cited as a major contributor to variability in cytometry assays (Maecker et al. 2005), and it is also very time-consuming. Machine learning-assisted analysis of cytometry data has proven advantageous; however, there is a lack of methods that allow users to automate their own user-defined gating strategy for their specific marker panel in a hypothesis-driven setting (Hu et al. 2022).
Here we introduce the automatic gating approach in the Cytobank platform, which allows you to train a model on your manual gating strategy using a small number of samples, without requiring any coding skills. The trained model can then be applied to a new dataset to identify the populations/subsets present in your cytometry data.
Train a new model with the automatic gating algorithm
To train a model, you’ll need to manually gate at least 10 experimental files. Once you have established your gating strategy and made the necessary tailoring by file and/or population, apply the gates. During model training you can establish a model that is specific for your panel and gating strategy and that can subsequently be used for the automatic analysis of new data acquired using the same marker panel.
You can initiate model training by going to Advanced Analysis > Automatic gating, then selecting Train new model > Create.
Select the Gating Group
The automatic gating model can only predict populations from one Gating Group. If you have more than one Gating Group in your experiment, you may need to train multiple models. Select the appropriate Gating Group that contains the populations you would like to predict in the Gating Group selector box. This will update the Populations selector box to show all available populations from the selected Gating Group.
Select populations
Select the populations that you want the model to predict or, in other words, to be able to gate automatically. Note that you can only select populations that are present in the Gating Group chosen in the previous step. It is necessary to include all populations upstream of a given population. The population selector will make these adjustments automatically for you. All channels used in the population hierarchy to define the selected populations will be taken into account when training the model.
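For readers who want to see the idea in code, here is a minimal Python sketch of what "including all upstream populations" means. It is illustrative only and not Cytobank code; the population names and parent relationships are a hypothetical example hierarchy.

```python
# Illustrative sketch (not Cytobank code): selecting a population implies
# selecting every population upstream of it in the gating hierarchy.
# The parent mapping below is a hypothetical example hierarchy.
parents = {
    "CD4+ T cells": "CD3+ T cells",
    "CD8+ T cells": "CD3+ T cells",
    "CD3+ T cells": "Live singlets",
    "Live singlets": None,  # root of the hierarchy
}

def with_upstream(selected, parents):
    """Return the selection expanded with all upstream (ancestor) populations."""
    closed = set()
    for pop in selected:
        while pop is not None:
            closed.add(pop)
            pop = parents.get(pop)
    return closed

print(sorted(with_upstream({"CD4+ T cells"}, parents)))
# ['CD3+ T cells', 'CD4+ T cells', 'Live singlets']
```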
By default, all populations are considered equally important. You can assign a higher weight to one or several populations. For every model that is trained, a key performance indicator (KPI) will be determined that provides a measure of how well a model performs. The KPI weight determines how much the performance of individual populations impacts the overall performance score of the model. By increasing the weight of populations you especially care about, you can bias model training to return a model that performs better for these populations.
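As a rough illustration of how weighting can bias the overall score, here is a minimal Python sketch of a weighted average of per-population scores. This is not the platform's exact KPI formula; the population names, scores, and weights are hypothetical.

```python
# Illustrative sketch (not the platform's exact KPI formula): a weighted
# average of per-population performance scores. Names, scores, and weights
# are hypothetical.
per_population_score = {"CD4+ T cells": 0.97, "CD8+ T cells": 0.95, "Tregs": 0.80}
weight = {"CD4+ T cells": 1, "CD8+ T cells": 1, "Tregs": 3}  # emphasize Tregs

overall_kpi = (
    sum(per_population_score[p] * weight[p] for p in per_population_score)
    / sum(weight.values())
)
print(round(overall_kpi, 3))  # Tregs now account for 3/5 of the weighted score
```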
Please note that the most important factor for model performance is the quality of the manual gating on the training data.
Select FCS files
In the next step, you can select the files to be included in the training process. To ensure the model can work well on new data, the files included in model training should reflect the expected variability, e.g., from reagent batch effects, slight changes in instrument settings, or biological variability. A minimum of 10 files is required, and it is recommended to include as many samples for training as possible.
To make file selection easier, the FCS files selector box will only let you choose files from the Gating Group you selected; files from other Gating Groups will be greyed out. You can also see all sample tags associated with any file, and clicking on a sample tag filters the file list.
The files you include will automatically be split into a training, a validation, and a test set. The training set is the portion of the data the model is fitted to during training.
The validation set is a part of the data that is excluded from model training and used to give an estimate of the model performance while the model’s hyperparameters are being tuned.
The test set is used to assess the performance of the fully trained model. Because the files in the test set were used neither to train the model nor to evaluate performance during hyperparameter tuning, they provide an estimate of how accurately the model can predict populations in new data, i.e., the generalization error. On the Cytobank platform, the test set is used to determine the KPI of the model.
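To make the three roles concrete, here is a minimal Python sketch of a train/validation/test split over a list of files. It is illustrative only; the file names and the 60/20/20 proportions are assumptions, not the platform's actual split.

```python
# Illustrative sketch (not Cytobank code): splitting files into training,
# validation, and test sets. The file names and 60/20/20 proportions are
# assumptions for illustration only.
import random

files = [f"sample_{i:02d}.fcs" for i in range(1, 16)]  # hypothetical file names
rng = random.Random(42)  # fixed seed so the split is repeatable
shuffled = files[:]
rng.shuffle(shuffled)

n_train = int(0.6 * len(shuffled))
n_val = int(0.2 * len(shuffled))
train_set = shuffled[:n_train]                # used to fit the model
val_set = shuffled[n_train:n_train + n_val]   # used while tuning hyperparameters
test_set = shuffled[n_train + n_val:]         # held out to estimate the KPI
```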
Event sampling
The training of machine learning models can be compute-intensive and time-consuming. To reduce the time required for large training datasets, the Cytobank automatic gating functionality supports up to 4 million events for model training and allows you to apply a downsampling method.
Event sampling offers three options:
- Equal: The same number of events is used from every sample.
- Proportional: Each population/file combination present is sampled according to its relative abundance as a fraction of the total available events.
- Use all events: No downsampling is applied.
The downsampling for automatic gating model training maximizes population diversity across all training-set samples. Rather than selecting events at random, it samples events from all manually gated populations to ensure that every population contributes events to training. This reduces the risk of losing low-frequency populations.
Smart downsampling ensures representation of rare populations
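To illustrate how the Equal and Proportional options differ, here is a minimal Python sketch that derives per-file event counts under a 4-million-event budget. It is illustrative only and simplified to the per-file level (the platform samples at the population/file level); the file names and event counts are hypothetical.

```python
# Illustrative sketch (not Cytobank code): per-file event counts for the
# "Equal" and "Proportional" sampling options under a 4,000,000-event budget.
# File names and sizes are hypothetical.
events_per_file = {"donor_A.fcs": 500_000, "donor_B.fcs": 2_000_000, "donor_C.fcs": 5_500_000}
budget = 4_000_000

# Equal: the same share is drawn from every file (capped by what the file has).
equal = {f: min(n, budget // len(events_per_file)) for f, n in events_per_file.items()}

# Proportional: each file contributes according to its share of all events.
total = sum(events_per_file.values())
proportional = {f: min(n, round(budget * n / total)) for f, n in events_per_file.items()}

print(equal)         # small files keep all events; large files are capped at the equal share
print(proportional)  # larger files contribute proportionally more events
```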
Setting a seed and advanced options
Setting a seed
Many decisions in training machine learning models involve a random component. These include the sampling of training data as well as the parameters used for model training. To ensure the repeatability of training results, a user-defined seed can be set instead of starting with a random seed. This allows the same decisions for sample and parameter selection to be made across training runs. In general, it is advisable to start with a random seed set by the algorithm, and the Cytobank platform saves the seed value for training runs on the setup page after the run is completed. The seed can be used to repeat the model training.
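The snippet below is a minimal Python sketch of why a seed makes random choices repeatable. It is illustrative only, not Cytobank code; the event IDs and seed value are arbitrary examples.

```python
# Illustrative sketch (not Cytobank code): fixing a seed makes random choices
# (such as event sampling) repeatable across training runs.
import random

def sample_events(event_ids, n, seed=None):
    """Draw n events; with the same seed, the same events are drawn every time."""
    rng = random.Random(seed)
    return rng.sample(event_ids, n)

ids = list(range(100_000))
run_1 = sample_events(ids, 5, seed=2024)
run_2 = sample_events(ids, 5, seed=2024)
assert run_1 == run_2  # identical selections when the seed is reused
```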
Optimal clusters
This is your best estimate of the number of distinct groups of files among those selected. Usually, this aligns with how you would sample-tag your files into different conditions or time points. Providing it helps the algorithm pick representative samples and improves performance.
Training magnification
The magnification determines how many different models are trained with different parameters on the same training data; the model with the highest KPI is returned to the user. By increasing the weight of a population in a training run with a magnification greater than 1, you may be able to influence model selection to return a model that performs better for your population of interest. Note that increasing the magnification causes a proportional increase in runtime and, with millions of events, may cause the run to fail due to memory constraints.
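For intuition, here is a minimal, self-contained Python sketch of a "train several candidates, keep the best KPI" loop. It is not the platform's training procedure; the hyperparameter choices and the random stand-in for a trained model's KPI are purely illustrative.

```python
# Illustrative sketch (not the platform's training loop): a higher magnification
# means more candidate models are trained with different hyperparameter draws,
# and the candidate with the best KPI is kept. The "KPI" here is a random
# stand-in so the example stays self-contained.
import random

def train_candidates(magnification, seed=0):
    rng = random.Random(seed)
    best_params, best_kpi = None, float("-inf")
    for _ in range(magnification):
        params = {"learning_rate": rng.choice([0.01, 0.05, 0.1])}  # hypothetical
        kpi = rng.uniform(0.8, 0.99)  # stand-in for training and evaluating a model
        if kpi > best_kpi:
            best_params, best_kpi = params, kpi
    return best_params, best_kpi  # runtime grows roughly linearly with magnification

print(train_candidates(magnification=4))
```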
Quad/split gate events
If you are training a model on a gating strategy that contains quadrant gates, situations may occur in which one event is assigned to several quadrants by the trained model, while other events are not assigned to any quadrant. By default, the Cytobank platform treats quadrant gates like four rectangular gates, so the automatic gating model predicts independently for each quadrant whether or not a cell belongs to that population. For cells that look unlike any cells in the training dataset, this can lead to no assignment to any of the quadrant populations; for cells that fall on the edge between two quadrants, it can lead to assignment to both populations. Treating all populations at the same hierarchy level independently, rather than enforcing exclusive relationships, gives you more freedom when developing your gating strategy.
If it is important to you that every cell is assigned to exactly one quadrant population or one split population, you can choose mutually exclusive and exhaustive (MECE) event assignment. To do so, select the Enforce MECE event membership option. You will also have to check the MECE box when running automatic gating inference. Note that this setting influences the algorithm run time.
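The following minimal Python sketch contrasts independent per-quadrant decisions with a MECE assignment for a single event. It is illustrative only; the threshold-based and highest-score rules, the quadrant names, and the scores are assumptions, not the platform's actual decision logic.

```python
# Illustrative sketch (not Cytobank code): independent per-quadrant decisions
# can leave an event in zero or two quadrants, while a MECE rule assigns every
# event to exactly one quadrant. Scores, names, and rules are hypothetical.
quadrant_scores = {  # per-quadrant membership scores for one event
    "CD4+CD8-": 0.55,
    "CD4+CD8+": 0.52,  # near a quadrant boundary: two scores above threshold
    "CD4-CD8+": 0.10,
    "CD4-CD8-": 0.05,
}
threshold = 0.5

# Default behaviour: each quadrant is decided on its own.
independent = [q for q, s in quadrant_scores.items() if s >= threshold]
print(independent)  # ['CD4+CD8-', 'CD4+CD8+'] -- assigned to two quadrants

# Enforce MECE event membership: pick exactly one quadrant, the highest-scoring.
mece = max(quadrant_scores, key=quadrant_scores.get)
print(mece)  # 'CD4+CD8-'
```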
Data scaling and compensation
The automatic gating model learns from patterns in the compensated and scaled data. As with all cytometry data analysis, it is important to make the necessary adjustments to account for fluorescence spillover in flow cytometry data and to scale data appropriately. The quality of compensation and scaling should be comparable between the dataset used for model training and any new dataset on which the model is used for inference.
Please make any desired adjustments to compensation and scaling on the setup page before starting the Autogating analysis.
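For orientation, here is a minimal Python sketch of the two preprocessing steps this section refers to: compensating with a spillover matrix and applying an arcsinh scale. It is illustrative only, not Cytobank code; the spillover values and the cofactor of 150 are hypothetical examples.

```python
# Illustrative sketch (not Cytobank code): compensating flow cytometry events
# with a spillover matrix and applying an arcsinh transform. The spillover
# values and the cofactor of 150 are hypothetical examples.
import numpy as np

raw = np.array([[1200.0, 300.0],   # two events, two fluorescence channels
                [800.0, 2500.0]])

spillover = np.array([[1.00, 0.12],  # fraction of channel 1 signal spilling into channel 2
                      [0.05, 1.00]])

compensated = raw @ np.linalg.inv(spillover)  # undo spillover
scaled = np.arcsinh(compensated / 150.0)      # arcsinh scaling with cofactor 150

print(scaled.round(3))
```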
After selecting your settings, click the green button at the top right to run the Autogate training analysis. You will receive an e-mail notification once it is completed. You can access the results by clicking the link in the notification e-mail or via the experiment navigation bar (Action > View experiment summary).
Test set review
There is an option to create an experiment with the blind test files and their inferred populations, which can make it easier to visually evaluate model performance. If the option is selected, the automatically created child experiments contain the subset of FCS files that were assigned to the blind test set. For every predicted population, these files contain one additional parameter following the naming convention auto_gate_Population name. Please see this article for more information about the created experiment.
Learn how to interpret the analysis result in this article.
For Research Use Only. Not for use in diagnostic procedures.