CN116469473B - Model training method, device, equipment and storage medium for T cell subtype identification

Info

Publication number
CN116469473B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310708381.8A
Other languages
Chinese (zh)
Other versions
CN116469473A (en)
Inventor
史植文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyin Oriental Transformation Medical Research Center Co ltd
Original Assignee
Beijing Zhiyin Oriental Transformation Medical Research Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyin Oriental Transformation Medical Research Center Co ltd
Priority to CN202310708381.8A
Publication of CN116469473A
Application granted
Publication of CN116469473B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides a model training method, device, equipment and storage medium for T cell subtype identification, and relates to the technical field of biology. The method comprises the following steps: acquiring a preset modeling dataset; extracting sequencing data of T cells from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset; determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells if the cells corresponding to the sequencing data of the T cells carry annotation information supporting tumor recognition; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells if the cells corresponding to the sequencing data of the T cells do not carry annotation information supporting tumor recognition; and training a preset model to be trained by taking the first correspondence and the second correspondence as training data, to obtain a T cell subtype identification model.

Description

Model training method, device, equipment and storage medium for T cell subtype identification
Technical Field
The invention relates to the technical field of biology, in particular to a model training method, device and equipment for T cell subtype identification and a storage medium.
Background
Tumor-specific T cells are the primary lymphocytes that recognize and kill tumors. In addition, identifying the T cell receptor (TCR) of tumor-specific T cells can provide biomarkers for clinical monitoring of patient treatment, which are used to track the clinical efficacy of anti-tumor immune responses and to study the biological mechanisms of tumor immunotherapy in depth.
Currently, the conventional method for identifying tumor-specific T cells is an ex vivo T cell functional assay.
However, this identification process places high demands on the laboratory platform and has a long identification period. Moreover, a large proportion of tumor-specific T cells may be missed, for example T cells that recognize endogenous viral antigens or terminally exhausted T cells that cannot be activated in vitro, so the identification accuracy for tumor-specific T cells is low.
Disclosure of Invention
The invention provides a model training method, device, equipment and storage medium for T cell subtype identification, to solve the problems of high demands on the laboratory platform, a long identification period and low identification accuracy when identifying tumor-specific T cells in the prior art.
The invention provides a model training method for T cell subtype identification, which comprises the following steps:
acquiring a preset modeling dataset; wherein the modeling dataset comprises at least single-cell sequencing data of tumor-specific T cells;
extracting sequencing data of T cells from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset;
determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells if the cells corresponding to the sequencing data of the T cells carry annotation information supporting tumor recognition; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells if the cells corresponding to the sequencing data of the T cells do not carry annotation information supporting tumor recognition;
and training a preset model to be trained by taking the first correspondence and the second correspondence as training data, to obtain a T cell subtype identification model.
According to the model training method for T cell subtype identification provided by the invention, acquiring the preset modeling dataset comprises:
acquiring a preset candidate dataset;
filtering the sequencing data of the candidate dataset to obtain the modeling dataset;
wherein the filtering operation comprises the following steps:
removing from the candidate dataset sequencing data in which the number of detected genes is less than a first threshold;
removing from the candidate dataset sequencing data in which the number of unique molecular identifiers (UMIs) is less than a second threshold;
removing from the candidate dataset sequencing data in which the proportion of UMIs from mitochondrial gene expression is greater than a third threshold;
and removing from the candidate dataset sequencing data corresponding to doublets (double cells).
According to the model training method for T cell subtype identification provided by the invention, extracting the sequencing data of T cells from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset comprises:
extracting first candidate sequencing data from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset;
and removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells.
According to the model training method for T cell subtype identification provided by the invention, removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells comprises:
removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain second candidate sequencing data;
and processing the second candidate sequencing data with a preset SCTransform algorithm to obtain the sequencing data of the T cells.
According to the model training method for T cell subtype identification provided by the invention, before the preset model to be trained is trained with the first correspondence and the second correspondence as training data to obtain the T cell subtype identification model, the method further comprises:
setting parameters of a preset first candidate model by means of an extreme gradient boosting (XGBoost) algorithm to obtain a preliminary identification model; wherein the parameters include at least one of: maximum tree depth, learning rate and sampling percentage;
taking a preset logistic regression model as a classification model;
and obtaining the model to be trained based on the preliminary identification model and the classification model.
According to the model training method for T cell subtype identification provided by the invention, obtaining the model to be trained based on the preliminary identification model and the classification model comprises:
obtaining a second candidate model based on the preliminary identification model and the classification model;
and calculating target hyperparameters of the second candidate model with a preset 10-fold cross-validation algorithm, and optimizing the second candidate model based on the target hyperparameters to obtain the model to be trained.
The invention also provides a model training device for T cell subtype identification, which comprises:
an acquisition module, configured to acquire a preset modeling dataset; wherein the modeling dataset comprises at least single-cell sequencing data of tumor-specific T cells;
an extraction module, configured to extract sequencing data of T cells from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset;
a determining module, configured to determine a first correspondence between the sequencing data of the T cells and tumor-specific T cells if the cells corresponding to the sequencing data of the T cells carry annotation information supporting tumor recognition, and to determine a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells if the cells corresponding to the sequencing data of the T cells do not carry annotation information supporting tumor recognition;
and a training module, configured to train a preset model to be trained by taking the first correspondence and the second correspondence as training data, to obtain a T cell subtype identification model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a model training method for T cell subtype identification as described in any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements a model training method of T cell subtype identification as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a model training method for T cell subtype identification as described in any one of the above.
The invention provides a model training method, device, equipment and storage medium for T cell subtype identification. Identifying tumor-specific T cells by ex vivo T cell functional assays, as in the related art, places high demands on the laboratory platform, has a long identification period and has low identification accuracy. By comparison, identifying tumor-specific T cells with the T cell subtype identification model trained by the embodiment of the invention is simple to operate and efficient to analyze, effectively shortens the identification period, and improves the identification accuracy for tumor-specific T cells.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a model training method for T cell subtype identification provided by the invention;
FIG. 2 is a second flow chart of the model training method for T cell subtype identification provided by the present invention;
FIG. 3 is a schematic diagram of an example of an identification result in a model training method for T cell subtype identification provided by the invention;
FIG. 4 is a bar graph of the distribution ratio of tumor-specific T cells and other T cell clones in the model training method for T cell subtype identification provided by the present invention;
FIG. 5 is a receiver operating characteristic (ROC) curve in the model training method for T cell subtype identification provided by the present invention;
FIG. 6 is a precision-recall curve in the model training method for T cell subtype identification provided by the present invention;
FIG. 7 is a graph of validation in a model training method for T cell subtype identification provided by the present invention;
FIG. 8 is a schematic diagram of the structure of a model training device for T cell subtype identification provided by the invention;
FIG. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The model training method, device, equipment and storage medium for T cell subtype identification of the present invention are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the model training method for T cell subtype identification provided by the invention. As shown in FIG. 1, the method comprises steps 101 to 104, wherein:
Step 101, acquiring a preset modeling dataset; wherein the modeling dataset comprises at least single-cell sequencing data of tumor-specific T cells;
Step 102, extracting sequencing data of T cells from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset;
Step 103, determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells if the cells corresponding to the sequencing data of the T cells carry annotation information supporting tumor recognition; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells if the cells corresponding to the sequencing data of the T cells do not carry annotation information supporting tumor recognition;
Step 104, training a preset model to be trained by taking the first correspondence and the second correspondence as training data, to obtain a T cell subtype identification model.
In the related art, the conventional method for identifying tumor-specific T cells is an ex vivo T cell functional assay. This screening process places high demands on the laboratory platform, has a long detection period, and can miss a large proportion of tumor-specific T cells, such as T cells that recognize endogenous viral antigens or terminally exhausted T cells that cannot be activated in vitro.
These disadvantages greatly limit the clinical use of TCR-engineered adoptive T cell therapy. In recent years, the application of single-cell sequencing technology in research has gradually revealed the biological properties of tumor-specific T cells; for example, these T cells exhibit a higher exhaustion index. This makes it possible to identify tumor-specific T cells from the single-cell transcriptome characteristics of the T cells.
In the embodiment of the invention, a modeling dataset comprising single-cell sequencing data of tumor-specific T cells is first acquired; the modeling dataset may, for example, be downloaded from a published public database.
Optionally, the tumor-specific T cells may include CD8+ T cells and CD4+ T cells.
After the modeling dataset is acquired, the expression levels of the Marker genes corresponding to the sequencing data in the modeling dataset can be counted and single-cell subpopulations can be classified on that basis, so as to extract the sequencing data of T cells from the modeling dataset.
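The marker-based extraction can be sketched as follows. This is a minimal illustration only: the marker names (CD3D, CD3E), the cutoff and the per-cell dictionary layout are assumptions, since the description does not list specific Marker genes or thresholds, and the subpopulation (cluster-level) classification it mentions is not reproduced here.

```python
# Hypothetical per-cell expression profiles: cell id -> {gene: normalized expression}.
cells = {
    "cell_1": {"CD3D": 2.1, "CD3E": 1.8, "EPCAM": 0.0},
    "cell_2": {"CD3D": 0.0, "CD3E": 0.1, "EPCAM": 3.2},
    "cell_3": {"CD3D": 1.4, "CD3E": 0.9, "EPCAM": 0.0},
}

# Assumed T cell marker genes and expression cutoff (not specified in the source).
T_CELL_MARKERS = ("CD3D", "CD3E")
MIN_MEAN_EXPRESSION = 0.5


def is_t_cell(profile, markers=T_CELL_MARKERS, cutoff=MIN_MEAN_EXPRESSION):
    """Return True when the mean expression of the marker genes reaches the cutoff."""
    mean_expr = sum(profile.get(gene, 0.0) for gene in markers) / len(markers)
    return mean_expr >= cutoff


# Keep only the profiles of cells called as T cells.
t_cell_data = {cell_id: p for cell_id, p in cells.items() if is_t_cell(p)}
print(sorted(t_cell_data))
```

In practice the call would more likely be made per cluster than per cell, which is less sensitive to dropout in individual cells.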
After the sequencing data of the T cells are obtained, each T cell can be assigned to one of two classes, tumor-specific T cells or non-tumor-specific T cells, according to whether it carries annotation information supporting tumor recognition. The embodiment of the invention therefore judges, for the cell corresponding to each piece of T cell sequencing data, whether such annotation information is present, and classifies the cell accordingly. This determines a first correspondence between tumor-specific T cells and their sequencing data, and a second correspondence between non-tumor-specific T cells and their sequencing data. The first correspondence and the second correspondence are then used as training data, and the model to be trained is trained by supervised learning to obtain the T cell subtype identification model.
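One way to picture the two correspondences is as binary labels attached to each cell's expression vector. The sketch below assumes a hypothetical record layout in which a boolean field `recognizes_tumor` stands in for the annotation information; the field names and values are illustrative, not taken from the source.

```python
# Hypothetical T cell records; "recognizes_tumor" stands in for the annotation information.
t_cells = [
    {"cell_id": "c1", "expression": [0.2, 1.5, 0.0], "recognizes_tumor": True},
    {"cell_id": "c2", "expression": [1.1, 0.0, 0.3], "recognizes_tumor": False},
    {"cell_id": "c3", "expression": [0.0, 2.2, 0.9], "recognizes_tumor": True},
]


def build_training_pairs(records):
    """Return (features, labels): label 1 encodes the first correspondence
    (tumor-specific), label 0 the second (non-tumor-specific)."""
    features, labels = [], []
    for record in records:
        features.append(record["expression"])
        labels.append(1 if record["recognizes_tumor"] else 0)
    return features, labels


features, labels = build_training_pairs(t_cells)
print(labels)
```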
Optionally, the first correspondence and the second correspondence obtained above may be used as the input dataset, of which 70% of the data is used as the training set and the remaining 30% as the validation set.
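A 70/30 split of this kind can be sketched with the standard library alone. The shuffling, the fixed seed and the function name are assumptions made for the illustration; the source only states the proportions.

```python
import random


def split_train_validation(features, labels, train_fraction=0.7, seed=0):
    """Shuffle sample indices reproducibly and split them into training and validation parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * len(indices)))
    train_idx, val_idx = indices[:n_train], indices[n_train:]

    def pick(sequence, idx):
        return [sequence[i] for i in idx]

    return (pick(features, train_idx), pick(labels, train_idx),
            pick(features, val_idx), pick(labels, val_idx))


# Toy data: ten samples with one feature each and alternating labels.
X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, y_train, X_val, y_val = split_train_validation(X, y)
print(len(X_train), len(X_val))
```

With strongly imbalanced classes, a stratified split that preserves the class ratio in both parts may be preferable.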
Optionally, the performance of the trained T cell subtype identification model may also be evaluated, for example by calculating the accuracy, recall and F value of the identification, as well as the receiver operating characteristic (ROC) curve and the area under the curve (AUC), where the AUC is the area enclosed between the ROC curve and the coordinate axis.
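These evaluation quantities can be computed directly from validation labels and predicted scores. The sketch below is illustrative: the 0.5 decision threshold and the toy data are assumptions, precision is included because the F value is built from it, and the AUC is computed through its rank interpretation (the probability that a random positive scores above a random negative) rather than by integrating a plotted curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = tumor-specific)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy validation labels and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.4, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```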
Optionally, after the T cell subtype identification model is obtained, tumor-specific T cells, such as CD8+ T lymphocytes, may be identified by the model.
In the model training method for T cell subtype identification provided by the embodiment of the invention, a modeling dataset including single-cell sequencing data of tumor-specific T cells is first acquired, and the sequencing data of T cells is extracted from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data in the modeling dataset. It is then judged whether the cells corresponding to the sequencing data of the T cells carry annotation information supporting tumor recognition, so that these cells are classified as tumor-specific T cells or non-tumor-specific T cells. This determines a first correspondence between tumor-specific T cells and their sequencing data and a second correspondence between non-tumor-specific T cells and their sequencing data. The first correspondence and the second correspondence are then used as training data, and the model to be trained is trained by supervised learning to obtain the T cell subtype identification model. Identifying tumor-specific T cells by ex vivo T cell functional assays, as in the related art, places high demands on the laboratory platform, has a long identification period and has low identification accuracy. By comparison, identifying tumor-specific T cells with the T cell subtype identification model trained by the embodiment of the invention is simple to operate and efficient to analyze, effectively shortens the identification period, and improves the identification accuracy for tumor-specific T cells.
Optionally, acquiring the preset modeling dataset may be implemented as follows:
acquiring a preset candidate dataset;
filtering the sequencing data of the candidate dataset to obtain the modeling dataset;
wherein the filtering operation comprises the following steps:
1) Removing from the candidate dataset sequencing data in which the number of detected genes is less than a first threshold;
Specifically, for example, when a gene identified in a single cell is detected in fewer than 3 cells, the sequencing data corresponding to that cell may be removed from the candidate dataset.
2) Removing from the candidate dataset sequencing data in which the number of unique molecular identifiers (UMIs) is less than a second threshold;
Specifically, when the sequencing count data are abnormal, the number of UMIs may be less than the second threshold, for example the total number of UMIs in a single cell is less than 200; in that case the sequencing data may be removed from the candidate dataset.
3) Removing from the candidate dataset sequencing data in which the proportion of UMIs from mitochondrial gene expression is greater than a third threshold;
Specifically, when the mitochondrial proportion is too high, the proportion of UMIs from mitochondrial gene expression may be greater than the third threshold, for example greater than 20% in a single cell; in that case the sequencing data may be removed from the candidate dataset.
4) Removing from the candidate dataset sequencing data corresponding to doublets (double cells).
Specifically, based on a preset DoubletFinder algorithm, the sequencing data corresponding to doublets in the candidate dataset can be identified and removed.
In the embodiment of the invention, quality control is performed on the sequencing data of the candidate dataset to filter out the sequencing data corresponding to low-quality cells and obtain the modeling dataset. This effectively improves the quality of the training data and, in turn, the identification accuracy of the trained T cell subtype identification model.
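The four filters above can be sketched as a single per-cell predicate. This is an illustration under stated assumptions: the `CellQC` record and its field names are hypothetical, the minimum gene count of 200 is a placeholder (the source gives no value for the first threshold), and the doublet flag is assumed to have been computed beforehand by a doublet-detection tool. The UMI and mitochondrial thresholds use the example values of 200 and 20% from the text.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Hypothetical per-cell quality-control summary."""
    cell_id: str
    n_genes: int      # number of genes detected in the cell
    n_umi: int        # total UMI count of the cell
    mito_umi: int     # UMIs assigned to mitochondrial genes
    is_doublet: bool  # assumed to come from a doublet-detection step


def passes_qc(cell, min_genes=200, min_umi=200, max_mito_fraction=0.20):
    """Apply the four filters; min_genes is a placeholder, not a value from the source."""
    if cell.n_genes < min_genes:
        return False
    if cell.n_umi < min_umi or cell.n_umi == 0:
        return False
    if cell.mito_umi / cell.n_umi > max_mito_fraction:
        return False
    return not cell.is_doublet


candidate_dataset = [
    CellQC("a", n_genes=1500, n_umi=4000, mito_umi=200, is_doublet=False),
    CellQC("b", n_genes=50, n_umi=4000, mito_umi=100, is_doublet=False),
    CellQC("c", n_genes=800, n_umi=150, mito_umi=5, is_doublet=False),
    CellQC("d", n_genes=900, n_umi=1000, mito_umi=300, is_doublet=False),
    CellQC("e", n_genes=1200, n_umi=3000, mito_umi=90, is_doublet=True),
]

modeling_dataset = [cell for cell in candidate_dataset if passes_qc(cell)]
print([cell.cell_id for cell in modeling_dataset])
```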
Optionally, extracting the sequencing data of the T cells from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset may be implemented as follows:
extracting first candidate sequencing data from the modeling dataset based on the expression levels of the Marker genes corresponding to the sequencing data of the modeling dataset;
and removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells.
Specifically, after the first candidate sequencing data is extracted from the modeling dataset based on the expression levels of the Marker genes, the highly variable genes in the first candidate sequencing data can be filtered, namely by removing the T cell receptor genes and the tissue dissociation-induced genes, so as to obtain the sequencing data of the T cells.
A highly variable gene is a gene whose expression differs most between cells when cells are compared. Basing the analysis on highly variable genes helps to distinguish different types of cells and thus improves the identification accuracy for tumor-specific T cells.
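A minimal sketch of this gene filter is shown below. Two things are assumptions rather than content of the source: the regular expression, which relies on the human gene-symbol convention that TCR genes begin with TRA, TRB, TRD or TRG followed by V, D, J or C, and the small set of dissociation-induced genes, which is only an illustrative placeholder list.

```python
import re

# Assumed naming convention for human TCR genes (e.g. TRAV12-1, TRBV7-2, TRAC).
TCR_GENE_PATTERN = re.compile(r"^TR[ABDG][VDJC]")

# Illustrative placeholder set of tissue dissociation-induced genes.
DISSOCIATION_GENES = {"FOS", "JUN", "HSPA1A", "DUSP1"}


def filter_variable_genes(variable_genes):
    """Drop TCR genes and dissociation-induced genes from a list of variable genes."""
    return [
        gene for gene in variable_genes
        if not TCR_GENE_PATTERN.match(gene) and gene not in DISSOCIATION_GENES
    ]


variable_genes = ["GZMB", "TRBV7-2", "FOS", "PDCD1", "TRAC", "HSPA1A", "CXCL13"]
kept_genes = filter_variable_genes(variable_genes)
print(kept_genes)
```

Removing TCR genes keeps the model from learning clonotype identity instead of cell state, and removing dissociation-induced genes limits the influence of sample handling.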
Optionally, removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells may be implemented as follows:
removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain second candidate sequencing data;
and processing the second candidate sequencing data with a preset SCTransform algorithm to obtain the sequencing data of the T cells.
Specifically, after the highly variable genes of the first candidate sequencing data are filtered, the sequencing data can be processed with a preset SCTransform algorithm to obtain the sequencing data of the T cells. The SCTransform algorithm can scale the sequencing data and reduce its dimensionality, normalize expression levels and remove the influence of sequencing depth, which effectively improves the quality of the training data and, in turn, the identification accuracy of the trained T cell subtype identification model.
Optionally, the SCTransform algorithm of the single-cell analysis software Seurat may be used to process the sequencing data, so as to normalize expression levels and remove the influence of sequencing depth.
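To illustrate what removing the influence of sequencing depth means, the sketch below computes simplified Pearson residuals under a negative binomial model. It is not the SCTransform implementation, which fits a regularized regression per gene in R; the fixed dispersion `theta`, the expected-count formula and the clipping bound are assumptions chosen for the illustration.

```python
import math


def pearson_residuals(counts, theta=100.0):
    """counts: list of cells, each a list of per-gene UMI counts.

    The expected count is mu = cell_total * gene_total / grand_total, and the
    residual is (x - mu) / sqrt(mu + mu^2 / theta), clipped to +/- sqrt(n_cells).
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[g] for row in counts) for g in range(n_genes)]
    grand_total = sum(cell_totals)
    clip = math.sqrt(n_cells)

    residuals = []
    for c, row in enumerate(counts):
        out = []
        for g, x in enumerate(row):
            mu = cell_totals[c] * gene_totals[g] / grand_total
            if mu == 0:
                out.append(0.0)  # gene or cell with no counts carries no signal
                continue
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, r)))
        residuals.append(out)
    return residuals


# Two cells with the same composition, the second sequenced twice as deeply.
same_composition = [[10, 20, 30], [20, 40, 60]]
residuals = pearson_residuals(same_composition)
print(residuals)
```

Because the two toy cells differ only in depth, all of their residuals are zero, which is the sense in which the depth effect is removed.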
Optionally, before the preset model to be trained is trained with the first correspondence and the second correspondence as training data to obtain the T cell subtype identification model, the method may further include:
setting parameters of a preset first candidate model by means of an extreme gradient boosting (XGBoost) algorithm to obtain a preliminary identification model; wherein the parameters include at least one of: maximum tree depth, learning rate and sampling percentage;
taking a preset logistic regression model as a classification model;
and obtaining the model to be trained based on the preliminary identification model and the classification model.
Specifically, an extreme gradient boosting algorithm can be used to set the parameters of the first candidate model to obtain the preliminary identification model, where the parameters set may include at least one of maximum tree depth, learning rate and sampling percentage; a logistic regression model is selected as the classification model; and the model to be trained is obtained based on the preliminary identification model and the classification model. The embodiment of the invention thus provides a specific way of obtaining the model to be trained.
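The three named parameters map naturally onto the XGBoost library's `max_depth`, `eta` (learning rate) and `subsample` settings. The configuration below is a sketch under assumptions: all values are placeholders, and using the `binary:logistic` objective is one plausible reading of combining gradient boosting with a logistic regression classifier, not something the source states.

```python
# Assumed XGBoost parameter names; every value is a placeholder, not taken from the source.
preliminary_model_params = {
    "max_depth": 6,                  # maximum depth of each tree
    "eta": 0.1,                      # learning rate
    "subsample": 0.8,                # fraction of training rows sampled per tree
    "objective": "binary:logistic",  # logistic regression for binary classification
    "eval_metric": "auc",            # area under the ROC curve
}


def validate_params(params):
    """Basic sanity checks on the three parameters named in the description."""
    depth_ok = isinstance(params["max_depth"], int) and params["max_depth"] > 0
    eta_ok = 0.0 < params["eta"] <= 1.0
    subsample_ok = 0.0 < params["subsample"] <= 1.0
    return depth_ok and eta_ok and subsample_ok


params_valid = validate_params(preliminary_model_params)
print(params_valid)

# With the xgboost package installed, the dictionary could be passed on as, for example:
#   booster = xgboost.train(preliminary_model_params, dtrain, num_boost_round=100)
# where dtrain is an xgboost.DMatrix built from the training features and labels.
```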
Optionally, obtaining the model to be trained based on the preliminary identification model and the classification model may be implemented as follows:
obtaining a second candidate model based on the preliminary identification model and the classification model;
and calculating target hyperparameters of the second candidate model with a preset 10-fold cross-validation algorithm, and optimizing the second candidate model based on the target hyperparameters to obtain the model to be trained.
Specifically, the optimal hyperparameters of the model are calculated with a 10-fold cross-validation algorithm and taken as the target hyperparameters; the optimized model is then obtained and used as the T cell subtype identification model, which can effectively improve the identification accuracy of the T cell subtype identification model.
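Hyperparameter selection by 10-fold cross-validation can be sketched as a grid search over candidate settings, keeping the setting with the best mean validation score. The grid values, the function names and the `fit_and_score` callback are hypothetical; a toy scoring function stands in for actually fitting and evaluating a model on each fold.

```python
import itertools
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def select_hyperparameters(n_samples, grid, fit_and_score, k=10):
    """Return the grid point with the highest mean score over the k folds."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [fit_and_score(params, train_idx, val_idx)
                  for train_idx, val_idx in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Stand-in for training on train_idx and scoring on val_idx (hypothetical)."""
    return -abs(params["max_depth"] - 4) - abs(params["eta"] - 0.1)


grid = {"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}
best_params, best_score = select_hyperparameters(100, grid, toy_score)
print(best_params)
```

In a real run, `fit_and_score` would train the candidate model on the training indices and return a validation metric such as the AUC.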
The following illustrates a model training method for T cell subtype identification provided by the examples of the present invention.
The method is based on single-cell transcriptome data of infiltrating lymphocytes from surgically resected tumor samples or puncture (biopsy) samples of cancer patients, so the test samples are easy to obtain. Compared with the conventional experimental workflow for identifying tumor-specific T cells, the T cell subtype identification model established by the invention greatly shortens the identification period and reduces the identification cost, and the analysis results show that 99% of the tumor-infiltrating CD8+ T lymphocytes to be tested can be correctly classified. The identification method is simple to operate and efficient to analyze. When the identification result is combined with single-cell immune repertoire sequencing data, the T cell receptor sequence information of tumor-specific T cells can be obtained directly, which lays a foundation for subsequent TCR-engineered cell therapy. Therefore, the T cell subtype identification model trained by the model training method provided by the invention can serve as an effective screening tool for adoptive cell therapy and can be widely applied in the field of tumor immunotherapy.
1. The model training method for T cell subtype identification comprises the following steps:
S1, acquiring a single-cell sequencing data set (the data set for establishing a model) containing neoantigen-specific CD8+ T cells (tumor-specific T cells), wherein the data set is downloaded from a published public database;
S2, performing quality control on the single-cell transcriptome sequencing data in the data set for establishing a model: according to the number of genes detected in each single cell, the sequencing count and the proportion of mitochondrial genes, removing single-cell sequencing data with too many or too few detected genes, abnormal sequencing counts, or a high proportion of mitochondrial genes, and filtering out the single-cell sequencing data of doublets;
specifically, quality control is performed on the single-cell transcriptome sequencing data, and low-quality data are filtered out according to the following criteria:
1) genes detected in fewer than 3 cells are removed;
2) cells with a total UMI count of less than 200 are removed;
3) cells in which mitochondrial genes account for more than 20% of the UMIs are removed;
4) doublets are removed according to the analysis result of DoubletFinder.
S3, based on the single-cell transcriptome sequencing data after quality control and filtering, counting the expression levels of Marker genes, classifying the single-cell subsets, and extracting the single-cell transcriptome sequencing data of tumor-infiltrating CD8+ T cells; filtering the highly variable genes by removing T cell receptor genes and genes whose expression is induced during tissue dissociation (tissue dissociation-induced genes);
S4, scaling the single-cell transcriptome sequencing data of the tumor-infiltrating CD8+ T cells, specifically using the SCTransform algorithm of the single-cell analysis software Seurat to normalize expression levels and remove the influence of sequencing depth; the scaled data of the top 1500 highly variable genes ranked by residual variance may be extracted for subsequent use;
alternatively, the genes may be ranked based on the relationship between expression mean and variance.
S5, dividing the CD8+ T cells into tumor-specific T cells and non-tumor-specific T cells according to annotation information on whether each cell recognizes the tumor, and combining these labels with the highly variable gene scaled data from S4 as the input data set of the machine learning model (the model to be trained);
specifically, the input data set may be divided into a training set containing 70% of the data and a validation set containing the remaining 30%;
S6, setting the parameters of a preliminary identification model, including the maximum tree depth, the learning rate and the sampling percentage, using the extreme gradient boosting (XGBoost) algorithm, and selecting a logistic regression model as the classification model;
S7, calculating the optimal hyperparameters of the model by 10-fold cross-validation to obtain an optimized neoantigen-specific CD8+ T cell subtype identification model;
S8, evaluating the performance of the established machine learning model (the T cell subtype identification model), including calculating the accuracy, recall, F value and ROC curve/AUC.
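Step S5 above (labelling each CD8+ T cell and splitting the data 70%/30%) can be sketched as follows. This is an illustrative assumption rather than the method of the invention: the function names are hypothetical, and splitting class by class (stratification) is an added choice that the text does not specify.

```python
import random


def label_cells(cell_ids, tumor_reactive_ids):
    """Label 1 for cells annotated as recognizing the tumor, 0 otherwise."""
    reactive = set(tumor_reactive_ids)
    return {cell: int(cell in reactive) for cell in cell_ids}


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split cell IDs into training and validation lists, class by class."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels.values())):
        members = sorted(cell for cell, lab in labels.items() if lab == cls)
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        # The first 70% of each class goes to training, the rest to validation.
        train += members[:cut]
        valid += members[cut:]
    return train, valid
```

Splitting within each class keeps the proportion of tumor-specific cells similar in both sets, which matters when tumor-specific cells are a small minority of the infiltrating CD8+ T cells.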
2. FIG. 2 is a second schematic flow chart of the model training method for T cell subtype identification provided by the invention. As shown in FIG. 2, the method comprises the following steps:
1. collecting tumor-specific T cell data;
2. quality control of the sequencing data and quantification of expression levels;
3. annotating CD8+ T cells;
4. cleaning the CD8+ T cell expression matrix data;
5. establishing a machine learning model.
Specifically, the collected raw data (the data set for establishing a model) are first filtered, aligned, quantified and annotated to obtain the gene expression matrix of CD8+ T cells (the single-cell transcriptome sequencing data of CD8+ T cells); the matrix is then further filtered, normalized and scaled; finally, neoantigen-specific CD8+ T cells are identified by the machine learning algorithm for tumor-specific T cells (the T cell subtype identification model). The specific steps of the analysis method are as follows:
(1) Data quality control: selecting single-cell sequencing data of tumor tissue, and performing quality control filtering on the single-cell sequencing data using the Seurat software;
(2) CD8+ T cell identification: based on the single-cell sequencing data after quality control filtering, counting the expression levels of the Marker genes (CD3D, CD3G, CD8A, CD8B and CD45), identifying CD8+ T cells, and then extracting the single-cell transcriptome sequencing data of the CD8+ T cells;
(3) Scaling and filtering of the single-cell transcriptome sequencing data of CD8+ T cells: normalizing expression levels using the SCTransform algorithm of the single-cell analysis software Seurat to remove the influence of sequencing depth; filtering the highly variable genes by removing T cell receptor genes and tissue dissociation-induced genes, and extracting the scaled data of the top 1500 highly variable genes;
(4) The highly variable gene scaled data from step (3) are used as the input data set of the machine learning model for identifying tumor-specific T cells (the model to be trained), and identification is performed. FIG. 3 is a schematic diagram of an example identification result in the model training method for T cell subtype identification provided by the invention; as shown in FIG. 3, a uniform manifold approximation and projection (UMAP) plot shows the distribution of neoantigen-specific T cells (i.e., tumor-specific T cells) among the infiltrating CD8+ T cells of a tumor patient;
(5) The clonal expansion of the identified neoantigen-specific CD8+ T cells is assessed according to their TCR sequence information; the result is shown in FIG. 4, which is a bar chart of the distribution proportions of tumor-specific T cell clones and other T cell clones in the model training method for T cell subtype identification provided by the invention.
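The clonal expansion assessment in step (5) can be sketched as a simple count of cells per TCR clonotype. The sketch assumes each cell has already been assigned a clonotype identifier (for example from its paired TCR sequences); the function names and the expansion cutoff of two cells are hypothetical.

```python
from collections import Counter


def clone_sizes(cell_to_clonotype):
    """Count how many cells share each TCR clonotype."""
    return Counter(cell_to_clonotype.values())


def expanded_fraction(cell_to_clonotype, cells, min_size=2):
    """Fraction of the given cells whose clonotype has at least min_size cells."""
    sizes = clone_sizes(cell_to_clonotype)
    # Cells without a detected TCR are left out of the denominator.
    with_tcr = [cell for cell in cells if cell in cell_to_clonotype]
    if not with_tcr:
        return 0.0
    expanded = sum(sizes[cell_to_clonotype[cell]] >= min_size for cell in with_tcr)
    return expanded / len(with_tcr)
```

Calling `expanded_fraction` once on the cells predicted to be tumor-specific and once on the remaining cells gives the two proportions that a bar chart such as FIG. 4 would compare.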
In addition, FIG. 5 is a receiver operating characteristic (ROC) curve in the model training method for T cell subtype identification provided by the invention; as shown in FIG. 5, the curve is used to demonstrate the specificity and sensitivity of the model.
FIG. 6 is a precision-recall curve in the model training method for T cell subtype identification provided by the invention; as shown in FIG. 6, the curve is used to demonstrate the recall and precision of the model.
FIG. 7 is a validation curve in the model training method for T cell subtype identification provided by the invention; as shown in FIG. 7, the curve is used to show that the model is neither over-fitted nor under-fitted.
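The quantities summarized by the ROC and precision-recall curves can be computed from predicted scores as in the following minimal sketch. The function names and the 0.5 decision threshold are assumptions for illustration; the source does not state how its curves were produced.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold=0.5):
    """Precision and recall of the positive class at one score threshold."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, lab in zip(predicted, labels) if p == 1 and lab == 1)
    fp = sum(1 for p, lab in zip(predicted, labels) if p == 1 and lab == 0)
    fn = sum(1 for p, lab in zip(predicted, labels) if p == 0 and lab == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 value)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Sweeping `threshold` over the range of scores and plotting the resulting precision and recall pairs traces a precision-recall curve; the pairwise comparison in `roc_auc` is quadratic in the number of cells, which is acceptable for a validation set but not for very large data.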
The model training device for T cell subtype identification provided by the invention is described below; the device described below and the model training method for T cell subtype identification described above may be referred to in correspondence with each other.
FIG. 8 is a schematic structural diagram of the model training device for T cell subtype identification provided by the invention. As shown in FIG. 8, the model training device 800 for T cell subtype identification includes:
an obtaining module 801, configured to obtain a preset data set for establishing a model, wherein the data set for establishing a model comprises at least single-cell sequencing data of tumor-specific T cells;
an extraction module 802, configured to extract sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
a determining module 803, configured to determine a first correspondence between the sequencing data of the T cells and tumor-specific T cells in a case where the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor, and to determine a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells in a case where the annotation information indicates that those cells do not recognize a tumor;
a training module 804, configured to train a preset model to be trained by using the first correspondence and the second correspondence as training data, so as to obtain a T cell subtype identification model.
In the model training device for T cell subtype identification provided by the embodiment of the invention, the obtaining module first obtains a data set for establishing a model that includes single-cell sequencing data of tumor-specific T cells. The extraction module extracts the sequencing data of T cells from that data set based on the Marker gene expression levels corresponding to the sequencing data in it. The determining module judges whether the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor, thereby classifying those cells as tumor-specific T cells or non-tumor-specific T cells, and determines a first correspondence between tumor-specific T cells and their sequencing data and a second correspondence between non-tumor-specific T cells and their sequencing data. The training module then trains the model to be trained in a supervised manner, using the first correspondence and the second correspondence as training data, to obtain the T cell subtype identification model. In the related art, tumor-specific T cells are identified by in vitro T cell function tests, which place high demands on the laboratory platform, take a long time, and have low identification accuracy. By contrast, identifying tumor-specific T cells with the T cell subtype identification model trained according to the embodiment of the invention is simple to operate and efficient to analyze, effectively shortens the identification period, and improves the identification accuracy of tumor-specific T cells.
Optionally, the obtaining module 801 is specifically configured to:
acquiring a preset candidate data set;
filtering the sequencing data of the candidate data set to obtain the data set for establishing a model;
wherein the filtering operation comprises the steps of:
removing, from the candidate data set, sequencing data in which the number of detected genes is smaller than a first threshold;
removing, from the candidate data set, sequencing data in which the number of unique molecular identifiers (UMIs) is less than a second threshold;
removing, from the candidate data set, sequencing data in which the proportion of UMIs from mitochondrial genes is greater than a third threshold;
and removing sequencing data corresponding to doublets from the candidate data set.
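The four filtering operations above can be sketched on a toy count table as follows. This is a simplified illustration, not the Seurat or DoubletFinder workflow: the data layout (a dictionary of per-cell gene counts), the function name, and the `MT-` prefix used to recognize human mitochondrial genes are assumptions; the defaults of 200 UMIs and 20% come from the example criteria in the description, while `min_genes` has no value in the source and must be supplied by the caller.

```python
def filter_cells(counts, min_genes, min_umis=200, max_mito_fraction=0.20,
                 doublets=frozenset(), mito_prefix="MT-"):
    """Return the IDs of cells that pass all four quality-control filters.

    counts maps cell ID -> {gene symbol: UMI count}; doublets is a set of
    cell IDs flagged by a separate doublet-detection tool.
    """
    kept = []
    for cell, gene_counts in counts.items():
        total = sum(gene_counts.values())
        n_genes = sum(1 for c in gene_counts.values() if c > 0)
        mito = sum(c for g, c in gene_counts.items() if g.startswith(mito_prefix))
        if total == 0 or n_genes < min_genes:
            continue  # first threshold: too few detected genes
        if total < min_umis:
            continue  # second threshold: too few UMIs
        if mito / total > max_mito_fraction:
            continue  # third threshold: too much mitochondrial signal
        if cell in doublets:
            continue  # flagged as a doublet
        kept.append(cell)
    return kept
```

A high mitochondrial fraction is commonly read as a sign of a damaged or dying cell, which is why it is used as a removal criterion alongside the count-based thresholds.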
Optionally, the extracting module 802 is specifically configured to:
extracting first candidate sequencing data from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
and removing T cell receptor genes and tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells.
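Marker-based extraction of candidate CD8+ T cells can be illustrated with a deliberately simple per-cell rule. The rule used here (at least one CD3 marker and at least one CD8 marker detected) and the function names are assumptions; the source only states that extraction is based on Marker gene expression, and in practice annotation is often done per cluster rather than per cell.

```python
CD3_MARKERS = ("CD3D", "CD3G")
CD8_MARKERS = ("CD8A", "CD8B")


def is_cd8_t_cell(gene_counts, min_count=1):
    """True if the cell expresses a CD3 marker and a CD8 marker."""
    has_cd3 = any(gene_counts.get(g, 0) >= min_count for g in CD3_MARKERS)
    has_cd8 = any(gene_counts.get(g, 0) >= min_count for g in CD8_MARKERS)
    return has_cd3 and has_cd8


def select_cd8_t_cells(counts, min_count=1):
    """Return the IDs of cells classified as CD8+ T cells.

    counts maps cell ID -> {gene symbol: UMI count}.
    """
    return [cell for cell, gc in counts.items() if is_cd8_t_cell(gc, min_count)]
```

The description also lists CD45 as a Marker; it is omitted from this rule because CD45 marks immune cells in general, and in expression matrices it usually appears under the gene symbol PTPRC.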
Optionally, the extracting module 802 is further specifically configured to:
removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain second candidate sequencing data;
and processing the second candidate sequencing data through a preset SCTransform algorithm to obtain the sequencing data of the T cells.
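The gene-level filtering can be sketched as below. Note the simplifications: genes are ranked by plain variance rather than by SCTransform residual variance, T cell receptor genes are recognized by a naming heuristic on human gene symbols, and the list of tissue dissociation-induced genes is supplied by the caller. All names in the sketch are hypothetical.

```python
import re
from statistics import pvariance

# Heuristic for human TCR gene symbols: V/D/J segments such as TRAV1-2 or
# TRBJ2-7, and constant regions such as TRAC or TRBC1.
TCR_GENE_PATTERN = re.compile(r"^TR[ABGD]([VDJ]\d|C\d?$)")


def top_variable_genes(expression, dissociation_genes, n_top=1500):
    """Rank genes by variance after excluding TCR and dissociation genes.

    expression maps gene symbol -> list of per-cell expression values.
    """
    excluded = set(dissociation_genes)
    variances = {
        gene: pvariance(values)
        for gene, values in expression.items()
        if gene not in excluded and not TCR_GENE_PATTERN.match(gene)
    }
    # Highest variance first; ties are broken alphabetically for determinism.
    ranked = sorted(variances, key=lambda g: (-variances[g], g))
    return ranked[:n_top]
```

Excluding T cell receptor genes keeps the classifier from keying on clone-specific receptor segments, and excluding dissociation-induced genes keeps it from learning sample-handling artifacts rather than cell state.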
Optionally, the training module 804 is specifically configured to:
setting parameters of a preset first candidate model through an extreme gradient boosting algorithm to obtain a preliminary identification model; wherein the parameters include at least one of: maximum tree depth, learning rate and sampling percentage;
taking a preset logistic regression model as a classification model;
and obtaining the model to be trained based on the preliminary identification model and the classification model.
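The three parameters named above can be laid out as a grid of candidate settings, as in the sketch below. The key names follow the parameter names documented by the XGBoost library (`max_depth`, `learning_rate`, `subsample`, and the `binary:logistic` objective); mapping the "logistic regression model" of the text to that objective is an assumption, and the function name and any values passed in are hypothetical.

```python
from itertools import product


def parameter_grid(max_depths, learning_rates, subsamples):
    """Enumerate candidate settings for a gradient-boosted tree classifier."""
    return [
        {
            "max_depth": depth,           # maximum depth of each tree
            "learning_rate": rate,        # shrinkage applied to each boosting step
            "subsample": fraction,        # fraction of rows sampled per tree
            "objective": "binary:logistic",  # logistic output for two classes
        }
        for depth, rate, fraction in product(max_depths, learning_rates, subsamples)
    ]
```

Each dictionary in the returned list is one candidate configuration; the grid grows multiplicatively with the number of values per parameter, so a handful of values for each is usually enough for a first pass.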
Optionally, the training module 804 is further specifically configured to:
obtaining a second candidate model based on the preliminary identification model and the classification model;
and calculating target hyperparameters of the second candidate model through a preset 10-fold cross-validation algorithm, and optimizing the second candidate model based on the target hyperparameters to obtain the model to be trained.
FIG. 9 is a schematic structural diagram of an electronic device provided by the invention. As shown in FIG. 9, the electronic device may include: a processor 910, a communication interface 920, a memory 930, and a communication bus 940, wherein the processor 910, the communication interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform a model training method for T cell subtype identification, the method comprising:
acquiring a preset data set for establishing a model, wherein the data set for establishing a model comprises at least single-cell sequencing data of tumor-specific T cells;
extracting sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells in a case where the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells in a case where the annotation information indicates that those cells do not recognize a tumor;
and training a preset model to be trained by using the first correspondence and the second correspondence as training data to obtain a T cell subtype identification model.
Further, the logic instructions in the memory 930 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a model training method for T cell subtype identification provided by the methods described above, the method comprising:
acquiring a preset data set for establishing a model, wherein the data set for establishing a model comprises at least single-cell sequencing data of tumor-specific T cells;
extracting sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells in a case where the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells in a case where the annotation information indicates that those cells do not recognize a tumor;
and training a preset model to be trained by using the first correspondence and the second correspondence as training data to obtain a T cell subtype identification model.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the model training method for T cell subtype identification provided by the methods above, the method comprising:
acquiring a preset data set for establishing a model, wherein the data set for establishing a model comprises at least single-cell sequencing data of tumor-specific T cells;
extracting sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells in a case where the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells in a case where the annotation information indicates that those cells do not recognize a tumor;
and training a preset model to be trained by using the first correspondence and the second correspondence as training data to obtain a T cell subtype identification model.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A model training method for T cell subtype identification, comprising:
acquiring a preset data set for establishing a model, wherein the data set for establishing a model comprises at least single-cell sequencing data of tumor-specific T cells;
extracting sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
determining a first correspondence between the sequencing data of the T cells and tumor-specific T cells in a case where the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor; determining a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells in a case where the annotation information indicates that those cells do not recognize a tumor;
training a preset model to be trained by using the first correspondence and the second correspondence as training data to obtain a T cell subtype identification model;
wherein the extracting sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set comprises:
extracting first candidate sequencing data from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
and removing T cell receptor genes and tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells.
2. The model training method for T cell subtype identification of claim 1, wherein the acquiring a preset data set for establishing a model comprises:
acquiring a preset candidate data set;
filtering the sequencing data of the candidate data set to obtain the data set for establishing a model;
wherein the filtering operation comprises the steps of:
removing, from the candidate data set, sequencing data in which the number of detected genes is smaller than a first threshold;
removing, from the candidate data set, sequencing data in which the number of unique molecular identifiers (UMIs) is less than a second threshold;
removing, from the candidate data set, sequencing data in which the proportion of UMIs from mitochondrial genes is greater than a third threshold;
and removing sequencing data corresponding to doublets from the candidate data set.
3. The model training method for T cell subtype identification of claim 1, wherein the removing of T cell receptor genes and tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells comprises:
removing the T cell receptor genes and the tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain second candidate sequencing data;
and processing the second candidate sequencing data through a preset SCTransform algorithm to obtain the sequencing data of the T cells.
4. The method for training a model for T cell subtype identification according to claim 1, wherein training a preset model to be trained by using the first correspondence and the second correspondence as training data to obtain a T cell subtype identification model comprises:
setting parameters of a preset first candidate model through an extreme gradient boosting algorithm to obtain a preliminary identification model; wherein the parameters include at least one of: maximum tree depth, learning rate and sampling percentage;
taking a preset logistic regression model as a classification model;
and obtaining the model to be trained based on the preliminary identification model and the classification model.
5. The method for training a model for T cell subtype identification according to claim 4, wherein the obtaining the model to be trained based on the preliminary identification model and the classification model comprises:
obtaining a second candidate model based on the preliminary identification model and the classification model;
and calculating target hyperparameters of the second candidate model through a preset 10-fold cross-validation algorithm, and optimizing the second candidate model based on the target hyperparameters to obtain the model to be trained.
6. A model training device for T cell subtype identification, comprising:
an acquisition module, configured to acquire a preset data set for establishing a model, wherein the data set for establishing a model comprises at least single-cell sequencing data of tumor-specific T cells;
an extraction module, configured to extract sequencing data of T cells from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
a determining module, configured to determine a first correspondence between the sequencing data of the T cells and tumor-specific T cells in a case where the annotation information indicates that the cells corresponding to the sequencing data of the T cells recognize a tumor, and to determine a second correspondence between the sequencing data of the T cells and non-tumor-specific T cells in a case where the annotation information indicates that those cells do not recognize a tumor;
a training module, configured to train a preset model to be trained by using the first correspondence and the second correspondence as training data to obtain a T cell subtype identification model;
wherein the extraction module is specifically configured to perform:
extracting first candidate sequencing data from the data set for establishing a model based on the expression levels of Marker genes corresponding to the sequencing data in that data set;
and removing T cell receptor genes and tissue dissociation-induced genes from the highly variable genes of the first candidate sequencing data to obtain the sequencing data of the T cells.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the model training method of T cell subtype identification of any one of claims 1 to 5.
8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the model training method of T cell subtype identification of any one of claims 1 to 5.
CN202310708381.8A 2023-06-15 2023-06-15 Model training method, device, equipment and storage medium for T cell subtype identification Active CN116469473B (en)


Publications (2)

Publication Number Publication Date
CN116469473A CN116469473A (en) 2023-07-21
CN116469473B true CN116469473B (en) 2023-09-22




