CN111415707A - Prediction method of clinical individualized tumor neoantigen - Google Patents

Prediction method of clinical individualized tumor neoantigen Download PDF

Info

Publication number
CN111415707A
CN111415707A CN202010162494.9A CN202010162494A CN111415707A CN 111415707 A CN111415707 A CN 111415707A CN 202010162494 A CN202010162494 A CN 202010162494A CN 111415707 A CN111415707 A CN 111415707A
Authority
CN
China
Prior art keywords
tumor
sequencing
polypeptide
software
tumor cells
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010162494.9A
Other languages
Chinese (zh)
Other versions
CN111415707B (en
Inventor
许恒
舒洋
杨莉
丁振宇
魏于全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202010162494.9A priority Critical patent/CN111415707B/en
Publication of CN111415707A publication Critical patent/CN111415707A/en
Application granted granted Critical
Publication of CN111415707B publication Critical patent/CN111415707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a prediction technology, solves the defects that the existing tumor neoantigen analysis can not adopt a one-step method to obtain the final result and improve the accuracy of prediction according to the feedback of clinical results, and provides a prediction method of clinical individualized tumor neoantigen, and the technical scheme can be summarized as follows: comparing the sequencing data of the tumor cells of the patient with the corresponding reference genome to obtain a DNA and RNA comparison result, analyzing somatic variation, clone type, tumor purity, gene expression quantity in the tumor cells and expression abundance of somatic variation alleles after preprocessing to obtain an analysis result, analyzing the affinity of the potential new antigens and calculating the presentation efficacy of the polypeptides, and scoring and sequencing each corresponding new antigen according to the analysis result, the affinity of the potential new antigens and the presentation efficacy of the polypeptides. The method has the advantages that the accuracy can be improved through a standardized process, the operability and the repeatability are more convenient, and the method is suitable for predicting clinical individualized tumor neoantigens.

Description

Prediction method of clinical individualized tumor neoantigen
Technical Field
The invention relates to a prediction technology, in particular to a prediction technology of individualized tumor neoantigens.
Background
Immunotherapy of tumors is considered as a new generation of tumor treatment following surgery, radiotherapy, chemotherapy and small molecule targeted therapy, which differs from the previous traditional treatment methods in that: immunotherapy changes the therapeutic means from the idea of directly killing tumor cells by drugs into activating immune cells, treats tumors by enhancing the anti-tumor immunity of patients, and has the advantages of accurate killing, small side effect, lasting curative effect, high personalization degree and the like compared with the traditional therapy. In addition, the immune system of the body has the characteristic of immunological memory, so that immunotherapy can help patients to form memory type immunity, and has remarkable advantages in preventing tumor recurrence and metastasis.
Current immunotherapies are mainly divided into two categories: immune checkpoint inhibitor-based immunotherapy and cellular immunotherapy. There are a number of drugs (including the PD-1 inhibitors of schrobo and merck) available on the market as immune checkpoint inhibitors, and significant success has been achieved in the treatment of advanced tumors, especially in patients who are resistant to traditional therapies. This therapy has been used as a first line treatment in the treatment of some tumors. However, the proportion of the population with the overall benefit of the therapy to the total patient population is about 10-30%, which means that a large number of patients still cannot benefit.
The mechanism of T cell specific recognition of Tumor cells is that genetic mutations distinguished from normal cells exist in Tumor cells, and the mutations are different from normal proteins after being translated into proteins, wherein polypeptides formed after ubiquitination and degradation of some mutant proteins are recognized by the immune system, and are expressed on the cell surface by a small number of antigenic Major Histocompatibility Complexes (MHC) of Tumor cells themselves, and are subsequently recognized as foreign substances by the immune system, so that Tumor cells are attacked by immune cells and killed by the immune cells, and the Tumor cells are also well-developed by a special Tumor cell-killing therapy method (Tumor cell-antigen recognition and Tumor cell differentiation) in the local Tumor cell culture, and are not well-activated by Tumor cells, thus the Tumor cells are not fully cultured by Tumor cells, and are not fully developed by Tumor cell-Tumor cell transplantation, and are not fully developed by Tumor cell-specific Tumor cell-killing therapy, and are not fully developed by Tumor cell-activated Tumor cell-cells, and are not fully developed by Tumor cell-specific Tumor cell-Tumor-cell-Tumor-cell-antigen-cell-specific Tumor-cell-specific-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-antigen-cell-.
Although the international large-scale tumor genome project (TCGA) has found a large number of somatic mutations of different tumors and drawn a mutation map, the analysis and verification of new antigens in the mutation map have been developed, the research is particularly remarkable in that the tumor mutation map obtained based on the differences between the western and eastern races and the characteristics of the huge heterogeneity of tumors may not mutate or have low mutation frequency in the patients in China, and the site belonging to the high-frequency mutation in China may not be reported in the western population, and in addition, the key determinant determining whether the mutation has immunogenicity, namely the H L A site mutation, also has huge difference in different races, which suggests that the research and analysis of race specificity of the Chinese population needs to be carried out.
To actively find a new tumor antigen, two conditions must be met: the first is that high-throughput DNA sequence detection can be performed on each individual patient tumor genome to search for all tumor-specific mutations; on the other hand, the antigenicity prediction is carried out on the protein and the polypeptide expressed by the mutation, the potential new antigen peptide segment is synthesized in vitro (the peptide segment is also called tumor vaccine), and finally the high-concentration and high-intensity tumor vaccine is used for activating immune cells. At present, the new antigen immunotherapy aiming at the tumor has better theoretical and practical basis, which is mainly reflected in the following aspects:
1. with the development of the second generation sequencing technology and bioinformatics, short-time large-scale sequencing and analysis become possible, the whole scanning of human genome only needs a week at present, and particularly, the whole exon sequencing can be carried out only aiming at a gene coding region, so that the difficulty of sequencing quantity and analysis is greatly reduced, and the process of identifying all mutations in a tumor becomes simpler and simpler.
2. The studies have found that polypeptides (also called epitopes) formed by certain variant proteins in tumors after degradation are specifically recognized by MHC molecules and form MHC-antigenic peptide complexes, which complexes are then presented on the surface of antigen presenting cells and recognized by T cells as non-self components, so that T cells activate effector T cells to attack and eliminate tumor cells, wherein the most critical step is the binding of MHC molecules to polypeptide molecules, MHC molecules in humans are also known as human leukocyte antigens (H L a), mainly MHC class I molecules and MHC class II molecules, the former being expressed by most cells and mainly involved in the presentation of endogenous antigens, while the latter being mainly expressed by antigen presenting cells and the latter being involved in the presentation of mainly exogenous antigens, wherein three genes encoding MHC I (H L a-A, H L a-B and H L a-C) and three genes encoding MHC 56A, DQ 56A, the more strongly different from DR-L a, DP, have found that the same affinity of these genes is not as a strong for recognizing MHC-antigen-5954A.
3. Until now, scientists in the related art have constructed a plurality of epitope databases and developed a plurality of epitope prediction-related software.
The IEDB database is developed by Peter et al to provide prediction and related annotation for users, including prediction of epitope binding, prediction of antigenic epitope based on peptides present in cells, and prediction of relative ability of polypeptide/MHC complexes to elicit immune responses, etc. the current mainstream methods of antigen prediction are to predict antigenic peptides bound by MHC molecules, which can be classified into three categories based on the characteristics used in prediction, prediction based on sequence information, prediction based on structural information, and prediction based on both sequence information and structural information, the MHC I, II molecules bind to antigenic peptides with their structures located at the far membrane end of the molecule, wherein the MHC I molecule groove is closed, the antigenic peptides bound thereto are 8-10 amino acid residues in length, while the MHC II molecule groove is open, the antigenic length bound thereto is longer than that of MHC I molecules, 13-17 amino acid residues, thus the antigenic peptides bound to MHC I molecules are predicted to occur more accurately than those bound to MHC II molecules, and the prediction of antigenic peptides with their amino acid residues is based on learning of the amino acid sequences of the MHC I motif, and SMM models.
4. The latest four independent studies show that the new antigen predicted by the prediction method combining sequence information and structural information can be used as a tumor vaccine to treat patients and has good effect.
These successful cases suggest that immunotherapy with new antigens has a promising application. However, the method is very individual at present, and the analysis process from sequencing to final selection of new antigens needs to be carried out on each individual in a complicated step and complicated process, and especially, experts in genetics and bioinformatics need to participate. In reported clinical studies, such procedures were acceptable due to the low number of patients enrolled. However, if such successful treatment regimens are followed by routine clinical treatment, most hospitals lack the professional responsible for selection of neoantigens, not only because selection of neoantigens delays treatment time, but also because standardization is difficult in different hospitals, which may cause great differences in the final treatment effect. Therefore, the large-scale spread of immunotherapy based on new antigens requires a standardisable procedure, such as software and analytical procedures using one-step methods. On the other hand, most of the clinical studies are based on the European and American population, and due to the existence of the aforementioned species differences, a species-specific optimization process is required for the prediction process and analysis of novel antigens.
Disclosure of Invention
The invention aims to overcome the defects that the final result cannot be obtained by adopting a one-step method in the conventional tumor neoantigen analysis and the accuracy of prediction is improved according to the feedback of clinical results, and provides a clinical individualized tumor neoantigen prediction method.
The invention solves the technical problem, adopts the technical scheme that the method for predicting the clinical individualized tumor neoantigen comprises the following steps:
step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result;
step 2, preprocessing a DNA comparison result and an RNA comparison result;
step 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation allele by using the DNA comparison result and the RNA comparison result to obtain an analysis result;
step 4, predicting the genotype of MHC class I molecules of the tumor cells of the patient, analyzing the affinity of potential new antigens and calculating the presenting effect of the polypeptide according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient;
and 5, scoring and sequencing each corresponding new antigen according to the analysis result, the affinity of the potential new antigen and the presenting effect of the polypeptide, and presenting.
Specifically, in the step 1, in the comparison of the sequencing data of the tumor cells of the patient with the preset corresponding reference genome, a Burrows-Wheeler transformation algorithm is adopted to obtain a DNA comparison result; the RNA alignment was obtained using the STAR algorithm.
Further, before step 1, the method further comprises the following steps:
and 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
Specifically, in step 0, the preprocessing includes trimming a sequencing linker sequence, trimming a sequencing tag sequence, and removing a sequence with poor sequencing quality, which may be performed by using trimmatic software.
Still further, in step 2, the preprocessing includes removing duplication, adding group information and base mass weight calculation, which can be performed by GATK software.
Specifically, the step 3 comprises the following steps:
step 3A, analyzing somatic mutation sites in tumor cells by using a DNA comparison result;
step 3B, analyzing according to the comparison result of the somatic mutation site and the RNA to obtain the expression abundance of the somatic mutation allele and the gene expression quantity in the tumor cells;
step 3C, filtering false positive sites in the somatic mutation sites through the expression abundance of the somatic mutation alleles to obtain actual somatic mutation sites;
step 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation;
step 3E, calculating a copy number variation file of the tumor;
step 3F, calculating the purity and the clone structure of the tumor according to the somatic mutation and the copy number mutation file of the tumor;
and 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
Further, in step 3D, the somatic mutation is: further filtering was performed against the annotated actual somatic mutation sites, leaving only non-synonymous somatic mutation sites in the exon regions.
Specifically, step 3A adopts Mutect2 software to analyze, step 3B adopts Haslotypercalerer software of GATK to analyze, step 3C adopts FilterMutectCalls software to analyze, step 3D adopts ANNOVAR software to annotate, step 3E adopts cnvkit software to calculate, step 3F adopts ABSOU L TE software to calculate the tumor purity, the clone structure adopts pyClone software to calculate, and the clone structure refers to the mutant gene clone ratio, namely the clone ratio of each somatic mutation site.
Further, in step 3G, the analysis of the gene expression level refers to analyzing with featurepopulations software to obtain counts values, and converting the obtained counts values into the number of transcript sequencing fragments (TPM number) per million sequencing fragments.
Specifically, in the step of converting the obtained counts value into the number of transcript sequencing fragments in each million sequencing fragments, the conversion formula is as follows:
Figure BDA0002406292640000051
wherein R is the count number of the gene, L is the length of the gene,
Figure BDA0002406292640000052
Rknumber of counts for Gene k, LkFor the length of gene k, n is the total number of genes.
Still further, step 4 comprises the steps of:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient;
step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation site;
step 4C, predicting the ability of the MHC class I molecules to combine with wild type polypeptide and variant polypeptide according to the non-synonymous mutein data to obtain affinity prediction;
step 4D, analyzing the polypeptide sequence to obtain the dissociation efficiency, the transport efficiency and the MHC combination efficiency of the polypeptide, and calculating the presentation efficiency of the polypeptide according to the obtained dissociation efficiency, transport efficiency and MHC combination efficiency of the polypeptide;
and 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
Specifically, step 4A adopts polSover software to analyze, step 4C adopts MHC class Iepitope predictors provided by IEDB to analyze, and step 4D adopts netCT L pan software to analyze and calculate.
Further, step 4B specifically includes: downloading a human genome protein sequence fasta file provided by a UCSC website, extracting sequences of 15 amino acids before and after a variant amino acid according to the annotated actual somatic mutation site, respectively obtaining flanking amino acid sequences of wild type protein and mutant protein, and combining all the sequences into the fasta file of a polypeptide sequence, namely non-synonymous mutant protein data.
Specifically, in step 5, the method for scoring each corresponding neoantigen according to the analysis result, the affinity of the potential neoantigen, and the polypeptide presentation efficacy comprises:
score=abundance*dissimilarity*clonality
wherein the content of the first and second substances,
abundance=L(ICm)*Freq*tanh(TPM);
dissimilarity=(1–L(ICw/50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+ex) Tanh () is hyperbolic tangent function calculation, score is score obtained by new antigen calculation, abundance is abundance of mutant polypeptide, TPM is gene expression abundance, Dissimilarity is Dissimilarity (Dissimilarity) which measures the difference of affinity between mutant peptide segment and corresponding normal peptide segment, clonality (clonality) measures the efficiency NCT L of new antigen from mutation to short peptide and the proportion MC of new antigen distribution in all tumor cells, Freq is somatic cell variation allele expression abundance ratio which is the ratio of expression abundance of somatic cell variation allele to gene expression in tumor cells, IC is ICmFor mutant polypeptide affinities, IC, obtained during affinity predictionwFor the wild-type polypeptide binding affinity obtained at the time of affinity prediction, NCT L was presentation efficacy and MC was mutant gene cloning ratio.
Specifically, the method further comprises the following steps:
and 6, optimizing the analysis method of the affinity of the potential new antigen in the step 4 by using a machine learning method based on the experimental result.
The invention has the beneficial effects that: in the scheme of the invention, by adopting the prediction method of the clinical individualized tumor neoantigen, an automatic and standardized neoantigen prediction platform is established by utilizing the existing mature high-throughput analysis software. In the method, a plurality of steps from original data alignment to potential new antigen sequencing are organically integrated, all potential new antigen information of a patient can be analyzed and obtained through a simple computer command, and polypeptide sequence information which can be clinically used for new antigen immunotherapy is provided; although the data are processed and analyzed based on the existing open source software method, the normalized data analysis process is relatively independent relative to software packages, needs to be integrated after respective operation, and has complex dependency relationship among the software.
The updating speed of the currently released affinity prediction database is low, and no relevant data is included for certain specific H L A genotypes, and the affinity prediction database is expanded according to the continuously accumulated data clinically, so that a more accurate prediction result can be obtained, which is also an important breakthrough of the method.
The method of the invention integrates a plurality of high-throughput sequencing data analysis software and antigen affinity analysis software to obtain a standardized analysis process. Through the running test of various conditions, the prediction process has better code robustness.
The selected polypeptide with the top rank is used for detecting the specificity of the polypeptide through E L ISPOT after clinical application, the result is consistent with the prediction result of the novel antigen analysis process, the prediction accuracy rate reaches more than 50 percent (the positive rates of two detected patients are 10/13 and 8/12, wherein the positive rates of the tumor specific strong antigens are 6/13 and 5/12), and the prediction accuracy rate is obviously higher than other related reports.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the following examples.
The prediction method for clinical individualized tumor neoantigen sequencing comprises the following steps:
step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result;
step 2, preprocessing a DNA comparison result and an RNA comparison result;
step 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation allele by using the DNA comparison result and the RNA comparison result to obtain an analysis result;
step 4, predicting the genotype of MHC class I molecules of the tumor cells of the patient, analyzing the affinity of potential new antigens and calculating the presenting effect of the polypeptide according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient;
and 5, scoring and sequencing each corresponding new antigen according to the analysis result, the affinity of the potential new antigen and the presenting effect of the polypeptide, and presenting.
In the step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome, and obtaining a DNA comparison result by adopting a Burrows-Wheeler transformation algorithm; the RNA alignment was obtained using the STAR algorithm.
Before step 1, the method can also comprise the following steps:
and 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
In step 0, the preprocessing includes trimming sequencing linker sequences, trimming sequencing tag sequences, and removing sequences with poor sequencing quality, which can be performed by using trimmatic software.
In step 2, the pretreatment may include removing duplication, adding group information, and calculating the base mass weight, which may be performed by using GATK software.
Step 3 may comprise the steps of:
step 3A, analyzing somatic mutation sites in tumor cells by using a DNA comparison result;
step 3B, analyzing according to the comparison result of the somatic mutation site and the RNA to obtain the expression abundance of the somatic mutation allele and the gene expression quantity in the tumor cells;
step 3C, filtering false positive sites in the somatic mutation sites through the expression abundance of the somatic mutation alleles to obtain actual somatic mutation sites;
step 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation;
step 3E, calculating a copy number variation file of the tumor;
step 3F, calculating the purity and the clone structure of the tumor according to the somatic mutation and the copy number mutation file of the tumor;
and 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
In step 3D, the somatic mutation may be: further filtering was performed against the annotated actual somatic mutation sites, leaving only non-synonymous somatic mutation sites in the exon regions.
Preferably, step 3A adopts Mutect2 software for analysis, step 3B adopts Haslotypercalerer software of GATK for analysis, step 3C adopts Filter MutectCalls software for analysis, step 3D adopts ANNOVAR software for annotation, step 3E adopts cnvkit software for calculation, and step 3F adopts ABSOU L TE software for calculation, and the clone structure adopts pyClone software for calculation, wherein the clone structure generally refers to the clone ratio of the mutant gene, namely the clone ratio of each somatic mutation site.
In step 3G, the analysis of the gene expression level may be performed by using featurepopulations software to obtain counts values, and converting the obtained counts values into the number of transcript sequencing fragments (TPM values) per million sequencing fragments.
In the step of converting the obtained counts value into the number of transcript sequencing fragments in each million sequencing fragments, the conversion formula can be as follows:
Figure BDA0002406292640000081
wherein R is the count number of the gene, L is the length of the gene,
Figure BDA0002406292640000082
Rknumber of counts for Gene k, LkFor the length of gene k, n is the total number of genes.
Step 4 may include the steps of:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient;
step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation site;
step 4C, predicting the ability of the MHC class I molecules to combine with wild type polypeptide and variant polypeptide according to the non-synonymous mutein data to obtain affinity prediction;
step 4D, analyzing the polypeptide sequence to obtain the dissociation efficiency, the transport efficiency and the MHC combination efficiency of the polypeptide, and calculating the presentation efficiency of the polypeptide according to the obtained dissociation efficiency, transport efficiency and MHC combination efficiency of the polypeptide;
and 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
Step 4A adopts polsor software to analyze, step 4C adopts MHC class Iepitope predictors provided by IEDB to analyze, and step 4D adopts netCT L pan software to analyze and calculate.
Step 4B may specifically be: downloading a human genome protein sequence fasta file provided by a UCSC website, extracting sequences of 15 amino acids before and after a variant amino acid according to an annotated actual somatic mutation site, respectively obtaining flanking amino acid sequences of a wild type protein and a mutant protein, and combining all the sequences into a fasta file of a polypeptide sequence, namely non-synonymous mutant protein data.
In step 5, the method for scoring each corresponding neoantigen according to the analysis result, the affinity of the potential neoantigen and the polypeptide presentation efficacy is preferably:
score=abundance*dissimilarity*clonality
wherein the content of the first and second substances,
abundance=L(ICm)*Freq*tanh(TPM);
dissimilarity=(1–L(ICw/50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+ex) Tanh () is hyperbolic tangent function calculation, score is score obtained by neoantigen calculation, abundance is abundance of mutant polypeptide, TPM is gene expression abundance, Dissimilarity is Dissimilarity (Dissimilarity) which measures the difference of affinities of mutant peptide segment and corresponding normal peptide segment, clonality (clonality) measures the efficiency NCT L of neoantigen from mutation to short peptide and the proportion MC of neoantigen distribution in all tumor cells, Freq is proportion of somatic cell variant allele expression abundance which refers to the ratio of somatic cell variant allele expression abundance to gene expression in tumor cells, IC is the ratio of IC to gene expression abundance in tumor cellsmFor mutant polypeptide affinities, IC, obtained during affinity predictionwFor the wild-type polypeptide binding affinity obtained at the time of affinity prediction, NCT L was presentation efficacy and MC was mutant gene cloning ratio.
The method can also comprise the following steps:
and 6, optimizing the analysis method of the affinity of the potential new antigen in the step 4 by using a machine learning method based on the experimental result.
Examples
In this example, the specific prediction method for clinical individualized tumor neoantigen sequencing is as follows:
and 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
The method specifically comprises the following steps: after obtaining a high-throughput sequencing file (fastq file) of original tumor cell DNA/RNA of a patient and DNA materials of blood of the patient, trimming a sequencing linker sequence, trimming a sequencing tag sequence and removing poor sequencing quality on the sequencing fastq file by using Trimmomatic software.
Wherein the parameters are selected as 1) selecting the PE mode, i.e. paired-end sequencing (pair-end) mode for data analysis, 2) I LL UMINAC L IP parameter set to TruSeq3-PE.fa:2:30:10, even if linker sequence information in TruSeq3-PE pooling scheme of I LL UMINA is used for linker removal, which information is present in TruSeq3-PE.fa file and the target sequence cannot be aligned more than 2 bases with the source sequence, the palindromic sequence alignment split threshold is set to 30 (if the library insert is shorter than the sequencing read length, a stretch of bases in forward and reverse sequencing fragments can be completely reverse-complementary, two linker sequences are aligned with the sequencing fragments, while two paired sequencing fragments are aligned with each other, the linker sequences at the ends can be accurately removed, the parameters controlling alignment of the palindromic structure between two paired sequencing fragments to be performed with such a sequencing method, the alignment between the two paired sequencing fragments is performed with each other, the linker fragments, the linker sequence at the ends of the linker sequences can be accurately removed, the parameters are set to 354, the last nucleotide sequence alignment parameter is set to be less than the optimal sequencing primer length of the sequencing fragment 35, the sequencing fragment is set to be equal to 10, the lowest nucleotide sequence alignment parameter is set to be equal to 10, the sequencing copy removal of the sequencing copy number of the sequencing fragment 35, the sequencing fragment, the sequencing copy number of the sequencing fragment, the sequencing fragment is set to be equal to 10, the sequencing copy number of the sequencing fragment, the sequencing fragment is set to be equal to 10, the sequencing fragment is set to less than the sequencing fragment, the sequencing parameter is set to be equal to a sequencing length of sequencing window, the sequencing fragment length of sequencing window, the sequencing.
And (3) the Trimmomatic program takes the original fastq file as an input file, and runs to obtain a processed fastq file, namely a clean fastq file, which is used for the comparison operation of the sequencing fragments in the subsequent step 1.
Step 1, comparing the sequencing data of the tumor cells of the patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result.
The method specifically comprises the following steps: step 0 has obtained a purified fastq file (i.e., clean fastq file), and these sequenced fragments are then aligned to the corresponding reference genome. Considering the difference between the whole exon sequencing and the whole transcription database building method, BWA software is adopted to perform genome alignment of whole exon data, STAR software is adopted to perform genome alignment of whole transcription data:
important parameters for using BWA are: 1) the k parameter is set to 19, i.e. the shortest length of the seed sequence is 19; 2) setting the w parameter as 100, and if the length of the comparison gap is more than 100, neglecting; 3) (ii) a 4) The r parameter is set to 1.5 and if the seed length exceeds k r, i.e. 19 x 1.5-28.5 bp, a new seed sequence is found inside the seed sequence. The parameter is a key parameter for regulating the performance of the algorithm by a heuristic algorithm, if the value is set to be too large, the comparison speed can be higher, but the accuracy of a part can be sacrificed; 5) c, setting a parameter to be 500, and if the repetition times of the seed sequence exceed 500, omitting the seed sequence; 6) chimSegmentMin is set to be 0, only correctly aligned sequencing fragments are output, and chimeric fragments are not output;
using STAR parameters of 1) setting a parameter genoMeDir to a directory path to a storage reference genome file, 2) setting a parameter readFilesIn to a path of an input fragment file, 3) setting a parameter twopssMode to a Basic (Basic) mode, a channel mapping mode, i.e. using a two-channel alignment mode, 4) setting a parameter outTreadUnmapNone to None, i.e. not outputting fragments that fail or partially succeed in alignment, 5) setting a parameter chinensis Min to 12, i.e. having a minimum length of a fragment of 12bp, 6) setting a parameter chimJUNCTIONOverhangMin to 12, i.e. having a minimum allowable overhang length of 12bp AT a mosaic binding site, 7) setting a parameter alignDBoverMin to 10, i.e. having a minimum operational overhang of 10bp AT a binding site with a releasable binding site, i.e. having a maximum allowable overhang 10bp AT a prefix length of a fragment of 10bp AT a prefix length of 10bp, 8) setting a parameter alignMasemat 0 to a parameter for a matching between a prefix No more than 10 and No more than 10 × 1 × 10 × 7) setting a parameter for a prefix match with a prefix match between a prefix of a prefix hit map file, setting a prefix match parameter MaseagePatchSeedMax to a prefix map file, setting a prefix length of a prefix map file, setting a prefix map to a prefix map file, setting a prefix length of a prefix map to a prefix map file, and a prefix length of a prefix map file, setting a prefix length of a prefix map to a prefix map file, and a prefix length of a prefix map for a prefix map file, setting a prefix length of a prefix map to generate a prefix map for a prefix map file, or a prefix map for a match, setting a prefix map to generate a no more than a prefix map for a prefix map file, and a prefix map for a match, and a match, setting a prefix map for a no more than a match, setting a prefix length of a prefix map for a match between a match, or a prefix map for a match between a prefix map for a prefix map file, a prefix map of a prefix map file, or for a match, or.
Through the steps, a sequencing sequence file, namely a BAM file, which is a DNA alignment result and an RNA alignment result, aligned to the selected reference genome is obtained.
And 2, preprocessing the DNA comparison result and the RNA comparison result.
The method specifically comprises the following steps: after sequencing fragments are aligned to a reference genome by using BWA or STAR, the alignment results (including DNA alignment results and RNA alignment results) need to be subjected to tasks such as deduplication, group information addition, and base mass recalculation. These steps can be performed using GATK.
The parameters for sequencing alignment using GATK to remove duplicates for a fragment are 1) setting the PG parameter to MarkDuplicates to enable the program record ID. that the PG record creates to have potentially suffixes appended to this string to avoid collisions with other program record IDs, 2) setting the PG _ NAME parameter to MarkDuplicates, i.e., the value of the PN flag that the PG record was created, 3) setting MAX _ FI L E _ HAND L ES to 8000, i.e., the maximum number of file handles that remain open when the end of the fragment spills to disk is 8000, 4) setting the parameter SORTING _ CO LL choice _ SIZE _ RATIO to 0.25, i.e., determining that certain sort sets use memory footprint of 0.25 times the total memory capacity, 5) setting the parameter sottiptica L DUP L PIXE L DISTANCE disnce to 100, i.e., setting the maximum offset between two duplicate clusters to 100.
The parameters for sequencing using GATK versus addition of fragment group information were: 1) the parameter ID is set to 1, i.e. the reading group ID default is set to 1; 2) the parameter version is set to false, i.e. the version number of the software is not displayed.
The parameters for mass recalculation of bases in the sequenced fragments using GATK were: 1) setting the parameter R to hg19 reference genome fasta file; 2) the knock-sites parameter was set to 1000G _ phase1. indexes. hg19.sites. vcf. gz, Mills _ and _1000G _ gold _ standard. indexes. hg19.sites. vcf. gz and dbsnp150, and the quality of the bases in the BAM file was re-evaluated using these known variations as references.
Through the steps, a quality-controlled BAM file is obtained, and the file is used for the subsequent steps 3A, 3E, 3G and 4A.
And 3, analyzing the somatic variation, the clone type, the tumor purity, the gene expression quantity in the tumor cells and the expression abundance of somatic variation alleles by utilizing the DNA comparison result and the RNA comparison result to obtain an analysis result.
The method comprises the following specific steps:
and 3A, analyzing somatic mutation sites in the tumor cells by using a DNA comparison result.
Processing the BAM file in the step 2 to obtain an alignment result BAM file of the tumor sample and the normal control sample on the genome, and then performing comparative analysis on the BAM file of the normal-tumor paired sample through mutect2 software to find out all potential mutation sites.
The parameters using mutect2 were: 1) setting the parameter base-quality-score-threshold to 18, i.e. setting the matric mass fraction threshold to 18, bases below this threshold will drop to a minimum value of 6; 2) setting the parameter heterozygosity to 0.001, i.e. setting the prior likelihood value of heterozygosity of any locus to 0.001; 3) setting indel-heterozygosity to 1.25E-4, i.e., setting the prior likelihood value of heterozygosity at any locus to 1.25E-4; 4) setting parameter mbq to 10, i.e., the lowest quality of qualified bases allowed when searching for mutations is 10; 5) setting the parameter-pair-hmm-threads as 4, namely the thread number used by the PairHMM of the machine is 4; 6) setting the parameter minimum-mapping-quality to 4, i.e. fragments with an alignment quality below 20 are discarded; 7) the reference min-read-length is set to 30, that is, only segments with the length greater than or equal to 30 are reserved.
Through the steps, a vcf file containing information of somatic mutation sites is obtained, and the file is used in the subsequent steps 3B and 3C.
And 3B, analyzing according to the comparison result of the somatic mutation site and the RNA to obtain the expression abundance of the somatic mutation allele and the gene expression quantity in the tumor cells.
The method specifically comprises the following steps: the vcf file storing the information of the somatic mutation sites is converted into the bed file of the site coordinates by using vcf2bed software, and the information of the sites in the file is analyzed by using a Haplotpypercaler of GATK.
The method comprises the steps of using parameters of a Haplotpypercaler to provide a bed file generated by vcf2bed for 1-L, using GRC37/hg19 for reference of a genome for 2) -reference, selecting a GVCF format by ERC to output a GVCF file, and selecting EMIT _ A LL _ SITES by using output-mode to output all SITES in a target region whether the mutation is judged or not.
Next, the gvcf was genotyped using the genotypeGVCFs command of GATK, and the number of wild type bases and the number of mutant bases at the RNA level of the somatic mutation site were obtained.
Through the steps, the expression abundance of the somatic mutation allele and the gene expression amount in the tumor cells are obtained and are used in the step 3C and the step 5.
And 3C, filtering false positive sites in the somatic mutation sites through the expression abundance of the somatic mutation alleles to obtain actual somatic mutation sites.
The method specifically comprises the following steps: after step 3A, we can find out the somatic mutation sites of the tumor cells different from the normal cells, but many false positive sites caused by sequencing technology and the like are doped in the somatic mutation sites, and then in the step, the false positive sites are filtered by using Filter MutectCalls software, so that the variant sites with RNA expression are further screened by combining the expression abundance of the cellular variant allele.
The parameters of the FilterMutectCals software are 1) the parameter base-quality-score-threshold is set to 18, i.e., the base quality below the threshold 18 will be reduced to the minimum value of 6, 2) the parameter heterozygosity is set to 0.001, i.e., the prior likelihood of any locus heterozygosity is set to 0.001, 3) the parameter heterozygosity-STDEV is set to 0.01, i.e., the standard deviation value of heterozygosity of SNP and indel is set to 0.01, 4) the parameter index-heterozygosity is set to 1.25E-4, i.e., the prior likelihood of any locus is set to 1.25E-4, 5) the parameter max-algorithm-count is set to 1, i.e., the mutant site with multiple alleles deleted, 6) the parameter threshold-relationship-score-probability is set to 0.1, i.e., the maximum allele-count is set to 1, i.e., the mutation-probability of 0.7, the parameter threshold is set to 0.7.10), the parameter threshold is set to 0.7.e., the median-average-score-10, the parameter is set to 0.7, the parameter of the median-10-mutation-10, and the parameter is set to 0.e., the parameter of the median-10-median-probability of mutation-10-mutation is set to 0.7-10, the median-10-mutation-10-mutation-10-7-10-7-10-median-mutation-7-10-point of the mutation-7-10-7-10-7-10-mutation-7-mutation-10-mutation-10-7-mutation-7-3) the mutation-3) the parameter.
Through the step, the filtered somatic mutation vcf file is obtained and is used in the subsequent step 3D.
And 3D, annotating the actual somatic mutation site and further filtering to obtain the somatic mutation.
The method specifically comprises the following steps: the function of the actual somatic mutation sites was annotated using the ANNOVAR software, which was used with the parameters: 1) setting the parameter buildver as hg19, namely using a database constructed by using the gene of hg19 as a reference; 2) the parameter maxgenethread is set to 6. The maximum number of threads allowed in the assigned gene annotation was 6; 3) the reference refGene, avsnp150, clinvar, cosmic,1000gEAS, EXAC, gnomad, icgc, dbnsfp33 was used for the annotation. The mutation sites were further filtered according to the annotation results, leaving only non-synonymous mutation sites in the exon regions.
Through the step, a vcf file containing non-synonymous mutation site annotation information of the exon regions, namely somatic mutation, is obtained and is used in the subsequent steps 3F and 4B.
And 3E, calculating a copy number variation file of the tumor.
The method specifically comprises the following steps: to estimate the purity of the tumor, a copy number variation file (seg file) of the tumor needs to be prepared. To calculate the somatic copy number variation of tumors, tumor and paracarcinoma bam files were analyzed using the cnvkit software with specific parameters: 1) setting the parameter method to hybrid, i.e. the sequencing protocol to hybrid capture; 2) the parameter process is set to process reach BAM in serial, i.e., each BAM is processed in parallel.
Through this step, seg files containing tumor copy number variation were obtained for the subsequent step 3F.
And 3F, calculating the purity and the clone structure of the tumor according to the somatic mutation and the copy number mutation file of the tumor.
The step can specifically comprise the following 3 parts:
I. get MAF File
In order to estimate the purity of a tumor, not only a copy number variation file (seg file) of the tumor but also a MAF file containing non-synonymous mutation sites of exon regions need to be prepared, and after obtaining a vcf file (i.e., the vcf file obtained in step 3D) containing annotation information of the non-synonymous mutation sites of exon regions, the vcf file is used for estimating the purity of the tumor, firstly, vcf2MAF software needs to be used to convert the vcf file into the MAF file, and specific use parameters of the vcf2MAF software are as follows: 1) setting a parameter VEP-forks to be 4, namely, the number of parallel travel processes used when the VEP is operated is 4; 2) the parameter NCBI-build was set to GRCh37, i.e., using NCBI's GRCh37 as a reference, to assemble MAFs at the mutated sites.
Through this section, a MAF file containing non-synonymous mutation sites in exon regions was obtained for subsequent sections II, III.
II. Calculating tumor purity
In the above section I and step 3E, a MAF file containing non-synonymous mutation sites of exon regions and a copy number variation file (seg file) of a tumor are obtained, respectively, in which section the calculation of tumor purity is performed using ABSO L UTE software using these two files as input files, the relevant parameters of ABSO L UTE software are 1) setting the parameter sigma p to 0, i.e. the temporary value beyond the sample level variance for pattern search is set to 0, 2) setting the parameter max sigma h to 0.015, the maximum value beyond the sample level variance is set to 0.015, 3) setting the parameter min ploidy to 0.95N, specifying the minimum ploidy value N to be considered by the algorithm to 0.95N, 4) setting the parameter max ploidy to 10N, specifying the maximum ploidy value N to be considered to 10N, and discarding models of possibly larger ploidy values, 5) setting the parameter copy distance diseasea to the corresponding sample name 'to be input of a universal mutation site' to 10N, and the expected genomic mismatch value of the highest ploidy value of the genomic mismatch of the sample is set to 0.005N, and if the expected genomic mismatch of the non-homologous mutation point is less than 0.005, the highest nominal score of the relevant parameters of the genomic segment No. 0, 1. the genomic segment, the genomic segment can be discarded, and if the parameter No. 1. the relevant parameters are set to be considered to be less than the relevant parameters, the relevant parameters of the genomic segment can be discarded, or more than the relevant parameters of the relevant.
From this fraction, the purity values of the tumor tissue were obtained for the subsequent fraction III.
III, analysis of tumor clone Structure
The clone status of the tumor can be analyzed according to the MAF file containing the non-synonymous mutation sites of the exon regions and the copy number variation file (seg file) of the tumor while calculating the tumor purity, and the clone status of the tumor can be analyzed by using the MAF file containing the non-synonymous mutation sites of the exon regions, the copy number variation file (seg file) of the tumor and the purity value of the tumor tissue as input files and using pyClone software, wherein the pyClone software uses specific parameters of 1) using run _ analysis _ pipeline sub-command, 2) using files obtained according to MAF arrangement as the in _ files parameter, 3) using total _ copy _ number as the primer parameter, and 4) using ABSO L UTE as the tumor purity value calculated by using TUMOUR _ contents.
Through this section, a cloning ratio file of all mutation sites, i.e., a tumor cloning structure, is obtained for the subsequent step 5.
And 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
The method specifically comprises the following steps: the results of the RNA-seq alignment (contained in the quality-controlled BAM file obtained in step 2) were analyzed using featureCounts software to obtain a gene expression value FPKM (counts values), and the results were further converted into TPM values.
The parameters using the featurepoints software were: 1) the parameter minOverlap is set to 1. Fragment assignment the minimum number of overlapping bases of the used fragments is 1; 2) the parameter T is set to 8, i.e. run in parallel using 8 thread numbers.
The conversion formula is as follows:
Figure BDA0002406292640000151
wherein R is the count number of the gene, L is the length of the gene,
Figure BDA0002406292640000152
Rknumber of counts for Gene k, LkFor the length of gene k, n is the total number of genes.
The file containing all the gene expression levels, i.e., the gene expression levels, obtained by this step is used in step 5.
And 4, predicting the genotype of the MHC I molecules of the tumor cells of the patient, analyzing the affinity of the potential new antigen and calculating the presenting effect of the polypeptide according to the DNA analysis result and the genotype of the MHC I molecules of the tumor cells of the patient.
The method comprises the following specific steps:
and 4A, predicting the genotype of MHC class I molecules of tumor cells of the patient.
The H L A allele is presumed by using polysover software, and relevant parameters are as follows, 1) the BAM parameter is set as the BAM file obtained in the step 2, 2) the race parameter is set as Asia, namely the race of a patient is set as Asian (European or African and the like as occasion demands), 3) the includeFreq parameter is set as 1, namely the allele frequency of the group level is used as a priori condition, 4) the build parameter is set as hg19, namely the comparison of sequencing data and a gene group is GRCh37/hg19, 5) the format parameter is set as STDFQ, namely the sequencing data adopts Hiseq, the returned sequencing data is a standard fastq file, and 6) the insertCalc parameter is set as 1, namely the size distribution model is inserted empirically in the model.
Through this step, the typing information of H L A in the tumor, i.e., the genotype of MHC class I molecules of the patient's tumor cells, was obtained for the subsequent step 4C.
And 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation site.
The method specifically comprises the following steps: and (4) screening the somatic mutation obtained in the step (3D), and reserving non-synonymous mutation. Meanwhile, a human genome protein sequence fasta file provided by a UCSC website is downloaded, 15 amino acid sequences before and after a variant amino acid are extracted according to a variant position in an annotation file of somatic variation, and flanking amino acid sequences of a wild type and a mutant type are respectively obtained. All sequences were combined into a fasta file of polypeptide sequences.
Through this step, a protein fasta file containing polypeptide information of somatic mutation sites, i.e., non-synonymous mutein data, was obtained for the subsequent steps 4C and 4E.
And 4C, predicting the ability of the MHC class I molecules to combine with wild polypeptides and variant polypeptides according to the non-synonymous mutein data to obtain affinity prediction.
Specifically, the ability of MHC class I molecules to bind polypeptides was predicted using MHC _ I software available on the IEDB website with the relevant parameters of 1) the method parameter set to IEDB _ recommended, i.e., using the IEDB recommended settings (including ann, netmhpan, pickpocket, etc.), 2) the MHC parameter set to H L A typing results from polysolver, 3) the peptide _ length set to the length of potential new antigen polypeptides, which we typically set to 8, 9, 10, 11, 12,13,14, 4) the input _ file parameter to the fastq file obtained from step 13.
Through this step, the affinity of all potential polypeptides to each H L a molecule was obtained, i.e. an affinity prediction, for the subsequent step 4E.
And 4D, analyzing the polypeptide sequence to obtain the dissociation efficiency, the transport efficiency and the MHC binding efficiency of the polypeptide, and calculating the presentation efficiency of the polypeptide according to the obtained dissociation efficiency, transport efficiency and MHC binding efficiency of the polypeptide.
The method specifically comprises the steps of analyzing a polypeptide sequence by using netCT L pan software to obtain a polypeptide dissociation efficiency (proteolysome differentiation), a transport efficiency (transport efficiency), an MHC binding efficiency and calculating a presentation efficiency according to the obtained result, wherein parameters of the netCT L pan software are 1-a H L A allele, setting H L A allele combined with the polypeptide sequence, wherein the H L A allele contains method genes such as ann, netmhcpan, pickpocket and the like, a plurality of.2) -s can be set to be output according to a certain sequence, the sequence is set to be 0, namely the length of a peptide segment is output according to the comprehensive result (presentation efficiency), 3) -l, and the length which needs to be calculated to be too short.
Through this step, the efficacy value of each polypeptide presented by H L a molecule, i.e. the polypeptide presentation efficacy, was obtained for the subsequent step 5.
And 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
The main criteria are that 1) variant amino acids are positioned in the predicted polypeptide, 2) if the wild type sequence and the non-synonymous mutant sequence are not simultaneously present in the predicted result, then removing 3) relevant information of wild type and non-synonymous mutation, including H L A type, at the original starting and ending position, length, sequence of the predicted polypeptide, predicted binding polypeptide affinity value (IC50) and arranging the wild type and non-synonymous mutation information into a row.
Through this step, the binding affinity of the wild-type/mutant effective polypeptide (containing the amino acid altered due to somatic mutation) and H L a, i.e., the affinity of the potential neoantigen, was obtained for the subsequent step 5.
And 5, scoring and sequencing each corresponding new antigen according to the analysis result, the affinity of the potential new antigen and the presenting effect of the polypeptide, and presenting.
The method specifically comprises the following steps: integrating the prediction result of the binding polypeptide affinity, mutation frequency, tumor purity, polypeptide expression quantity and the like, and comprehensively scoring the prediction result according to related rules, wherein the main indexes are the affinity value (IC50), the polypeptide expression level and the polypeptide length of wild type and mutant type polypeptide sequences, and sequencing potential new antigens. The scoring rules are as follows:
score=abundance*dissimilarity*clonality
wherein the content of the first and second substances,
abundance=L(ICm)*Freq*tanh(TPM);
dissimilarity=(1–L(ICw/50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+ex) Tanh () is hyperbolic tangent function calculation, score is score obtained by new antigen calculation, abundance is abundance of mutant polypeptide, TPM is gene expression abundance, Dissimilarity is Dissimilarity (Dissimilarity) which measures the difference of affinity between mutant peptide segment and corresponding normal peptide segment, clonality (clonality) measures the efficiency NCT L of new antigen from mutation to short peptide and the proportion MC of new antigen distribution in all tumor cells, Freq is somatic cell variation allele expression abundance ratio which is the ratio of expression abundance of somatic cell variation allele to gene expression in tumor cells, IC is ICmFor mutant polypeptide affinities, IC, obtained during affinity predictionwFor the wild-type polypeptide binding affinity obtained at the time of affinity prediction, NCT L was presentation efficacy and MC was mutant gene cloning ratio.
In this example, the method further comprises the following steps:
and 6, optimizing the analysis method of the affinity of the potential new antigen in the step 5 by using a machine learning method based on the experimental result.
The method can specifically comprise the following three specific steps:
step 6A, data collection
A comprehensive data set of the most recent MHC binding peptidases was collected from IEDB (http:// www.iedb.org/database _ export _ v3.php) and the own experimental MHC binding peptide data was added.
Step 6B, data arrangement
The data collected were collated and the measurement of the affinity value for the bound peptide fragment (IC50) greater than 50000nM was taken as 50000 nM. The affinity values of the resulting binding peptides were converted into t ═ 1-log (aff)/log (50000). If data of the repeated sequences exist, the average value of the binding peptide fragment affinity is taken as the final data merging result, and the final data merging result is added to the training data.
Step 6C, data set training
And (3) training the collected and sorted results in the step 6B by using different methods respectively, wherein the different methods comprise pickpocket, netMHCpan, netMHCcs, netMHC and the like. And updating mhc _ i original database files of the software by using the model files obtained by the new training, so as to improve the accuracy of prediction.

Claims (15)

1. The prediction method of the clinical individualized tumor neoantigen comprises the following steps:
step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result;
step 2, preprocessing a DNA comparison result and an RNA comparison result;
step 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation allele by using the DNA comparison result and the RNA comparison result to obtain an analysis result;
step 4, predicting the genotype of MHC class I molecules of the tumor cells of the patient, analyzing the affinity of potential new antigens and calculating the presenting effect of the polypeptide according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient;
and 5, scoring and sequencing each corresponding new antigen according to the analysis result, the affinity of the potential new antigen and the presentation efficacy of the polypeptide, and presenting.
2. The method for predicting clinical individualized tumor neoantigen according to claim 1, wherein in step 1, the sequencing data of the tumor cells of the patient are compared with a preset corresponding reference genome, and a Burrows-Wheeler transformation algorithm is adopted to obtain a DNA comparison result; the RNA alignment was obtained using the STAR algorithm.
3. The method for predicting clinical individualized tumor neoantigen according to claim 1, further comprising, before step 1, the steps of:
and 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
4. The method of claim 3, wherein the pre-treatment in step 0 comprises trimming sequencing adaptor sequences, trimming sequencing tag sequences and removing sequences with poor sequencing quality, and the method is performed by using Trimmomatic software.
5. The method for predicting clinically individualized tumor neoantigen according to claim 1, wherein in step 2, the pretreatment comprises deduplication, group information addition and base mass recalculation, and is performed by using GATK software.
6. The method for predicting clinical individualized tumor neoantigen according to claim 1, wherein the step 3 comprises the steps of:
step 3A, analyzing somatic mutation sites in tumor cells by using a DNA comparison result;
step 3B, analyzing according to the comparison result of the somatic mutation site and the RNA to obtain the expression abundance of the somatic mutation allele and the gene expression quantity in the tumor cells;
step 3C, filtering false positive sites in the somatic mutation sites through the expression abundance of the somatic mutation alleles to obtain actual somatic mutation sites;
step 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation;
step 3E, calculating a copy number variation file of the tumor;
step 3F, calculating the purity and the clone structure of the tumor according to the somatic mutation and the copy number mutation file of the tumor;
and 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
7. The method for predicting clinical individualized tumor neoantigen according to claim 6, wherein in step 3D, the somatic mutation is: further filtering was performed against the annotated actual somatic mutation sites, leaving only non-synonymous somatic mutation sites in the exon regions.
8. The method of claim 6, wherein the step 3A is performed by Mutect2 software, the step 3B is performed by Haplotpypercaler software of GATK, the step 3C is performed by FilterMutectCalls software, the step 3D is performed by ANNOVAR software, the step 3E is performed by cnvkit software, and the step 3F is performed by ABSOU L TE software, and the clone structure is calculated by pyClone software.
9. The method of claim 6, wherein the step 3G, the analysis of the abundance of gene expression comprises analyzing the abundance of gene expression by using featurepopulations software to obtain counts, and converting the counts into transcript sequencing fragments per million sequencing fragments.
10. The method of claim 9, wherein the obtained counts values are converted into transcript sequencing fragments per million sequencing fragments by the following formula:
Figure FDA0002406292630000021
wherein R is the count number of the gene, L is that of the geneThe length of the first and second support members,
Figure FDA0002406292630000022
Rknumber of counts for Gene k, LkFor the length of gene k, n is the total number of genes.
11. The method for predicting clinical individualized tumor neoantigen according to claim 1, wherein the step 4 comprises the steps of:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient;
step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation site;
step 4C, predicting the capacity of the MHC class I molecules for combining wild type polypeptide and variant polypeptide according to the non-synonymous mutant protein data to obtain affinity prediction;
step 4D, analyzing the polypeptide sequence to obtain the dissociation efficiency, the transport efficiency and the MHC combination efficiency of the polypeptide, and calculating the presentation efficiency of the polypeptide according to the obtained dissociation efficiency, transport efficiency and MHC combination efficiency of the polypeptide;
and 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
12. The method of claim 11, wherein the step 4A is performed by using polsor software, the step 4C is performed by using MHC class I epitopes predictors provided by IEDB, and the step 4D is performed by using netCT L pan software.
13. The method for predicting clinical individualized tumor neoantigen according to claim 11, wherein the step 4B is specifically: downloading a human genome protein sequence fasta file provided by a UCSC website, extracting sequences of 15 amino acids before and after a variant amino acid according to an annotated actual somatic mutation site, respectively obtaining flanking amino acid sequences of a wild type protein and a mutant protein, and combining all the sequences into a fasta file of a polypeptide sequence, namely non-synonymous mutant protein data.
14. The method of predicting clinically individualized tumor neoantigen according to claim 1,
in step 5, the method for scoring each corresponding neoantigen according to the analysis result, the affinity of the potential neoantigen and the polypeptide presentation efficacy comprises:
score=abundance*dissimilarity*clonality
wherein the content of the first and second substances,
abundance=L(ICm)*Freq*tanh(TPM);
dissimilarity=(1–L(ICw/50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+ex) Tanh () is hyperbolic tangent function calculation, score is fraction obtained by new antigen calculation, abundance is abundance of mutant polypeptide, TPM is gene expression abundance, Dissimilarity is Dissimilarity to measure difference of affinity between mutant peptide segment and corresponding normal peptide segment, clonality is efficiency NCT L of new antigen from mutation to short peptide and proportion MC of new antigen distribution in all tumor cells, Freq is proportion of expression abundance of somatic cell variant allele, the proportion of expression abundance of somatic cell variant allele is ratio of expression abundance of somatic cell variant allele to gene expression in tumor cells, IC ismFor mutant polypeptide affinities, IC, obtained during affinity predictionwFor wild-type polypeptide binding affinity obtained at the time of affinity prediction, NCT L is presentation potency and MC is mutant gene cloning ratio.
15. The method for predicting clinical individualized tumor neoantigens according to any one of claims 1 to 14, further comprising the steps of:
and 6, optimizing the analysis method of the affinity of the potential new antigen in the step 5 by using a machine learning method based on the experimental result.
CN202010162494.9A 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen Active CN111415707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010162494.9A CN111415707B (en) 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010162494.9A CN111415707B (en) 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen

Publications (2)

Publication Number Publication Date
CN111415707A true CN111415707A (en) 2020-07-14
CN111415707B CN111415707B (en) 2023-04-25

Family

ID=71492918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010162494.9A Active CN111415707B (en) 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen

Country Status (1)

Country Link
CN (1) CN111415707B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951887A (en) * 2020-07-27 2020-11-17 深圳市新合生物医疗科技有限公司 Leukocyte antigen and polypeptide binding affinity prediction method based on deep learning
CN112466396A (en) * 2020-12-04 2021-03-09 中山大学附属第一医院 Screening method of tumor high-affinity new antigen and application of tumor high-affinity new antigen in indication of treatment prognosis curative effect of PD-1 of liver cancer patient
CN114446389A (en) * 2022-02-08 2022-05-06 上海科技大学 Tumor neoantigen characteristic analysis and immunogenicity prediction tool and application thereof
CN115424740A (en) * 2022-09-30 2022-12-02 四川大学华西医院 Tumor immunotherapy effect prediction system based on NGS and deep learning
CN116825188A (en) * 2023-06-25 2023-09-29 北京泛生子基因科技有限公司 Method, device and computer readable storage medium for identifying tumor neoantigen at multiple groups of chemical layers based on high-throughput sequencing technology
CN116959579A (en) * 2023-09-19 2023-10-27 北京求臻医学检验实验室有限公司 System for reducing errors of second generation sequencing system

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120277999A1 (en) * 2010-10-29 2012-11-01 Pbd Biodiagnostics, Llc Methods, kits and arrays for screening for, predicting and identifying donors for hematopoietic cell transplantation, and predicting risk of hematopoietic cell transplant (hct) to induce graft vs. host disease (gvhd)
DK1897548T3 (en) * 2003-02-28 2013-11-18 Univ Johns Hopkins T-cell regulation
CN105524984A (en) * 2014-09-30 2016-04-27 深圳华大基因科技有限公司 Method and equipment for neoantigen epitope prediction
US20160320406A1 (en) * 2013-12-17 2016-11-03 The Brigham And Women's Hospital, Inc. Detection of an antibody against a pathogen
CN106103711A (en) * 2013-11-21 2016-11-09 组库创世纪株式会社 System and the application in treatment and diagnosis thereof are analyzed in φt cell receptor and B-cell receptor storehouse
CN107704727A (en) * 2017-11-03 2018-02-16 杭州风起智能科技有限公司 Neoantigen Activity Prediction and sort method based on tumour neoantigen characteristic value
US20180100201A1 (en) * 2015-06-29 2018-04-12 The Broad Institute Inc. Tumor and microenvironment gene expression, compositions of matter and methods of use thereof
CN108491689A (en) * 2018-02-01 2018-09-04 杭州纽安津生物科技有限公司 Tumour neoantigen identification method based on transcript profile
CN108796055A (en) * 2018-06-12 2018-11-13 深圳裕策生物科技有限公司 Tumor neogenetic antigen detection method, device and storage medium based on the sequencing of two generations
CN109125740A (en) * 2017-06-28 2019-01-04 四川大学 A kind of novel tumor vaccine and application thereof
CN109706065A (en) * 2018-12-29 2019-05-03 深圳裕策生物科技有限公司 Tumor neogenetic antigen load detection device and storage medium
US20190279742A1 (en) * 2017-10-10 2019-09-12 Gritstone Oncology, Inc. Neoantigen identification using hotspots
US20190346442A1 (en) * 2016-04-18 2019-11-14 The Broad Institute, Inc. Improved hla epitope prediction
CN110600077A (en) * 2019-08-29 2019-12-20 北京优迅医学检验实验室有限公司 Prediction method of tumor neoantigen and application thereof
CN110706742A (en) * 2019-09-30 2020-01-17 中生康元生物科技(北京)有限公司 Pan-cancer tumor neoantigen high-throughput prediction method and application thereof
US20200368336A1 (en) * 2017-12-01 2020-11-26 Shanghai Jenomed Biotech Co., Ltd. Method for preparing personalized cancer vaccine

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK1897548T3 (en) * 2003-02-28 2013-11-18 Univ Johns Hopkins T-cell regulation
US20120277999A1 (en) * 2010-10-29 2012-11-01 Pbd Biodiagnostics, Llc Methods, kits and arrays for screening for, predicting and identifying donors for hematopoietic cell transplantation, and predicting risk of hematopoietic cell transplant (hct) to induce graft vs. host disease (gvhd)
CN106103711A (en) * 2013-11-21 2016-11-09 组库创世纪株式会社 System and the application in treatment and diagnosis thereof are analyzed in φt cell receptor and B-cell receptor storehouse
US20160320406A1 (en) * 2013-12-17 2016-11-03 The Brigham And Women's Hospital, Inc. Detection of an antibody against a pathogen
CN105524984A (en) * 2014-09-30 2016-04-27 深圳华大基因科技有限公司 Method and equipment for neoantigen epitope prediction
US20180100201A1 (en) * 2015-06-29 2018-04-12 The Broad Institute Inc. Tumor and microenvironment gene expression, compositions of matter and methods of use thereof
US20190346442A1 (en) * 2016-04-18 2019-11-14 The Broad Institute, Inc. Improved hla epitope prediction
CN109125740A (en) * 2017-06-28 2019-01-04 四川大学 A kind of novel tumor vaccine and application thereof
US20190279742A1 (en) * 2017-10-10 2019-09-12 Gritstone Oncology, Inc. Neoantigen identification using hotspots
CN107704727A (en) * 2017-11-03 2018-02-16 杭州风起智能科技有限公司 Neoantigen Activity Prediction and sort method based on tumour neoantigen characteristic value
US20200368336A1 (en) * 2017-12-01 2020-11-26 Shanghai Jenomed Biotech Co., Ltd. Method for preparing personalized cancer vaccine
CN108491689A (en) * 2018-02-01 2018-09-04 杭州纽安津生物科技有限公司 Tumour neoantigen identification method based on transcript profile
CN108796055A (en) * 2018-06-12 2018-11-13 深圳裕策生物科技有限公司 Tumor neogenetic antigen detection method, device and storage medium based on the sequencing of two generations
CN109706065A (en) * 2018-12-29 2019-05-03 深圳裕策生物科技有限公司 Tumor neogenetic antigen load detection device and storage medium
CN110600077A (en) * 2019-08-29 2019-12-20 北京优迅医学检验实验室有限公司 Prediction method of tumor neoantigen and application thereof
CN110706742A (en) * 2019-09-30 2020-01-17 中生康元生物科技(北京)有限公司 Pan-cancer tumor neoantigen high-throughput prediction method and application thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BRUNO ALVAREZ: "NNAlign_MA; MHC Peptidome Deconvolution for Accurate MHC Binding Motif Characterization and Improved T-cell Epitope Predictions", 《MOLECULAR AND CELLULAR PROTEMOMICS》 *
YUWEI CHENZHANG QING: "Identification of the impact on T- and B- cell epitopes of Human", 《IMMUNOLOGY LETTERS》 *
唐龙清: "肿瘤精准免疫治疗的临床实践", 《中国医药生物技术》 *
王广志等: "个性化肿瘤新抗原疫苗中抗原肽预测研究进展", 《生物化学与生物物理进展》 *
黄小兰,董海龙,隋延仿: "肝癌MAGE-n抗原HLA-A2限制性细胞毒性T细胞表位的预测" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951887A (en) * 2020-07-27 2020-11-17 深圳市新合生物医疗科技有限公司 Leukocyte antigen and polypeptide binding affinity prediction method based on deep learning
CN112466396A (en) * 2020-12-04 2021-03-09 中山大学附属第一医院 Screening method of tumor high-affinity new antigen and application of tumor high-affinity new antigen in indication of treatment prognosis curative effect of PD-1 of liver cancer patient
CN114446389A (en) * 2022-02-08 2022-05-06 上海科技大学 Tumor neoantigen characteristic analysis and immunogenicity prediction tool and application thereof
CN114446389B (en) * 2022-02-08 2024-05-14 上海科技大学 Tumor neoantigen feature analysis and immunogenicity prediction tool and application thereof
CN115424740A (en) * 2022-09-30 2022-12-02 四川大学华西医院 Tumor immunotherapy effect prediction system based on NGS and deep learning
CN115424740B (en) * 2022-09-30 2023-11-17 四川大学华西医院 Tumor immunotherapy effect prediction system based on NGS and deep learning
CN116825188A (en) * 2023-06-25 2023-09-29 北京泛生子基因科技有限公司 Method, device and computer readable storage medium for identifying tumor neoantigen at multiple groups of chemical layers based on high-throughput sequencing technology
CN116825188B (en) * 2023-06-25 2024-04-09 北京泛生子基因科技有限公司 Method, device and computer readable storage medium for identifying tumor neoantigen at multiple groups of chemical layers based on high-throughput sequencing technology
CN116959579A (en) * 2023-09-19 2023-10-27 北京求臻医学检验实验室有限公司 System for reducing errors of second generation sequencing system
CN116959579B (en) * 2023-09-19 2023-12-22 北京求臻医学检验实验室有限公司 System for reducing errors of second generation sequencing system

Also Published As

Publication number Publication date
CN111415707B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN111415707A (en) Prediction method of clinical individualized tumor neoantigen
JP7217711B2 (en) Identification, production and use of neoantigens
CN109801678B (en) Tumor antigen prediction method based on complete transcriptome and application thereof
CN108796055B (en) Method, device and storage medium for detecting tumor neoantigen based on second-generation sequencing
CN108388773B (en) A kind of identification method of tumor neogenetic antigen
TWI765875B (en) Neoantigen identification, manufacture, and use
CN110600077B (en) Prediction method of tumor neoantigen and application thereof
CN110277135B (en) Method and system for selecting individualized tumor neoantigen based on expected curative effect
Borden et al. Cancer neoantigens: challenges and future directions for prediction, prioritization, and validation
CN110621785B (en) Method and device for haplotyping diploid genome based on three-generation capture sequencing
US20240203522A1 (en) Method and computer program for predicting neoantigen by using peptide sequence and hla allele sequence
CN111755067A (en) Screening method of tumor neoantigen
CN110799196A (en) System for ranking immunogenic cancer-specific epitopes
CN114446389B (en) Tumor neoantigen feature analysis and immunogenicity prediction tool and application thereof
Olsen et al. Bioinformatics for cancer immunotherapy target discovery
US20200176076A1 (en) Scansoft: a method for the detection of genomic deletions and duplications in massive parallel sequencing data
CN112210596B (en) Tumor neoantigen prediction method based on gene fusion event and application thereof
CN115240773B (en) New antigen identification method and device, equipment and medium of tumor specific circular RNA
CN114882951A (en) Method and device for detecting MHC II tumor neoantigen based on next generation sequencing data
CN110349627B (en) Design method of polypeptide vaccine sequence and its automatic design product
CN114333998A (en) Tumor neoantigen prediction method and system based on deep learning model
Gerasimov Analysis of ngs data from immune response and viral samples
Esim et al. Determination of malignant melanoma by analysis of variation values
CN113316818B (en) Method for identifying neoantigen
CN111599410B (en) Method for extracting microsatellite unstable immunotherapy new antigen by integrating multiple sets of chemical data and application

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant