CN116994654A - Method, apparatus and storage medium for identifying MHC-I/HLA-I binding and TCR recognition peptides - Google Patents

Publication number
CN116994654A
Authority
CN
China
Prior art keywords
sequence
hla
mhc
protein
peptide fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311261520.3A
Other languages
Chinese (zh)
Other versions
CN116994654B (en)
Inventor
陈立
贾明明
刘三阳
李情
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Likang Life Technology Co ltd
Original Assignee
Beijing Likang Life Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Likang Life Technology Co ltd filed Critical Beijing Likang Life Technology Co ltd
Priority to CN202311261520.3A priority Critical patent/CN116994654B/en
Publication of CN116994654A publication Critical patent/CN116994654A/en
Publication of CN116994654B publication Critical patent/CN116994654B/en

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biotechnology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The application discloses a method, a device and a storage medium for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR. The method comprises the following steps: obtaining a protein sequence, an MHC-I/HLA-I sequence and a peptide fragment sequence; training a protein sequence word segmenter using the protein sequence; training a protein characterization model using the protein sequence segmented by the protein sequence word segmenter to obtain a trained protein characterization pre-training model; fine-tuning the trained protein characterization pre-training model using the MHC-I/HLA-I sequence and the peptide fragment sequence processed by the protein sequence word segmenter to obtain a binary classification model; and applying the binary classification model to the target wild-type/mutant peptide fragments, classifying the mutant peptide fragments, and obtaining candidate neoantigen mutant peptides positively correlated with TCR recognition.

Description

Method, apparatus and storage medium for identifying MHC-I/HLA-I binding and TCR recognition peptides
Technical Field
The application relates to the technical field of neoantigens, and in particular to a method, a device and a storage medium for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR.
Background
The earliest tumor vaccines date from 1994. Recent technological advances have enabled tumor vaccines to target patient-specific mutated antigens: tumor-associated antigens, or peptide fragments or mRNAs of tumor-specific antigens, are injected into the patient to activate the patient's immune system. The key breakthrough for personalized tumor vaccines is the development of neoantigen detection and prediction techniques for preparing neoantigen vaccines. Predicting the binding potential of specific mutant peptides to the major histocompatibility complex is an important part of neoantigen detection, because a tumor-specific antigen must first be bound by the major histocompatibility complex and presented on the cell surface before it can be recognized by T cell receptors, which ultimately leads to killing of the tumor cells. It is therefore of great importance to predict whether a peptide fragment binds to MHC-I/HLA-I, and whether the resulting complex (pMHC) is presented on the cell surface and recognized by the TCR.
In recent years the neoantigen field has been very active, producing large and broad databases that can be used to train models for predicting the affinity between peptide fragments and MHC-I/HLA-I. Meanwhile, a variety of algorithms have been applied to predicting the binding potential of peptide fragments to MHC-I/HLA-I, including traditional machine learning methods and rapidly developing deep learning methods; artificial neural networks (ANN), convolutional neural networks (CNN), the Bayesian least-squares method (SMM), random forests and similar methods have been used to build affinity models. Although these algorithms are reported to perform well on their test sets, their performance on completely independent data sets is mediocre, and affinity prediction for peptide fragments and MHC-I/HLA-I still has a high error rate, particularly a high false positive rate, which has serious implications in clinical applications. The problem arises from the following points: first, comprehensive and sufficiently large data sets for model training are not available, and it is difficult to learn many features from limited data; second, the conventional algorithms used in prediction models take few characteristics of the protein sequence into account and ignore sequence features that may have an important influence when protein sequences interact; third, representing MHC-I/HLA-I by its full-length sequence or by a pseudo-sequence does not let the model learn, in a targeted manner, the features of the sites that matter for peptide binding, which degrades model performance; fourth, current common methods focus only on pMHC presentation and on characteristics of the mutant peptide fragment such as binding position and amino acid residue type, and do not consider the important influence of the central tolerance mechanism on TCR recognition, which hinders the selection of more accurate neoantigen mutant peptides.
Disclosure of Invention
To solve the above problems, the application provides a method, a device and a storage medium for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR. The method captures, in a more targeted manner, the key features of tumor-specific peptide recognition, transport, presentation and TCR recognition. To address the insufficient amount of data available for model construction, a new modeling approach is applied to model training and to feature extraction and grouping, thereby improving the accuracy of neoantigen identification.
Embodiments of the present application provide a method for identifying a peptide fragment that binds to MHC-I/HLA-I and is recognized by a TCR, comprising:
obtaining a protein sequence, an MHC-I/HLA-I sequence and a peptide fragment sequence;
training a protein sequence word segmenter using the protein sequence;
training a protein characterization model using the protein sequence segmented by the protein sequence word segmenter to obtain a trained protein characterization pre-training model;
fine-tuning the trained protein characterization pre-training model using the MHC-I/HLA-I sequence and the peptide fragment sequence processed by the protein sequence word segmenter to obtain a binary classification model;
applying the binary classification model to the target wild-type/mutant peptide fragments, classifying the mutant peptide fragments, and obtaining candidate neoantigen mutant peptides positively correlated with TCR recognition;
wherein MHC-I refers to major histocompatibility complex class I; HLA-I refers to the major histocompatibility complex class I in humans; and TCR refers to the T cell receptor.
Preferably, the peptide fragment sequence comprises a positive peptide fragment sequence and a negative peptide fragment sequence, wherein the positive peptide fragment sequence is a peptide fragment that binds to MHC-I/HLA-I and is labeled 1; the negative peptide fragment sequence is randomly generated from the protein sequence according to the characteristics of the positive peptide fragment sequence and is labeled 0.
Preferably, obtaining the protein sequence comprises collecting protein sequences from a protein database; correspondingly, a word segmenter for high-frequency recurring amino acid residue combination motifs is trained on the collected protein sequences to obtain the protein sequence word segmenter.
Preferably, training the protein characterization model using the protein sequence segmented by the protein sequence word segmenter to obtain a trained protein characterization pre-training model comprises:
inputting the unlabeled protein sequence into the protein sequence word segmenter for word segmentation to obtain a protein sequence segmented into amino acid functional residue combination motifs;
converting the protein sequence segmented into amino acid functional residue combination motifs into a numerically represented protein sequence;
and training the protein characterization model using the numerically represented protein sequence to obtain the trained protein characterization pre-training model.
Preferably, fine-tuning the trained protein characterization pre-training model using the MHC-I/HLA-I sequence and the peptide fragment sequence processed by the protein sequence word segmenter to obtain a binary classification model comprises:
inputting the labeled MHC-I/HLA-I sequence into the protein sequence word segmenter for word segmentation to obtain an MHC-I/HLA-I sequence segmented into amino acid functional residue combination motifs, and converting that segmented sequence into a numerically represented MHC-I/HLA-I sequence;
converting the labeled peptide fragment sequence into a numerically represented peptide fragment sequence;
and concatenating the numerically represented MHC-I/HLA-I sequence with the numerically represented peptide fragment sequence to obtain a spliced sequence, and fine-tuning the trained protein characterization pre-training model on the spliced sequence to obtain the binary classification model.
Preferably, applying the binary classification model to the target wild-type/mutant peptide fragment comprises:
inputting the target wild-type/mutant peptide fragment into the binary classification model, and outputting an attention score for each site of every mutant peptide fragment predicted to be positive, together with an attention score for each amino acid functional residue combination motif of the MHC-I/HLA-I sequence;
determining, from the per-site attention scores of the positive mutant peptide fragment and the attention scores of the MHC-I/HLA-I sequence motifs, the anchor positions at which the MHC-I/HLA-I sequence binds the positive peptide fragment and the key characterization sites of the MHC-I/HLA-I sequence, and classifying the positive mutant peptide fragment using those anchor positions and the mutation site of the peptide fragment.
Preferably, classifying using the anchor positions at which the MHC-I/HLA-I sequence binds the positive mutant peptide fragment and the mutation position relative to the wild-type peptide fragment comprises:
classifying the positive mutant peptide fragment according to the anchor sites at which the MHC-I/HLA-I sequence binds the peptide fragment, in combination with the position of the mutation, and judging from the classification result whether the peptide fragment is suitable for selection as a candidate neoantigen mutant peptide positively correlated with TCR recognition.
Preferably, judging from the classification result whether the mutant peptide is suitable for selection as a candidate neoantigen mutant peptide positively correlated with TCR recognition comprises:
when the wild-type peptide fragment shows weak MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at an MHC-I/HLA-I anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide;
when the wild-type peptide fragment shows weak MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at a non-anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide;
when the wild-type peptide fragment shows strong MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at a non-anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide;
when the wild-type peptide fragment shows strong MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at an MHC-I/HLA-I anchor site and retains strong MHC-I/HLA-I binding with weak TCR recognition, judging that the mutant peptide cannot be selected as a neoantigen mutant peptide.
Aiming at the key step of tumor neoantigen immunotherapy, the application applies a deep learning algorithm from natural language processing to identify the recognition, transport and presentation of wild-type/mutant peptides by the histocompatibility complex MHC-I/HLA-I, identifies the biological features of the protein interaction, determines the anchor sites between the peptide and MHC-I/HLA-I, and constructs a more representative sequence characterization of MHC-I/HLA-I. At the same time, it introduces the influence of the central tolerance mechanism on TCR recognition, further improving the identification of tumor neoantigens.
Drawings
FIG. 1 is a flow chart of the method for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR according to the present application;
FIG. 2 is a flow chart of the algorithm for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR according to the present application.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. In the following description, suffixes such as "module", "component" or "unit" used to denote elements serve only to facilitate the description of the present application and have no particular meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.
The present application provides a deep learning algorithm for identifying whether wild-type/mutant peptide fragments bind to the major histocompatibility complex MHC-I/HLA-I, are transported and presented on the cell surface, and are likely to be recognized by a TCR, thereby facilitating the selection of tumor-specific neoantigens. Specifically, natural language processing methods are applied to discover and extract protein sequence features and to construct a protein characterization pre-training model. Using protein sequences and a protein interaction data set, the amino acid residue combination motifs and key site features that are positively correlated with peptide binding, transport and presentation are extracted. On the basis of the protein characterization pre-training model, a positive set of peptide sequences is collected and a negative set is constructed, and the pre-training model is fine-tuned to obtain an accurate model that can infer the likelihood that a wild-type/mutant peptide fragment binds to MHC-I/HLA-I. The resulting model can be applied to binding prediction across peptide fragments and MHC-I/HLA-I types in general, heuristically reflects the biological characteristics of peptide presentation, and yields a prediction model with better interpretability and accuracy than conventional algorithms. At the same time, the features of recognition between the TCR and the pMHC formed after the peptide binds MHC-I/HLA-I are further identified, so that it can be predicted whether a peptide that binds MHC also has the capacity to activate the TCR; this can ultimately be used to select candidate neoantigens with greater potential.
FIG. 1 is a flow chart of the method for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR according to the present application. As shown in FIG. 1, the method comprises:
step S101: obtaining a protein sequence, an MHC-I/HLA-I sequence and a peptide fragment sequence;
step S102: training a protein sequence word segmenter using the protein sequence;
step S103: training a protein characterization model using the protein sequence segmented by the protein sequence word segmenter to obtain a trained protein characterization pre-training model;
step S104: fine-tuning the trained protein characterization pre-training model using the MHC-I/HLA-I sequence and the peptide fragment sequence processed by the protein sequence word segmenter to obtain a binary classification model;
step S105: applying the binary classification model to the target wild-type/mutant peptide fragments, classifying the mutant peptide fragments, and obtaining candidate neoantigen mutant peptides positively correlated with TCR recognition.
Specifically, the peptide fragment sequence comprises a positive peptide fragment sequence and a negative peptide fragment sequence, wherein the positive peptide fragment sequence is a peptide fragment that binds to MHC-I/HLA-I and is labeled 1; the negative peptide fragment sequence is randomly generated from the protein sequence according to the characteristics of the positive peptide fragment sequence and is labeled 0.
Obtaining the protein sequence comprises collecting protein sequences from a protein database; correspondingly, a word segmenter for high-frequency recurring amino acid residue combination motifs is trained on the collected protein sequences to obtain the protein sequence word segmenter.
Further, training the protein characterization model using the protein sequence segmented by the protein sequence word segmenter to obtain a trained protein characterization pre-training model comprises: inputting the unlabeled protein sequence into the protein sequence word segmenter for word segmentation to obtain a protein sequence segmented into amino acid functional residue combination motifs; converting the segmented protein sequence into a numerically represented protein sequence; and training the protein characterization model using the numerically represented protein sequence to obtain the trained protein characterization pre-training model.
More specifically, fine-tuning the trained protein characterization pre-training model using the MHC-I/HLA-I sequence and the peptide fragment sequence processed by the protein sequence word segmenter to obtain a binary classification model comprises: inputting the labeled MHC-I/HLA-I sequence into the protein sequence word segmenter for word segmentation to obtain an MHC-I/HLA-I sequence segmented into amino acid functional residue combination motifs, and converting that segmented sequence into a numerically represented MHC-I/HLA-I sequence; converting the labeled peptide fragment sequence into a numerically represented peptide fragment sequence; and concatenating the numerically represented MHC-I/HLA-I sequence with the numerically represented peptide fragment sequence to obtain a spliced sequence, and fine-tuning the trained protein characterization pre-training model on the spliced sequence to obtain the binary classification model.
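The numerical conversion and splicing described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the special tokens `[PAD]`, `[CLS]`, `[SEP]` and `[UNK]`, the helper names, and the toy HLA motif tokens are all assumptions introduced here; the source only states that the two numerically represented sequences are concatenated.

```python
from typing import Dict, Iterable, List

SPECIALS = ("[PAD]", "[CLS]", "[SEP]", "[UNK]")  # assumed special tokens


def build_vocab(tokens: Iterable[str]) -> Dict[str, int]:
    """Assign an integer ID to every token, reserving the first IDs for specials."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to IDs; unknown tokens fall back to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in tokens]


def splice(hla_tokens: List[str], peptide: str, vocab: Dict[str, int]) -> List[int]:
    """Concatenate the encoded HLA motif tokens and the per-residue encoded peptide."""
    cls, sep = vocab["[CLS]"], vocab["[SEP]"]
    return [cls] + encode(hla_tokens, vocab) + [sep] + encode(list(peptide), vocab) + [sep]


# Hypothetical segmenter output for an HLA-I sequence, plus a toy peptide.
hla_tokens = ["GSH", "SM", "RY"]
peptide = "SIINFEKL"
vocab = build_vocab(hla_tokens + list(peptide))
spliced = splice(hla_tokens, peptide, vocab)
```

The spliced ID list, together with the 0/1 label, is the kind of input a fine-tuning step would consume; padding and batching are left out of the sketch.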
Further, applying the binary classification model to the target wild-type/mutant peptide fragment comprises: inputting the target wild-type/mutant peptide fragment into the binary classification model, and outputting an attention score for each site of every mutant peptide fragment predicted to be positive, together with an attention score for each amino acid functional residue combination motif of the MHC-I/HLA-I sequence; determining, from the per-site attention scores of the positive mutant peptide fragment and the attention scores of the MHC-I/HLA-I sequence motifs, the anchor positions at which the MHC-I/HLA-I sequence binds the positive peptide fragment and the key characterization sites of the MHC-I/HLA-I sequence; and classifying the positive mutant peptide fragment using those anchor positions and the mutation site of the peptide fragment.
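One way to turn per-site attention scores into anchor positions, and to check whether a mutation falls on an anchor, is sketched below. The source does not say how attention is aggregated across heads or layers, nor how many sites count as anchors; the sketch assumes a single score per peptide position and takes the top `k` positions, and the scores and peptides are toy values.

```python
from typing import List, Sequence


def top_positions(scores: Sequence[float], k: int = 2) -> List[int]:
    """Return the 1-based positions of the k highest attention scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(i + 1 for i in ranked[:k])


def mutated_positions(wild_type: str, mutant: str) -> List[int]:
    """Return the 1-based positions at which the mutant differs from the wild type."""
    if len(wild_type) != len(mutant):
        raise ValueError("wild-type and mutant peptides must have equal length")
    return [i + 1 for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b]


# Toy attention scores for a 9-residue peptide (assumed, not model output).
scores = [0.05, 0.30, 0.08, 0.04, 0.06, 0.05, 0.07, 0.05, 0.30]
anchors = top_positions(scores, k=2)
mutations = mutated_positions("GILGFVFTL", "GMLGFVFTL")
mutation_at_anchor = any(pos in anchors for pos in mutations)
```

The boolean `mutation_at_anchor` is the piece of information the subsequent classification step needs alongside the binding and TCR-recognition predictions.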
Further, classifying using the anchor positions at which the MHC-I/HLA-I sequence binds the positive mutant peptide fragment and the mutation position relative to the wild-type peptide fragment comprises: classifying the positive mutant peptide fragment according to the anchor sites at which the MHC-I/HLA-I sequence binds the peptide fragment, in combination with the position of the mutation, and judging from the classification result whether the peptide fragment is suitable for selection as a candidate neoantigen mutant peptide positively correlated with TCR recognition.
More specifically, judging from the classification result whether the mutant peptide is suitable for selection as a candidate neoantigen mutant peptide positively correlated with TCR recognition comprises: when the wild-type peptide fragment shows weak MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at an MHC-I/HLA-I anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide; when the wild-type peptide fragment shows weak MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at a non-anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide; when the wild-type peptide fragment shows strong MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at a non-anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide; when the wild-type peptide fragment shows strong MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at an MHC-I/HLA-I anchor site and retains strong MHC-I/HLA-I binding with weak TCR recognition, judging that the mutant peptide cannot be selected as a neoantigen mutant peptide.
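The four cases above amount to a small lookup table, sketched here. The function and variable names are illustrative assumptions, and "strong" versus "weak" is reduced to a boolean because the source gives no thresholds; combinations the text does not mention return `None` rather than a guessed verdict.

```python
from typing import Dict, Optional, Tuple

# Key: (wild-type strong binding, wild-type strong TCR recognition,
#       mutation at anchor site, mutant strong binding, mutant strong TCR recognition)
# Value: True if the mutant peptide can be selected as a neoantigen mutant peptide.
RULES: Dict[Tuple[bool, bool, bool, bool, bool], bool] = {
    (False, False, True, True, True): True,    # weak/weak WT, anchor mutation
    (False, False, False, True, True): True,   # weak/weak WT, non-anchor mutation
    (True, False, False, True, True): True,    # strong/weak WT, non-anchor mutation
    (True, False, True, True, False): False,   # strong/weak WT, anchor mutation, TCR stays weak
}


def select_candidate(
    wt_strong_binding: bool,
    wt_strong_tcr: bool,
    mutation_at_anchor: bool,
    mut_strong_binding: bool,
    mut_strong_tcr: bool,
) -> Optional[bool]:
    """Look up the verdict; None means the case is not covered by the four rules."""
    key = (wt_strong_binding, wt_strong_tcr, mutation_at_anchor,
           mut_strong_binding, mut_strong_tcr)
    return RULES.get(key)
```

Keeping the rules as data makes it easy to see that only four of the thirty-two possible combinations are specified.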
An electronic device provided by an embodiment of the present application includes: a memory; a processor; and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR.
A computer-readable storage medium provided by an embodiment of the present application has a computer program stored thereon; the computer program is executed by a processor to implement the method for identifying peptide fragments that bind to MHC-I/HLA-I and are recognized by a TCR.
In summary, the present application provides a method for inferring the probability that a peptide fragment binds to MHC-I/HLA-I. The method learns biological characteristics from protein sequences on the basis of a pre-trained model and yields an interpretable and accurate peptide and MHC-I/HLA-I binding prediction model. It comprises: constructing a protein sequence word segmenter and a protein characterization pre-training model from a large amount of unlabeled protein data; and constructing a labeled training set, validation set and test set comprising positive and negative data sets. The positive data set comprises positive peptide sequences, obtained from databases and derived from mass spectrometry or in vitro binding experiments, together with their corresponding MHC-I/HLA-I subtypes. The negative data set comprises negative peptide sequences randomly generated from the protein sequence according to the sequence length characteristics of the positive peptide fragments; any generated peptide that also occurs among the positive peptide fragments must be removed from the negative set.
The MHC-I/HLA-I subtype sequences are segmented with the protein sequence word segmenter, and the MHC-I/HLA-I sequence and the peptide fragment sequence are converted into numerical representations. A binary classification model is built by fine-tuning the protein characterization pre-training model, with supervised learning on the generated numerical data to further improve the model's performance on this task. The binary classification model is then applied: from the attention scores of the predicted positive peptide fragments and the MHC-I/HLA-I sequence motifs, a characteristic MHC-I/HLA-I pattern is obtained and the anchor sites at which the peptide fragment binds MHC-I/HLA-I are inferred. Going further, a method is proposed for inferring whether a peptide fragment can bind MHC-I/HLA-I, activate a TCR response and trigger an immune response: using the anchor positions at which the peptide binds MHC-I/HLA-I, the wild-type/mutant peptide fragments are classified by MHC-I/HLA-I binding and TCR recognition type according to the position of the mutation within the peptide and the influence of central tolerance, and unsuitable candidate neoantigen peptide fragments are filtered out according to the classification result, so that more accurate neoantigens are obtained.
The contents of the present application are described in detail below with reference to FIG. 2.
Example 1
Step 1, acquiring a data set required by model training and screening the data set
1.1) Protein sequences are obtained from UniProt and other databases;
1.2) Positive peptide sequences are obtained from IEDB and other databases, where each positive peptide satisfies all of the following conditions:
the peptide is 8-15 amino acid residues long;
the MHC/HLA molecule that recognizes the positive peptide is class I;
the corresponding wild-type peptide has a specific UniProt accession number.
1.3) HLA sequences are obtained from IPD-IMGT/HLA and other databases, and HLA-I sequences that have undergone multiple sequence alignment are used to represent the HLA sequences.
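The three screening conditions for positive peptides can be expressed as a simple filter. The record layout below is an assumption made for illustration and does not reflect the actual IEDB export format; the accession strings are placeholders, not real database entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PeptideRecord:
    """Hypothetical, simplified record for one database entry."""
    peptide: str
    mhc_class: str               # "I" or "II"
    uniprot_id: Optional[str]    # accession of the source protein, if known


def is_valid_positive(rec: PeptideRecord, min_len: int = 8, max_len: int = 15) -> bool:
    """Apply the three conditions: length 8-15, MHC class I, known UniProt accession."""
    return (
        min_len <= len(rec.peptide) <= max_len
        and rec.mhc_class == "I"
        and bool(rec.uniprot_id)
    )


records: List[PeptideRecord] = [
    PeptideRecord("SIINFEKL", "I", "X00001"),    # kept
    PeptideRecord("GILGFVFTL", "II", "X00002"),  # dropped: class II
    PeptideRecord("NLVPMVATV", "I", None),       # dropped: no accession
    PeptideRecord("AAAAAAA", "I", "X00003"),     # dropped: only 7 residues
]
positives = [r.peptide for r in records if is_valid_positive(r)]
```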
Step 2, constructing a positive data set and a negative data set
The positive data set consists of protein sequences and positive peptide sequences. The protein sequences are used to construct the pre-training model, which learns the expression characteristics and context of protein sequences. The positive peptide sequences are labeled peptide fragments that can bind MHC-I/HLA-I. The negative sequences are constructed randomly on the basis of the positive sequences, that is, peptide fragments presumed not to bind MHC-I/HLA-I are derived from the positive peptide fragments. The negative peptide fragments are constructed as follows:
the peptide fragment length is 8-15 amino acids;
the fragment is randomly generated from the same protein sequence as the positive peptide;
peptide fragments that are present among the positive peptides are removed.
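As an illustrative, non-limiting sketch, the negative peptide construction described above can be written in Python as follows. The sketch assumes that "randomly generated" means sampling windows of 8-15 residues from the source protein of the positive peptide; the function name `make_negatives` and its parameters are hypothetical and are not part of the application.

```python
import random

def make_negatives(protein, positives, n, min_len=8, max_len=15, seed=0):
    """Sample up to n peptide windows from the source protein that are
    not among the known positive (binding) peptides."""
    rng = random.Random(seed)
    positive_set = set(positives)
    # Enumerate every window of an allowed length, then drop the positives.
    candidates = sorted({
        protein[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(len(protein) - k + 1)
    } - positive_set)
    rng.shuffle(candidates)
    return candidates[:n]
```

Each returned fragment would receive the label 0, and each positive peptide the label 1.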
Step 3, constructing and applying a protein sequence word segmentation device
31) Amino acid residues are combined across a large number of proteins by means of a mechanical algorithm or a statistical inference model, and the amino acid residue combinations that occur frequently in protein sequences are inferred; these are likely to be functional motifs or to play an important role in protein function. The positions of important motifs can also be inferred. Unlike approaches that focus only on single amino acid residues, the word segmenter helps to explore the key features involved in protein function.
32) Protein sequence word segmentation. All protein sequences are segmented by the word segmenter, and amino acid residues that can act together are grouped so as to facilitate subsequent model construction.
The embodiment of the application learns how amino acid residues combine in the protein sequences of a protein database, groups high-frequency residue combinations together, and constructs a protein sequence word segmenter. This comprises: downloading a protein data set from a protein database and extracting the protein sequences from it; and training an interpretable amino acid residue word segmenter on these protein sequences according to characteristics such as the frequency of occurrence and the left and right connectivity of amino acid residues/residue combinations in known protein sequences.
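One mechanical way to group frequently co-occurring residues is greedy pair merging in the style of byte-pair encoding. The following Python sketch is illustrative only: the application does not state that this exact algorithm is used, and the function names `learn_merges`, `apply_merge` and `segment` are hypothetical.

```python
from collections import Counter

def apply_merge(tokens, a, b):
    """Replace every adjacent pair (a, b) in tokens with the joined token a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_merges(sequences, num_merges):
    """Greedily learn merges of the most frequent adjacent token pair."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing recurs, so stop merging
            break
        merges.append((a, b))
        corpus = [apply_merge(toks, a, b) for toks in corpus]
    return merges

def segment(sequence, merges):
    """Segment one protein sequence with the learned merges, in order."""
    tokens = list(sequence)
    for a, b in merges:
        tokens = apply_merge(tokens, a, b)
    return tokens
```

In this sketch the vocabulary grows by one residue combination per merge, so the number of merges plays a role analogous to a vocabulary size setting.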
Step 4, protein characterization pre-training model construction
The large number of segmented protein sequences implicitly contain the relationships among protein function-related features, protein expression patterns and the motifs in the sequences. Therefore, a masked language model from natural language processing is used to randomly mask motifs in the protein sequences and to predict and learn them. Masked language modeling is a self-supervised learning task whose aim is to let the model learn the contextual information and language rules of motifs during pre-training: the model receives an input sequence, a certain proportion of the motifs in the input sequence are randomly selected and masked, and the model predicts the masked motifs. For example:
Input sequence (as shown in SEQ ID NO. 1): MAF SAE DVL KE YD RRRR ME ALLL
Masked sequence (masking applied to the sequence shown in SEQ ID NO. 1): MAF SAE [mask] KE YD RRRR ME ALLL
The application learns rich protein motif-related information through unsupervised learning on a large amount of unlabeled protein data, and then uses transfer learning to apply the resulting model to a specific labeled classification task. This comprises: inputting a large number of unlabeled protein sequences into the protein sequence word segmenter, segmenting them, and representing them as combinations of amino acid residues; and constructing a masked language pre-training model (MLM) in which amino acid residue combinations in the segmented input sequences are randomly masked, so that in inferring the masked combinations the model learns the contextual information and expression rules of protein sequences, and thereby the key residue features.
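The random masking step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the default ratio of 0.15 follows the 15% mask ratio given in the second example, at least one token is always masked, and the name `mask_tokens` and the returned `targets` mapping are hypothetical.

```python
import random

MASK = "[mask]"

def mask_tokens(tokens, ratio=0.15, seed=None):
    """Randomly replace a proportion of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model should predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))  # mask at least one token
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets
```

During pre-training, the model would receive the masked list as input and be scored on how well it recovers the tokens stored in `targets`.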
Step 5, fine tuning protein characterization pre-training model
51) The labeled positive peptide fragments that can bind MHC-I/HLA-I and the negative peptide fragments randomly generated from the positive peptide fragments are labeled 1 and 0 respectively. It is worth noting that the positive data are obtained through experiments, so only a limited amount of labeled training data is available; this is why a large amount of unlabeled protein data is used to build the pre-training model.
52) The labeled data from 51) are divided into a training set, a validation set and a test set in a certain proportion.
53) The pre-trained model is fine-tuned on the data from 52) to build a two-class model.
54) The model structure is shown in the algorithm flow chart of the application.
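The division in 52) can be sketched as a seeded shuffle followed by slicing. The 80/10/10 proportion below is an assumption chosen for illustration, since the application only states "a certain proportion"; the function name `split_dataset` is hypothetical.

```python
import random

def split_dataset(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle labeled records and split them into train/validation/test."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```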
Step 6, two classification model applications
61) Peptide fragments with the potential to bind MHC-I/HLA-I are identified.
62) Important features in the protein sequence are extracted, including functional amino acid residue combination motifs and important functional/binding sites.
By the method, the peptide fragment with binding potential to MHC-I/HLA-I can be predicted from biological characteristics, and important characteristics can be identified.
Step 7, identifying the influence of the mutation position on the binding of the peptide fragment to MHC-I/HLA-I and the TCR recognition potential
71) Based on the obtained binding sites between the peptide fragment and MHC-I/HLA-I, combined with the position of the mutation in the wild-type/mutant peptide fragment, and taking into account the influence of the central tolerance mechanism on neoantigen screening, the pMHC binding and TCR recognition types are divided into four categories. The first three categories have neoantigen potential; the fourth must be excluded from neoantigen screening:
first category: the wild-type peptide shows weak MHC-I/HLA-I binding and weak TCR recognition; the mutant peptide carries a mutation at an MHC-I/HLA-I binding site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition. There is no central tolerance, and the mutation at the anchor position can activate the TCR;
second category: the wild-type peptide shows weak MHC-I/HLA-I binding and weak TCR recognition; the mutant peptide carries a mutation outside the MHC-I/HLA-I binding sites and is converted to strong MHC-I/HLA-I binding and strong TCR recognition. There is no central tolerance, and the mutation at the non-anchor position can activate the TCR;
third category: the wild-type peptide shows strong MHC-I/HLA-I binding and weak TCR recognition; the mutant peptide carries a mutation outside the MHC-I/HLA-I binding sites and is converted to strong MHC-I/HLA-I binding and strong TCR recognition. Central tolerance applies, but the mutation at the non-anchor position can still activate the TCR;
fourth category: the wild-type peptide shows strong MHC-I/HLA-I binding and weak TCR recognition; the mutant peptide carries a mutation at an MHC-I/HLA-I binding site and shows strong MHC-I/HLA-I binding and weak TCR recognition. Central tolerance applies, and the mutation, which lies at an anchor position, does not activate the TCR. In other words, in this combination the mutation occurs at the MHC-I/HLA-I binding site, the sequence exposed to the recognition end of the TCR is unchanged, and the peptide fragment therefore remains subject to central immune tolerance and is not a suitable neoantigen peptide.
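The four categories reduce to two yes/no questions about a mutant peptide that is itself predicted to bind strongly: whether the wild-type peptide already binds MHC-I/HLA-I strongly (and is therefore subject to central tolerance), and whether the mutation lies at an anchor position. The following Python sketch encodes this decision; the function name `classify_mutant` and the boolean inputs are hypothetical simplifications, and how "strong" binding is thresholded is left open.

```python
def classify_mutant(wild_type_binds_strongly, mutation_at_anchor):
    """Return (category, keep) for a mutant peptide predicted to bind
    MHC-I/HLA-I strongly.

    category: 1-4 as in the four-category scheme described above.
    keep: False only for category 4, which is excluded from screening.
    """
    if not wild_type_binds_strongly:
        # No central tolerance: both anchor and non-anchor mutations qualify.
        category = 1 if mutation_at_anchor else 2
    else:
        # Central tolerance applies: only a non-anchor mutation changes
        # the surface presented to the TCR.
        category = 4 if mutation_at_anchor else 3
    return category, category != 4
```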
In summary, the present application infers, from biological characteristics and attention scores, the features that play an important role in the recognition between peptide fragments and MHC-I/HLA-I. This includes a method for characterizing MHC-I/HLA-I sequences and obtaining the key amino acid residue combinations and their positions: according to the classification model, the roles of the types of amino acid residue combination and of their positions in the MHC-I/HLA-I sequence in peptide recognition, binding and presentation are predicted, and the key amino acid residue combinations at key sites are obtained from the attention scores. It also includes inferring the key binding sites of the peptide fragment and obtaining the key amino acids and their positions. Using the obtained binding site information, the MHC-I/HLA-I binding and TCR recognition types of wild-type/mutant peptide fragments are classified according to the position of the mutation in the peptide fragment, the influence of central tolerance on neoantigen screening is introduced, and unsuitable neoantigen peptide fragments are removed according to the classification result to obtain more accurate neoantigens.
Example 2
The application obtains protein sequences from the UniProt protein database to construct the word segmenter; obtains positive peptides and their MHC-I/HLA-I binding sequences from IEDB and generates negative peptide fragments for the positive sequences; and constructs the protein characterization pre-training model from the protein sequences segmented by the word segmenter. The MHC-I/HLA-I sequences and peptide fragments segmented by the word segmenter are converted into a numerical representation, spliced, and input into the protein characterization pre-training model for fine-tuning, finally yielding a two-class model that can predict the binding potential between peptide fragments and MHC-I/HLA-I. Key functional sites and motifs in the protein sequence are extracted according to the characteristics of the trained model and algorithm and are used to characterize the MHC-I/HLA-I sequence. The antigen peptides are further classified using the position of the tumor-specific mutation on the antigen peptide and the binding potential with MHC, so that antigen peptides that cannot be recognized by the TCR owing to central immune tolerance are eliminated, and more accurate and comprehensive neoantigens are finally obtained. The method specifically comprises the following steps:
Step 11, construction and use of the word segmenter
Training was performed using SentencePiece with the vocabulary size set to 10,000. The effect of the word segmenter is as follows:
before word segmentation: as shown in SEQ ID NO. 2: MIINPTSDPEVSALEKKNTGRIAQIIGPVLDVTFPPGKMP … …
After word segmentation: as shown in SEQ ID NO. 2: MII NP TTSD PE VS ALEK KN TGRI AQI IGP VLD VT FP PG KMP … …
Step 12, pre-training model construction
The protein characterization pre-training model is constructed using a natural language model and is trained without supervision on the segmented protein sequences to learn the language characteristics and context of protein sequences. The mask ratio is set to 15%; 568,745 protein sequences with a total length of 206M are used.
Step 13, fine tuning protein characterization pre-training model
111) The corresponding negative peptides are generated from the positive peptides that bind MHC-I/HLA-I, the MHC-I/HLA-I sequences are segmented, and the negative and positive data items are labeled 0 and 1 respectively. The distribution of the positive data set is as follows:
As shown in the table above, there are only 121 HLA types, so characterizing proteins from such a limited data set is very difficult; this is why the protein characterization pre-training model was constructed first.
112) The raw data input into the protein characterization pre-training model include, for example, the sequence formats shown in SEQ ID NO. 3 and SEQ ID NO. 4:
E I E I C D G F,M RVT AP RTLL LLL WG AV ALT ET WAG SH SM RY FH TS VS RPG RG EP RFI TVG YVD DT……AGL AV ------ L AVV - VIG AV VAA VM - CRR KS SGG KGG SY SQ AAC SD S AQG SD VSL TA ------------,0
wherein the amino acid sequence of SEQ ID NO. 3 is: E I E I C D G F
The amino acid sequence of SEQ ID NO. 4 is: M RVT AP RTLL LLL WG AV ALT ET WAG SH SM RY FH TS VS RPG RG EP RFI TVG YVD DT……AGL AV ------ L AVV - VIG AV VAA VM - CRR KS SGG KGG SY SQ AAC SD S AQG SD VSL TA ------------
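The raw input line above appears to have three comma-separated fields: the peptide written residue by residue, the segmented and aligned MHC-I/HLA-I sequence, and the 0/1 label. Under that reading, which is an inference from the single example shown, a record can be assembled as in the following Python sketch; the function name `format_record` is hypothetical.

```python
def format_record(peptide, hla_tokens, label):
    """Build one 'peptide,HLA,label' line in the format of the example above.

    peptide: unsegmented peptide string, e.g. "EIEICDGF"
    hla_tokens: tokens produced by the word segmenter for the HLA sequence
                (alignment gaps may appear as '-' runs)
    label: 1 for a binding (positive) pair, 0 for a negative pair
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    peptide_part = " ".join(peptide)   # one residue per token
    hla_part = " ".join(hla_tokens)    # space-separated segmenter output
    return f"{peptide_part},{hla_part},{label}"
```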
Step 14, two classification model applications
221) Using the data format of 112), the binding potential between peptide fragments and MHC-I/HLA-I can be predicted. The input peptide sequences include the sequences shown in SEQ ID NO. 5 and SEQ ID NO. 6, the corresponding MHC sequences are shown in SEQ ID NO. 7 and SEQ ID NO. 8 respectively, and the results are as follows:
where Label is the actual label and Prediction is the label predicted by the model.
222) Evaluation of the two-class model.
The evaluation data set is a validation set independent of the data set used for model training; the validation set is constructed as follows:
Model performance was calculated using the following formulas:
TP: true positive; TN: true negative; FP: false positive; FN: false negative;
AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR).
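Assuming the standard definitions built on TP, TN, FP and FN (accuracy, precision, recall and F1), and AUC computed as the probability that a positive example receives a higher score than a negative one, the evaluation can be sketched in Python as follows. Which of these metrics the application actually reports is an assumption, and the function names `confusion_metrics` and `roc_auc` are hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics derived from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # true positive rate (TPR)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # false positive rate
        "f1": f1,
    }

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half. Equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form of AUC is quadratic in the number of examples, which is acceptable for a sketch but would normally be replaced by a rank-based computation on large validation sets.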
Model performance was evaluated in comparison with the commonly used NetMHCpan4.1[5] software:
223) Extraction of the binding sites between MHC-I/HLA-I and the peptide fragment: the positions of the key amino acid residues are obtained, and the predictive performance of the model is explained from a biological point of view. A heat map is drawn from the attention scores of the different sites; the brighter the color, the more likely the site is a key binding site between the peptide fragment and MHC. This also indirectly demonstrates the good performance of the model, which can be used for the subsequent extraction of key-site features of MHC-I/HLA-I sequences.
224) Characterization of protein sequences: the key amino acid residue combinations and their positions are obtained, and the predictive performance of the model is explained from a biological point of view. A heat map of the attention score at each position of the HLA-A/B/C sequences is drawn to infer the overall key binding sites of the MHC-I/HLA-I sequence, which helps to characterize the MHC-I/HLA-I sequence better. In addition to the hot sites provided by the commonly used NetMHCpan or MHCflurry (31, 33, 48, 69, 83, 86, 87, 90, 91, 93, 94, 97, 98, 100, 101, 104, 105, 108, 115, 119, 121, 123, 126, 138, 140, 142, 167, 171, 174, 176, 180, 182, 183, 187, 191, 195, 223), further key sites such as 39 and 235 are extracted, which better fit the functional significance implied by the protein sequence.
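Selecting candidate key sites from per-position attention scores can be sketched as a simple top-k ranking. This is an illustration only: the application describes reading key sites off a heat map and does not specify a selection rule, so the cutoff `k`, the 1-based position convention and the function name `top_attention_sites` are assumptions.

```python
def top_attention_sites(scores, k=5, one_based=True):
    """Return the positions of the k highest attention scores, sorted
    by position. Ties are broken in favor of the earlier position."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    offset = 1 if one_based else 0
    return sorted(i + offset for i in ranked[:k])
```

For a peptide fragment, `scores` would hold one value per residue; for an MHC-I/HLA-I sequence, one value per aligned position or motif.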
Step 15, adding classification information on peptide fragment binding to MHC-I/HLA-I and on TCR recognition, so that neoantigen sequences are identified in more detail and more comprehensively.
Based on the anchor positions at which the peptide fragment binds MHC-I/HLA-I, obtained from the above analysis, the peptide fragments can be divided into 4 groups according to the binding potential of the wild-type/mutant peptides and the position of the tumor-specific mutation, as shown in the following table:
The antigen peptides of type 4 are removed during neoantigen screening, which improves the accuracy of neoantigen prediction.
In summary, the application discloses a novel method for predicting the binding of wild-type/mutant peptide fragments to MHC-I/HLA-I, their transport and presentation, and their activation of the TCR (T cell receptor). By applying transfer learning, the method extracts the protein residue combinations and key sites that play important roles in the binding process, characterizes MHC-I/HLA-I sequences better, and constructs a more accurate and biologically interpretable predictive model. The constructed model is used to explore key MHC-I/HLA-I binding sites, and neoantigens are further classified according to the position of the mutation in the peptide fragment, which improves the accuracy of neoantigen identification. Compared with traditional classifiers, the method is more efficient and accurate, and has important application value for tumor immunotherapy.
Compared with other existing analysis methods, the method has the following advantages:
(1) The biological characteristics of the protein sequence itself are used in model construction. By learning how amino acid residues combine in protein sequences, high-frequency amino acid residue combinations with distinctive characteristics and key sites are found, a word segmenter is constructed, and amino acid sequences are characterized more reasonably.
(2) Natural language processing algorithms are applied to the language of biological information, the protein sequence: each protein sequence is treated as a sentence, a prediction model is built with natural language processing methods, and the features and context within the sequence are learned.
(3) A pre-training model is constructed from a large amount of existing unlabeled data and is then fine-tuned with labeled data. A masked model is built to infer masked amino acid residue combinations, so that the pre-training model learns the contextual information and expression rules of protein sequences and acquires rich language knowledge. The pre-trained model is then fine-tuned with the limited labeled peptide fragment and MHC-I/HLA-I binding data to build a classification model better suited to affinity prediction.
(4) The natural language processing model can output the attention score of each site/motif in the protein sequence, from which the key sites involved when MHC-I/HLA-I binds a peptide fragment are determined, together with the key binding sites of the peptide fragment and a more representative and meaningful characterization of the MHC-I/HLA-I sequence.
(5) By determining the position of the tumor-specific mutation in the peptide fragment, the different pMHC binding and TCR recognition modes are grouped, and antigen peptide fragments are extracted with higher accuracy. According to the literature, an antigen sequence is recognized by the TCR while it is bound to MHC, and the position of the tumor-specific mutation affects both MHC binding and TCR recognition. Therefore, key MHC binding sites are obtained from the attention score at each antigen peptide position, and neoantigens are further classified according to the mutation position; peptide fragments that have high binding potential with MHC-I/HLA-I but cannot activate the TCR are removed, which improves the accuracy of antigen peptide screening. The improved algorithm predicts more accurately whether a wild-type/mutant peptide fragment binds the major histocompatibility complex, is transported and presented, and has TCR recognition potential, and it promotes the identification and clinical application of tumor neoantigens.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings; they do not limit the scope of the claims of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the present application shall fall within the scope of the appended claims.

Claims (10)

1. A method for identifying MHC-I/HLA-I binding and TCR-recognizing peptide fragments, comprising:
obtaining a protein sequence, an MHC-I/HLA-I sequence and a peptide fragment sequence;
training a protein sequence word segmentation device by utilizing the protein sequence;
training a protein characterization model by using the protein sequence subjected to word segmentation by the protein sequence word segmentation device to obtain a trained protein characterization pre-training model;
performing fine adjustment on the trained protein characterization pre-training model by utilizing the MHC-I/HLA-I sequence and the peptide fragment sequence subjected to word segmentation by the protein sequence word segmentation device to obtain a two-class model;
identifying and processing the target wild-type/mutant peptide fragments by using the two-class model, classifying the mutant peptide fragments, and obtaining candidate neoantigen mutant peptides positively correlated with TCR recognition;
wherein, the MHC-I refers to major histocompatibility complex class I; the HLA-I refers to major histocompatibility complex class I in humans; the TCR refers to a T cell receptor.
2. The method of claim 1, wherein the peptide fragment sequences comprise a positive peptide fragment sequence and a negative peptide fragment sequence, wherein the positive peptide fragment sequence is a peptide fragment that binds to MHC-I/HLA-I, labeled 1; the negative peptide fragment sequence is randomly generated in the protein sequence according to the characteristics of the positive peptide fragment sequence, and the label is 0.
3. The method of claim 2, wherein obtaining the protein sequence comprises collecting protein sequences from a protein database; correspondingly, a word segmenter for high-frequency recurring amino acid residue combination motifs is trained on the collected protein sequences to obtain the protein sequence word segmenter.
4. A method according to claim 3, wherein the training of the protein characterization model using the protein sequence subjected to the word segmentation by the protein sequence word segmentation unit to obtain the trained protein characterization pre-training model comprises:
inputting the unlabeled protein sequence into a protein sequence word segmentation device for word segmentation treatment to obtain a protein sequence segmented by amino acid functional residue combination motif;
converting the protein sequence after the amino acid functional residue combination motif is split into a protein sequence expressed by a digital;
and training the protein characterization model by using the protein sequence expressed by the numbers to obtain a trained protein characterization pre-training model.
5. The method of claim 4, wherein fine-tuning the trained protein characterization pre-training model using the MHC-I/HLA-I sequences and peptide fragment sequences segmented by the protein sequence word segmenter to obtain a two-class model comprises:
inputting the labeled MHC-I/HLA-I sequence into a protein sequence word segmentation device for word segmentation treatment to obtain an MHC-I/HLA-I sequence segmented by amino acid functional residue combination motif, and converting the MHC-I/HLA-I sequence segmented by the amino acid functional residue combination motif into a digital MHC-I/HLA-I sequence;
converting the tagged peptide fragment sequence to a digital representation of the peptide fragment sequence;
and splicing the MHC-I/HLA-I sequence converted into the digital representation and the peptide segment sequence converted into the digital representation to obtain a spliced sequence, and fine-tuning the trained protein characterization pre-training model by utilizing the spliced sequence to obtain a two-class model.
6. The method of claim 5, wherein said identifying and processing the target wild-type/mutant peptide fragments using the two-class model comprises:
inputting the target wild-type/mutant peptide fragments into the two-class model, and outputting the attention score of each site of the mutant peptide fragments predicted to be positive and the attention scores of the amino acid functional residue combination motifs of the MHC-I/HLA-I sequence;
judging the anchor position when the MHC-I/HLA-I sequence is combined with the positive peptide fragment and the key characterization site of the MHC-I/HLA-I sequence according to the attention score of each site of the positive mutant peptide fragment and the attention score of the MHC-I/HLA-I sequence motif, and classifying the positive mutant peptide fragment by utilizing the anchor position when the MHC-I/HLA-I sequence is combined with the peptide fragment and the mutation site of the peptide fragment.
7. The method of claim 6, wherein the classifying using the anchor position of the MHC-I/HLA-I sequence when bound to a positive mutant peptide fragment and the mutation position relative to a wild-type peptide fragment comprises:
and according to the anchoring site when the MHC-I/HLA-I sequence is combined with the peptide fragment, carrying out positive mutant peptide fragment classification treatment by combining with the position of the mutation, and judging whether the peptide fragment is suitable for being selected as candidate neoantigen mutant peptide positively correlated with TCR recognition according to classification results.
8. The method of claim 7, wherein judging, according to the classification result, whether the peptide fragment is suitable for selection as a candidate neoantigen mutant peptide positively correlated with TCR recognition comprises:
when the wild-type peptide fragment shows weak MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at an MHC-I/HLA-I anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide;
when the wild-type peptide fragment shows weak MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at a non-anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide;
when the wild-type peptide fragment shows strong MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at a non-anchor site and is converted to strong MHC-I/HLA-I binding and strong TCR recognition, judging that the mutant peptide can be selected as a neoantigen mutant peptide;
when the wild-type peptide fragment shows strong MHC-I/HLA-I binding and weak TCR recognition, and the mutant peptide fragment carries a mutation at an MHC-I/HLA-I anchor site and shows strong MHC-I/HLA-I binding and weak TCR recognition, judging that the mutant peptide cannot be selected as a neoantigen mutant peptide.
9. An electronic device, comprising: a memory; a processor; a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-8.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program being executed by a processor to implement the method of any of claims 1-8.
CN202311261520.3A 2023-09-27 2023-09-27 Method, apparatus and storage medium for identifying MHC-I/HLA-I binding and TCR recognition peptides Active CN116994654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311261520.3A CN116994654B (en) 2023-09-27 2023-09-27 Method, apparatus and storage medium for identifying MHC-I/HLA-I binding and TCR recognition peptides

Publications (2)

Publication Number Publication Date
CN116994654A true CN116994654A (en) 2023-11-03
CN116994654B CN116994654B (en) 2023-12-29

Family

ID=88534245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311261520.3A Active CN116994654B (en) 2023-09-27 2023-09-27 Method, apparatus and storage medium for identifying MHC-I/HLA-I binding and TCR recognition peptides

Country Status (1)

Country Link
CN (1) CN116994654B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1773517A (en) * 2005-11-10 2006-05-17 上海交通大学 Protein sequence characteristic extracting method based on Chinese participle technique
CN110752041A (en) * 2019-10-23 2020-02-04 深圳裕策生物科技有限公司 Method, device and storage medium for predicting neoantigen based on next generation sequencing
WO2020132235A1 (en) * 2018-12-20 2020-06-25 Merck Sharp & Dohme Corp. Methods and systems for the precise identification of immunogenic tumor neoantigens
CN112614538A (en) * 2020-12-17 2021-04-06 厦门大学 Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN116368570A (en) * 2020-10-15 2023-06-30 米尼奥公司 Methods, systems, and computer program products for determining peptide immunogenicity

Also Published As

Publication number Publication date
CN116994654B (en) 2023-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant