CN117594114A - Method for predicting antibody-like body combined with biomacromolecule modification based on protein structural domain - Google Patents


Info

Publication number
CN117594114A
Authority
CN
China
Prior art keywords
sequence
antibody
prediction model
amino acid
sequences
Prior art date
Legal status
Pending
Application number
CN202311419536.2A
Other languages
Chinese (zh)
Inventor
李磊
姜豪强
邹洋
张汉忠
周长景
Current Assignee
Qingdao Sanuo Gene Technology Co ltd
Rehabilitation University Preparatory
Original Assignee
Qingdao Sanuo Gene Technology Co ltd
Rehabilitation University Preparatory
Priority date
Filing date
Publication date
Application filed by Qingdao Sanuo Gene Technology Co ltd, Rehabilitation University Preparatory filed Critical Qingdao Sanuo Gene Technology Co ltd
Priority to CN202311419536.2A
Publication of CN117594114A


Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00: Pattern recognition
            • G06F18/20: Analysing
              • G06F18/24: Classification techniques
                • G06F18/243: Classification techniques relating to the number of classes
              • G06F18/27: Regression, e.g. linear or logistic regression
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/0464: Convolutional networks [CNN, ConvNet]
              • G06N3/08: Learning methods
                • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
                • G06N3/0985: Hyperparameter optimisation; Meta-learning; Learning-to-learn
      • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
          • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
            • G16B15/30: Drug targeting using structural data; Docking or binding prediction
          • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
            • G16B20/30: Detection of binding sites or motifs
            • G16B20/50: Mutagenesis
          • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
            • G16B40/20: Supervised data analysis


Abstract

The invention discloses a method for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification. The method comprises constructing a protein-domain mutation library, panning it, performing high-throughput sequencing, processing the sequencing data, constructing training data sets for an antibody-like prediction model, constructing and training the prediction model on a convolutional neural network, using the model to predict and screen candidate antibody-like binders, and exploring their characteristic sequences. The method can explore the sequence features of antibody-like binders and enables their rapid screening; it helps to discover high-affinity antibody-like binders that are difficult to find by phage display or biopanning alone, and offers low cost and high precision.

Description

Method for predicting antibody-like body combined with biomacromolecule modification based on protein structural domain
Technical Field
The invention belongs to the technical field of bioinformatics, and more particularly relates to a method for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification.
Background
Biomacromolecule modification refers to the covalent attachment of small-molecule groups to the bases of DNA or RNA, or to the amino acid side chains of proteins, thereby regulating the function of the biomacromolecule. Developing antibodies that bind biomacromolecule modifications helps to accurately detect pathological states caused by changes at modification sites, supporting the diagnosis and treatment of disease.
An antibody-like binder is a mutant of a protein domain that binds a biomacromolecule modification (for example, the SH2 domain, which binds phosphorylated tyrosine, or the Chromo domain, which binds methylated lysine) whose affinity for the modification is substantially higher than that of the wild type, approaching antigen-antibody levels (nM or pM). Briefly, an antibody-like binder is a protein that is functionally similar to an antibody and binds its target with high affinity. Because biomacromolecule-modified groups are small, developing antibodies that efficiently bind and enrich them is difficult. Engineering protein domains known to bind biomacromolecule modifications, however, can yield antibody-like binders with high affinity for the modifying group: for example, a mutant library of the target domain is constructed by phage display and screened against the modifying group to find antibody-like binders.
Using the SH2 domain as a research target, large-scale mutagenesis of the phosphorylated-tyrosine (pY) binding pocket has been used to ask whether its affinity for pY can be improved, yielding "SH2 superbinders" (Kaneko, T., et al. (2012). "Superbinder SH2 domains act as antagonists of cell signaling." Science Signaling 5(243): ra68). Similarly, directed evolution has been used to find pY antibody-like binders (Li, S., et al. (2021). "Revisiting the phosphotyrosine binding pocket of Fyn SH2 domain led to the identification of novel SH2 superbinders." Protein Science 30(3): 558-570). However, these methods require phage display and biopanning of a phage-displayed mutant library; the experiments are lengthy and costly. There is therefore a need for a method that can discover antibody-like binders efficiently.
Disclosure of Invention
To address the above technical problems, the invention provides a method for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification, and further provides antibody-like binders, obtained by this method, that bind peptides containing a tyrosine phosphorylation modification.
It is a first object of the present invention to provide a method for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification.
It is a second object of the present invention to provide a system for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification.
It is a third object of the present invention to provide antibody-like binders that bind peptides containing a tyrosine phosphorylation modification, and uses thereof.
The above objects of the present invention are achieved by the following technical scheme:
The invention provides a method for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification, comprising the following steps:
S1, based on the protein domain, analyzing the amino acid sites that may bind the biomacromolecule modification, mutating them to obtain a mutation library, and panning the library;
S2, performing high-throughput sequencing on the mutation library and on the library obtained by panning, and processing the sequencing data to obtain mutant sequences and construct the antibody-like prediction model data sets, namely a binary classification model data set and a regression model data set;
S3, constructing a binary classification prediction model and a regression prediction model based on a convolutional neural network, and training them with the classification and regression data sets, respectively;
S4, scoring the mutant sequences obtained in step S2 with the trained binary classification and regression prediction models, retaining mutant sequences that score higher than the wild type, counting the enriched amino acids at each mutation position of the retained sequences, and combining these amino acids in the order of the mutation positions to generate sequences to be predicted;
S5, calculating the conservation of the amino acid at each mutation position of the sequences to be predicted, and grouping the sequences by the amino acid at the least conserved mutation position; then recalculating the conservation at each mutation position within each group, determining the most conserved amino acid at each position of each group, and joining these most conserved amino acids in order; the resulting sequence is the predicted antibody-like binder.
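The grouping-and-consensus logic of step S5 can be sketched as follows (a minimal sketch; the function names are illustrative, and conservation is measured here simply as the frequency of the most common amino acid at a position, one of several reasonable definitions the text leaves open):

```python
from collections import Counter

def conservation(seqs, pos):
    """Fraction of sequences sharing the most common amino acid at `pos`."""
    counts = Counter(s[pos] for s in seqs)
    return counts.most_common(1)[0][1] / len(seqs)

def predict_antibody_like(seqs):
    """Group by the least conserved position, then build a per-group consensus."""
    npos = len(seqs[0])
    # position where amino acid usage is most variable
    least = min(range(npos), key=lambda p: conservation(seqs, p))
    groups = {}
    for s in seqs:
        groups.setdefault(s[least], []).append(s)
    consensi = []
    for members in groups.values():
        # most conserved (most frequent) amino acid at each position, joined in order
        consensus = "".join(
            Counter(m[p] for m in members).most_common(1)[0][0]
            for p in range(npos)
        )
        consensi.append(consensus)
    return consensi
```

For instance, on the toy set `["AKV", "AKV", "ARV", "ARV"]` position 2 (0-based 1) is the least conserved, so the sequences split into a K group and an R group, each contributing one consensus sequence.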
As one embodiment, when antibody-like binders that bind a peptide containing a tyrosine phosphorylation modification are predicted from the Fyn SH2 domain, the eight mutated amino acid positions correspond to Lys19, Glu41, Thr42, Thr43, Ala46, Ser48, Lys63 and Lys66 of the parent Fyn SH2 domain (shown in SEQ ID NO:1).
Specifically, the mutant library obtained by mutating the SH2 domain is panned with a peptide containing a tyrosine phosphorylation modification.
Specifically, in the method of the present invention, processing the sequencing data in step S2 comprises preliminary processing and denoising. Preliminary processing: the reads from high-throughput sequencing are merged into DNA sequences and quality-controlled, with the quality value of a single base required to be at least 20; DNA sequences in which at least 90% of bases meet this quality threshold are retained and translated into amino acid sequences. Denoising: amino acid sequences with a copy number below 3, and those not present in every panning round, are deleted. The sequences remaining after processing are the mutant sequences.
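The denoising rule can be sketched as follows (an illustrative sketch assuming the per-round counts are held as dictionaries; whether the copy-number threshold applies per round or in total is not specified, so the total is used here):

```python
def denoise(rounds, min_copies=3):
    """rounds: list of dicts mapping amino acid sequence -> copy number,
    one dict per panning round (hypothetical data layout).
    A sequence is kept only if it appears in every round and its total
    copy number reaches min_copies, mirroring the filter described above."""
    in_all = set(rounds[0]).intersection(*rounds[1:])
    return {
        s: [r[s] for r in rounds]
        for s in in_all
        if sum(r[s] for r in rounds) >= min_copies
    }
```

For example, with two rounds `[{"AK": 5, "AR": 1, "AV": 2}, {"AK": 4, "AR": 1}]`, only "AK" survives: "AV" is absent from the second round, and "AR" never accumulates 3 copies.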
Specifically, when translating the DNA sequences into amino acid sequences, stop codons are also translated: the codon TAG is translated as q, TAA as x, and TGA as w.
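A translation routine honoring this stop-codon convention might look like this (a sketch using the standard genetic code; only the q/x/w mapping comes from the text above):

```python
# Standard genetic code laid out in TCAG order: first base varies slowest.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: aa
    for (b1, b2, b3), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AA
    )
}
# Per the scheme above, stop codons get distinct lowercase letters
# instead of '*', so stop-containing mutants stay distinguishable.
CODON_TABLE.update({"TAG": "q", "TAA": "x", "TGA": "w"})

def translate(dna):
    """Translate a DNA sequence (length a multiple of 3) to amino acids."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
```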
Specifically, the amino acids at the mutation sites of each mutant sequence, in the libraries before and after panning, are extracted and joined in order into peptide fragments used to construct the antibody-like prediction model data sets.
Specifically, the binary classification model data set in step S2 comprises two columns of data: an amino acid sequence column, each entry of which consists of the amino acids at the mutation positions of a mutant sequence in order, and a label column whose value is 0 or 1. The label of an amino acid sequence is assigned as follows:
(1) Calculate, for each panning round, the proportion of the sequence's copy number in the total copy number of that round;
(2) Calculate the difference between the sequence's proportions in the rounds before and after panning, and label the sequence according to this difference:
if the proportion after panning is greater than the proportion before panning, the label is 1 and the sequence is a positive sample; otherwise the label is 0 and it is a negative sample. When the numbers of positive and negative samples are unequal, samples are selected from the two extremes (maximum and minimum proportion difference) toward the middle until the numbers are equal.
The regression model data set likewise comprises two columns: an amino acid sequence column (the amino acids at the mutation positions of a mutant sequence, in order) and a label column, whose values are assigned as follows:
(1) Calculate the proportion of the sequence's copy number in the total copy number of each round;
(2) Calculate the ratio of the proportion after panning to the proportion before panning, and take log10 of the ratio; when a proportion is 0, set the copy number of that sequence to 10 and recalculate the proportion;
(3) Using 0 as the boundary of the log10 values, divide the sequences into two classes; let the minority class contain n sequences and the majority class m. All n minority-class sequences are included in the regression data set, and n sequences are selected from the majority class as follows: sort the m sequences by label value, divide them into n equal parts, randomly take one sequence from each part, and include the resulting n sequences in the regression model data set.
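The two labeling schemes can be sketched as follows (illustrative helper names; the class-balancing steps are omitted for brevity, and the pseudo-copy-number of 10 is taken from the rule above):

```python
import math

def proportions(counts):
    """counts: {sequence: copy_number} for one round -> {sequence: fraction}."""
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def binary_labels(before, after):
    """Label 1 if a sequence's proportion rose after panning, else 0."""
    p0, p1 = proportions(before), proportions(after)
    return {s: int(p1.get(s, 0.0) > p0.get(s, 0.0)) for s in set(before) | set(after)}

def regression_label(seq, before, after, pseudo=10):
    """log10 of the after/before proportion ratio; a copy number of 0 is
    replaced by `pseudo` before the proportions are recomputed."""
    b, a = dict(before), dict(after)
    if b.get(seq, 0) == 0:
        b[seq] = pseudo
    if a.get(seq, 0) == 0:
        a[seq] = pseudo
    return math.log10(proportions(a)[seq] / proportions(b)[seq])
```

For example, a sequence rising from 10% to 50% of its round gets binary label 1 and regression label log10(5).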
Specifically, the binary classification prediction model and the regression prediction model in step S3 are constructed as follows:
(1) Determining the feature encoding: the feature vectors generated by encoding are fed to the input layer; common protein encodings are tested, their performance is evaluated by cross-validation, and the best-performing encoding is selected for the data set;
(2) Constructing models consisting of an input layer, convolutional layers, fully connected layers and an output layer. The candidate models differ only in the number of convolutional layers, which ranges from 2 to 6;
the kernel size of the first convolutional layer is fixed at 1 and that of the remaining convolutional layers at 3; the number of kernels in every convolutional layer is fixed at 64; one dropout layer with its rate fixed at 0.5 is added after the convolutional layers;
the number of fully connected sublayers is fixed at 3; the last fully connected sublayer has 1 unit and the remaining sublayers have 32 units each, followed by a dropout layer with its rate fixed at 0.5; a global average pooling layer is placed before the last fully connected sublayer;
the output layer comprises one neuron with a sigmoid activation function; the patience of the early stopping strategy is fixed at 100, the batch_size at 256, and the number of epochs at 1000.
Specifically, when constructing the input, convolutional, fully connected and output layers, the performance of the different candidate models is evaluated by cross-validation, and the best-performing model framework is selected for further optimization.
Specifically, the optimization comprises:
(1) Number of convolution kernels: the number of kernels in each convolutional layer is set to 64 or 128, with the number in a later layer no smaller than in an earlier one; the number per layer is determined by evaluating the candidate models by cross-validation.
(2) Convolution kernel size: the kernel sizes of all convolutional layers except the first are set to 2 to 10, with the size in a later layer no smaller than in an earlier one; the sizes are determined by cross-validation.
(3) Max pooling after convolutional layers: max pooling layers are added after the convolutional layers one at a time, from back to front; the number of max pooling layers to add is determined by cross-validation.
(4) Number of fully connected sublayers: set to 3 to 7 and determined by cross-validation.
(5) Number of units per fully connected sublayer: except for the last sublayer, the number of units is set to 4, 8, 32, 64, 128 or 256, with the number in an earlier sublayer no smaller than in a later one; the number per sublayer is determined by cross-validation.
(6) Max pooling in the fully connected block: except for the last two fully connected sublayers, each sublayer may be followed by a max pooling layer; whether to add one is determined by cross-validation.
(7) Dropout rate: the rate of each dropout layer is set to 0.2 to 0.8 and determined by cross-validation.
(8) Early stopping: the patience value is set to 30, 50, 70, 90, 110, 130 or 150, and the optimum is determined by cross-validation.
(9) batch_size: set to 32, 64, 128, 256 or 512, and the optimum is determined by cross-validation.
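Taken together, steps (1)-(9) amount to a coordinate-wise search: each hyperparameter grid is scanned in turn while the others stay fixed, with cross-validation as the scorer. A schematic sketch, in which `cv_score` stands in for the cross-validated model training and evaluation that the text describes but does not spell out, and only the scalar hyperparameters are shown:

```python
SEARCH_ORDER = [
    ("n_kernels", [64, 128]),
    ("kernel_size", list(range(2, 11))),
    ("n_dense_sublayers", [3, 4, 5, 6, 7]),
    ("dense_units", [4, 8, 32, 64, 128, 256]),
    ("dropout", [round(0.2 + 0.1 * i, 1) for i in range(7)]),  # 0.2 .. 0.8
    ("patience", [30, 50, 70, 90, 110, 130, 150]),
    ("batch_size", [32, 64, 128, 256, 512]),
]

def tune(cv_score, base_config):
    """Scan each hyperparameter grid in order, keeping the best value found
    before moving on; cv_score(config) -> mean cross-validation score."""
    config = dict(base_config)
    for name, grid in SEARCH_ORDER:
        best_val, best_score = config[name], float("-inf")
        for value in grid:
            trial = {**config, name: value}
            score = cv_score(trial)
            if score > best_score:
                best_val, best_score = value, score
        config[name] = best_val
    return config
```

This one-parameter-at-a-time design keeps the number of trained models linear in the grid sizes, rather than exponential as a full grid search would be.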
Specifically, the invention uses One-Hot encoding to convert the amino acid sequences in the data sets into numeric matrices that the CNN model can process: each of the 20 amino acids is encoded as a 20-dimensional binary vector in which the element corresponding to that amino acid is 1 and all other elements are 0.
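The encoding can be written compactly with NumPy (a sketch; the fixed ordering of the 20 amino acids is an arbitrary choice here):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # arbitrary but fixed ordering
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode an amino acid sequence as a (len(seq), 20) binary matrix."""
    mat = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for row, aa in enumerate(seq):
        mat[row, INDEX[aa]] = 1.0
    return mat
```

For example, `one_hot("KETT")` (K, E, T, T being the wild-type residues at the first four mutated Fyn SH2 positions) yields a 4x20 matrix with a single 1 per row.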
Specifically, in step S4, when retaining sequences that score higher than the wild type, the threshold of the binary classification prediction model may be set to 0.5: sequences are retained whose binary classification score exceeds 0.5 and whose regression score exceeds that of the wild-type sequence.
Specifically, in step S4, the enriched amino acids are determined by counting the proportion of each amino acid at every mutation position; amino acids whose proportion at a position exceeds 0.05 are defined as enriched.
A Two-Sample-Logo plot may also be drawn, and the enriched amino acids it reveals are included as well.
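The 0.05-proportion rule for calling enriched amino acids can be sketched as:

```python
from collections import Counter

def enriched_amino_acids(seqs, cutoff=0.05):
    """For each mutation position, return the amino acids whose frequency
    among the retained sequences exceeds `cutoff` (0.05 above)."""
    n = len(seqs)
    result = []
    for pos in range(len(seqs[0])):
        freqs = Counter(s[pos] for s in seqs)
        result.append(sorted(aa for aa, c in freqs.items() if c / n > cutoff))
    return result
```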
Specifically, the sequences to be predicted in step S4 are generated by the following rule: one mutation position at a time is allowed to be any of the 20 common amino acids, while the remaining positions must carry enriched amino acids. In the present invention there are 8 mutation positions in the SH2 domain, so with one position free (20 amino acids) and the rest restricted to enriched amino acids, 5,664,001 sequences to be predicted are generated.
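The combination rule (one free position drawn from all 20 amino acids, every other position restricted to its enriched set) can be sketched with `itertools.product`; a set removes the duplicates that arise when the free position happens to take an enriched amino acid:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def candidate_sequences(enriched):
    """enriched: list of per-position enriched amino acid lists.
    Return every sequence in which at most one position is an arbitrary
    amino acid and the rest are enriched; duplicates are removed."""
    out = set()
    for free in range(len(enriched)):
        choices = [
            AMINO_ACIDS if pos == free else enriched[pos]
            for pos in range(len(enriched))
        ]
        out.update("".join(p) for p in product(*choices))
    return out
```

Applied to the 8 SH2 positions with the enriched sets found above, this rule is what produces the 5,664,001 sequences to be predicted.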
Specifically, if the number of generated sequences to be predicted is large, they can be scored with the trained binary classification and regression prediction models, and only the sequences that score higher than the wild type on both models are kept.
Specifically, if the number of sequences scoring above the wild type on both models is still large, they can be filtered further. Since lower classification and regression scores indicate lower affinity, the sequences with a binary classification score >0.9 and a regression score >1 can be retained; these are called candidate antibody-like binders.
The invention also provides a system for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification, comprising a high-throughput sequencing data processing module, an antibody-like prediction model evaluation module and a predicted sequence output module.
The high-throughput sequencing data processing module processes the sequencing data, constructs the classification and regression model data sets, and submits them to the antibody-like prediction model evaluation module.
The antibody-like prediction model evaluation module comprises the binary classification and regression prediction models; it first trains them on the input classification and regression data sets, then feeds the sequences to be evaluated to both models and outputs their classification and regression scores.
The predicted sequence output module screens sequences by the input classification and regression scores, retains those scoring higher than the wild type, counts the enriched amino acids at each mutation position, combines them in the order of the mutation positions to generate sequences to be predicted, calculates the conservation of the amino acid at each mutation position, and groups the sequences by the amino acid at the least conserved position; it then recalculates the conservation at each mutation position within each group, determines the most conserved amino acid at each position of each group, joins these amino acids in order, and outputs the resulting predicted antibody-like sequences.
The invention also provides antibody-like binders that bind peptides containing a tyrosine phosphorylation modification, predicted by the method of the invention; their amino acid sequences are shown in SEQ ID NO:2-11.
Specifically, the antibody-like binders are M_4V2, M_4V4, M_4R1, M_4R2, M_4R4, M_4F1, M_4F2, M_4N, M_4V1 and M_4F2, whose amino acid sequences are shown in SEQ ID NO:2-11 in that order.
The invention also claims the use of the antibody-like binders in the preparation of products for detecting peptides containing a tyrosine phosphorylation modification.
In particular, such detection includes both qualitative and quantitative detection, i.e., the binders can also be used to determine the concentration of tyrosine-phosphorylated peptides.
Since the antibody-like binders of the invention can be used to detect tyrosine-phosphorylated peptides, the corresponding DNA sequences, i.e., the DNA sequences encoding the binders, can be derived from the amino acid sequences provided herein, and the binders can then be obtained by recombinant expression or similar means. The invention therefore also claims the DNA sequences encoding the antibody-like binders of the invention.
The invention also claims a vector comprising such a DNA sequence encoding an antibody-like binder of the invention.
The invention has the following beneficial effects:
The invention provides a method for predicting, based on a protein domain, antibody-like binders that bind a biomacromolecule modification, comprising constructing a protein-domain mutation library, panning, high-throughput sequencing, processing the sequencing data, constructing training data sets for the antibody-like prediction models, constructing and training the prediction models on a convolutional neural network, and using them to screen candidate binders and explore feature sequences. The method can explore the sequence features of antibody-like binders and screen them rapidly; it helps to find high-affinity binders that are difficult to discover by phage display or biopanning, offers low cost and high precision, and is of great significance for the prediction and screening of antibody-like binders.
Drawings
FIG. 1 is a flow chart of obtaining antibody-like binders that bind a biomacromolecule modification by the method of the invention.
FIG. 2 shows the positions of the 8 pTyr-binding amino acid sites on the Fyn SH2 domain.
FIG. 3 shows the change in the OD450 signal during biopanning of the SH2 mutant library.
FIG. 4 is a Two-Sample-Logo (TSL) plot of the positive and negative sample data sets of the binary classification prediction model.
FIG. 5 is the network architecture of the antibody-like prediction models.
FIG. 6 plots the predicted scores of the regression model against the true values on the independent test set.
FIG. 7 is a Sequence Logo of potential pY antibody-like binders.
FIG. 8 shows Sequence Logos of the sequences with the amino acid at position P4 fixed; panels A-I are the Sequence Logos of the sequences whose fourth position is fixed as R (A), N (B), F (C), D (D), T (E), E (F), V (G), W (H) and I (I), respectively.
FIG. 9 shows the ELISA results of the predicted antibody-like binders with the phosphorylated peptide pYEEI.
FIG. 10 shows the ELISA results of the comparison antibody-like binders with the phosphorylated peptide pYEEI.
FIG. 11 shows the affinity test results of the predicted antibody-like binders with the phosphorylated peptide pYEEI; panels A-L are, in order, the results for M_4V2, M_4V3 (Trm), M_4V4, M_4R1, M_4R2, M_4R3 (V13), M_4R4, M_4F1, M_4S1, M_4N, M_4N1 and M_4N2 with pYEEI.
FIG. 12 shows the affinity test results of the predicted antibody-like binders with the phosphorylated peptide GGpYGG; panels A-L are, in order, the results for M_4V2, M_4V3 (Trm), M_4V4, M_4R1, M_4R2, M_4R3 (V13), M_4R4, M_4F1, M_4S1, M_4N, M_4N1 and M_4N2 with GGpYGG.
Detailed Description
The invention is further illustrated in the following drawings and specific examples, which are not intended to limit the invention in any way. Unless specifically stated otherwise, the reagents, methods and apparatus employed in the present invention are those conventional in the art.
Reagents and materials used in the following examples are commercially available unless otherwise specified.
The reliability of the method for predicting antibody-like bodies binding biomacromolecule modifications based on a protein domain is demonstrated by taking the prediction of pY antibody-like bodies as an example. The flow chart of obtaining antibody-like bodies binding a biomacromolecule modification with the method is shown in FIG. 1 and comprises the following 5 steps: data collection and processing, construction of the antibody-like prediction model, prediction of antibody-like bodies, exploration of antibody-like sequence features, and experimental verification.
Example 1 construction of class 1 antibody predictive model training dataset
The mutation library constructed in the reference (Li S, et al. Revisiting the phosphotyrosine binding pocket of Fyn SH2 domain led to the identification of novel SH2 superbinders. Protein Sci, 2021, 30(3): 558-70.) was used as training data: the mutation library and the high-throughput sequencing data of the libraries obtained after panning were used to train a model for predicting pY antibody-like bodies. Specifically, the mutation library was obtained by randomly mutating 8 amino acid sites on the Fyn SH2 domain; the positions of these 8 pTyr-binding sites on the Fyn SH2 domain are shown in FIG. 2. On this basis, biopanning experiments were performed using phage display technology, and affinity tests were performed with synthetic peptides containing pTyr. The peptides used to experimentally verify the affinity of the Fyn SH2 mutants are shown in Table 1; the peptides are biotinylated at the N-terminus and amidated at the C-terminus.
TABLE 1 synthetic peptides containing pTyr for experimental verification of Fyn SH2 mutant affinity
Note that: "a": 6-amino acetic acid; "amide": amidated C-terminal (-CONH) 2 ) The method comprises the steps of carrying out a first treatment on the surface of the "pY": phosphorylated tyrosine
For the constructed mutation library and the panned libraries, high-throughput sequencing was performed on an Illumina second-generation high-throughput sequencing platform, and the data obtained from sequencing were processed to construct the training dataset (hereafter, the dataset) of the pY antibody-like prediction model.
The data from high-throughput sequencing cannot be used directly for training the antibody-like prediction model; they must first be processed and converted into complete DNA sequences. On this basis, the dataset for training the antibody-like prediction model is constructed. The processing comprises preliminary processing and denoising, as follows:
1. preliminary processing of high throughput sequencing data;
(1) Splicing the forward and reverse sequencing reads of the same sequence; the two paired reads belonging to the same sequence are merged with the FLASH software, yielding a single sequence of the same length as the sequenced fragment;
(2) Judging the direction of the spliced sequences; taking the 5 'end to the 3' end of the wild sequence as a forward direction, defining a reverse complementary sequence as a reverse direction, and judging the direction of the sequence of each round after panning;
(3) Performing reverse complementation operation on the reverse sequence; if the spliced sequence is a reverse sequence, performing reverse complementary operation on the sequence, and adjusting the sequence to be a forward sequence;
(4) Quality control; quality control of the spliced sequences is performed with the fastq_quality_filter sub-tool of the FASTX-Toolkit, using the command "fastq_quality_filter -q 20 -p 90 -i input.fq -o out.fq -Q 33". The command keeps only DNA sequences in which at least 90% of the bases on a single sequence have a single-base quality value above 20;
(5) Translating the DNA sequence into an amino acid sequence; the DNA sequence is translated according to the triplet codon principle. Note that, to retain as much sequence information as possible, stop codons are also translated, using the following rules: the codon TAG is translated to q; the codon TAA is translated to x; the codon TGA is translated to w;
(6) Calculating the number of amino acid sequences; the number of each amino acid sequence was calculated.
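The preliminary-processing steps (5) and (6) above can be sketched in Python. This is a minimal illustration, not the patent's actual pipeline: the codon table here is a deliberately truncated excerpt (a real run needs all 61 sense codons), while the stop-codon letters follow the rules stated in step (5).

```python
from collections import Counter

# Custom stop-codon letters from step (5): TAG -> q, TAA -> x, TGA -> w,
# so reads containing a stop codon keep their sequence information.
STOP_OVERRIDES = {"TAG": "q", "TAA": "x", "TGA": "w"}

CODON_TABLE = {
    # illustrative excerpt of the standard codon table
    "GAA": "E", "GAG": "E", "TCT": "S", "GTT": "V", "AAA": "K", "CTG": "L",
}

def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon, applying the stop overrides."""
    aa = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3].upper()
        aa.append(STOP_OVERRIDES.get(codon, CODON_TABLE.get(codon, "X")))
    return "".join(aa)

def count_sequences(amino_seqs):
    """Step (6): copy number of each distinct amino acid sequence."""
    return Counter(amino_seqs)
```

For example, `translate("TAGTAATGA")` yields `"qxw"`, preserving the three stop codons as lowercase letters rather than truncating the read.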
After completion of the high throughput sequencing data processing of the SH2 mutation library and each panning round, DNA sequences of length 240 and amino acid sequences of length 80 were obtained. To facilitate representation of the panning rounds, the original mutant library is denoted Round0 and the panning of the first Round through the fourth Round is denoted Round1 through Round4, respectively. The data statistics of the SH2 mutant sequences obtained after the SH2 mutant library and the SH2 panning round sequencing data are processed are shown in Table 2.
TABLE 2 statistical results of SH2 mutants after sequencing data processing of mutant library and panned library
Round0 Round1 Round2 Round3 Round4
Number of protein sequence types 823,190 749,341 522,458 131,107 49,211
Protein sequence copy number 1,571,441 1,582,204 1,791,148 4,509,187 3,331,589
Average copy number of protein sequences 1.91 2.11 3.43 34.40 67.70
The average copy number of a protein sequence in Table 2 is the protein sequence copy number divided by the number of protein sequence types. The average copy number can reflect whether dominant mutants (mutants more prone to binding pY) emerge during biopanning. The data in Table 2 show that, as biopanning proceeds, the number of protein sequence types gradually decreases while the copy numbers tend to increase. The average copy number remained essentially unchanged from Round0 to Round2 (Table 2) and increased progressively from Round2 to Round4, indicating that pY antibody-like sequences gradually gained a growth advantage between Round2 and Round4.
2. Denoising process
The denoising process of the data is performed according to the following standard:
(1) Deleting the sequence containing the stop codon;
(2) Amino acid sequences with copy numbers less than 3 and not present in all panning rounds were deleted.
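The two denoising criteria can be sketched as a filter in Python. This is a hypothetical reading of the rules: the patent does not state in which round the copy-number threshold is evaluated, so the sketch uses the maximum copy number across rounds.

```python
def denoise(round_counts):
    """
    round_counts: one dict per panning round (Round0..Round4), mapping
    amino acid sequence -> copy number.
    A sequence is kept unless it (1) contains a stop-codon letter, or
    (2) never reaches copy number 3 and is absent from some round.
    """
    STOP_LETTERS = set("qxw")  # stop-codon translation letters
    kept = []
    for seq in set().union(*round_counts):
        if STOP_LETTERS & set(seq):
            continue  # criterion (1): drop sequences containing a stop codon
        in_all_rounds = all(seq in rc for rc in round_counts)
        max_copies = max(rc.get(seq, 0) for rc in round_counts)
        if max_copies >= 3 or in_all_rounds:
            kept.append(seq)  # survives criterion (2)
    return kept
```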
And denoising the data in Round0 to Round4 according to the denoising standard. The data statistics of SH2 mutants in each round of panning after denoising are shown in Table 3.
TABLE 3 statistical results of data for SH2 mutants after denoising of mutant library and panning library
Round0 Round1 Round2 Round3 Round4
Number of protein sequence types 87,334 101,369 97,991 49,436 23,514
Protein sequence copy number 635,797 724,718 1,239,083 4,400,205 3,299,612
Average copy number of protein sequences 7.28 7.15 12.64 89.01 140.33
Comparing the data in Tables 2 and 3 shows a strong correlation between the average copy numbers of the protein sequences before and after denoising (Pearson correlation coefficient: 0.99): as panning proceeds, the number of sequence types in the library gradually decreases while the average copy number of the protein sequences gradually increases. The average copy number is essentially unchanged from Round0 to Round2 and increases gradually in the rounds after Round2, indicating that denoising preserved the trends in the data.
The amino acids at the 8 mutation sites of the SH2 mutant sequences before and after panning were extracted and concatenated in order into peptide segments 8 amino acids long, which were used to construct the antibody-like prediction model datasets.
3. Construction of antibody-like predictive model data sets
The antibody-like predictive model comprises a classification predictive model and a regression predictive model. A description of these two predictive models and their datasets is shown below:
two-class prediction model and data set thereof: the two-class prediction model is used for predicting the probability of specific binding of SH2 mutants and pY peptide fragments; the dataset is constructed based on the difference in magnitude of increase between the two rounds.
Antibody-like regression prediction model: the quasi-antibody regression prediction model is used for predicting the change condition of the SH2 mutant copy number; the dataset is constructed based on the growth amplitude ratio data between the two rounds.
A higher score from the binary classification prediction model indicates a higher probability that the mutant binds pY specifically; a higher score from the regression prediction model indicates a larger increase in the mutant's copy number between rounds. Mutants with both scores high are predicted to be potential pY antibody-like bodies with higher affinity.
Furthermore, the literature (Li, S., et al. (2021). "Revisiting the phosphotyrosine binding pocket of Fyn SH2 domain led to the identification of novel SH2 superbinders." Protein Science 30(3): 558-570) measured the OD450 signal of the library before and after each panning round; the change of the OD450 signal during biopanning of the SH2 mutant library is shown in FIG. 3. The OD450 signal is essentially unchanged between Round0 and Round2 and increases gradually between Round2 and Round4, consistent with the changes in average copy number in Tables 2 and 3. Training the antibody-like prediction model requires only two rounds of data; because the dominant sequences change markedly between Round2 and Round4 during biopanning, the sequences in Round2 and Round4 were selected as the training data of the antibody-like prediction model.
(1) Construction of two-class predictive model data set
The training dataset of the binary classification prediction model includes two columns: an amino acid sequence of length 8 (the sequence column) and the label of the sequence (the difference column, 0 or 1). A 1 in the label column indicates that the SH2 mutant sequence can bind the pY-modified peptide fragment; otherwise, the label is 0. The labels of the sequences in the training data (i.e., the values in the difference column) are defined as follows:
1) Firstly, respectively calculating the ratio of the sequence copy numbers of SH2 mutant sequences corresponding to amino acid sequences in Round4 and Round2 to the sum of the copy numbers of all mutants in the current Round;
2) Calculating the difference between the ratio of the SH2 mutant sequence corresponding to the amino acid sequence in Round4 and the ratio of the SH2 mutant sequence in Round2, namely subtracting the ratio of the SH2 mutant sequence in Round2 from the ratio of the SH2 mutant sequence corresponding to the amino acid sequence in Round 4;
3) After calculating the difference between the ratio of Round4 and the ratio of Round2 of the SH2 mutant sequence corresponding to an amino acid sequence, the sequence is subjected to label definition according to the following standard:
if the ratio difference of the amino acid sequence in the two panning rounds is greater than 0, the difference value of the sequence is defined as 1, which represents that the mutant sequence can be combined with the pY peptide fragment; whereas 0 indicates that the mutant sequence cannot specifically bind to the pY peptide fragment.
Labels were defined for the processed sequences according to the above method, giving 102,746 sequences in total: 8,460 sequences have a difference value greater than 0 and are positive samples; the remaining 94,286 sequences have a difference value less than or equal to 0 and are negative samples. To keep the numbers of positive and negative samples in the dataset balanced, all positive samples were included in the antibody-like prediction model dataset, and 8,461 negative sequences, taken upward from the sequence with the smallest difference value, were included as well. The training dataset of the binary classification prediction model was thus obtained.
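The labeling rule of steps 1)-3) can be sketched as follows; `round2_counts` and `round4_counts` are assumed dicts mapping each 8-residue sequence to its copy number in that round.

```python
def label_sequences(round2_counts, round4_counts):
    """Label a sequence 1 if its library fraction grew from Round2 to
    Round4 (difference of fractions > 0), else 0."""
    total2 = sum(round2_counts.values())
    total4 = sum(round4_counts.values())
    labels = {}
    for seq in set(round2_counts) | set(round4_counts):
        frac2 = round2_counts.get(seq, 0) / total2  # step 1): per-round fraction
        frac4 = round4_counts.get(seq, 0) / total4
        labels[seq] = 1 if (frac4 - frac2) > 0 else 0  # steps 2)-3)
    return labels
```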
To examine the feature differences between the positive and negative sample datasets, the invention plotted Two-Sample-Logo (TSL) diagrams (P < 0.05, t-test with Bonferroni correction) for the two datasets, as shown in FIG. 4. As can be seen from FIG. 4, there are significant feature differences between the positive samples, which bind the pY peptide fragment specifically, and the negative samples, which do not; the maximum difference value is 59.9%. In the positive samples, glutamate (E) is significantly enriched at mutation positions 1 and 2; serine at position 3; valine and alanine at position 6; lysine at position 7; and leucine at position 8. In the negative samples, threonine is significantly enriched at position 4; serine at position 6; and lysine at position 8. The clear differences between positive and negative samples reflect the predictability of antibody-like bodies.
To better evaluate the predictive power of the antibody-like binary classification model, 4/5 of the binary classification dataset was randomly selected as the cross-validation set; the remaining 1/5 served as the independent test set (Table 4).
TABLE 4 Statistics of the binary classification prediction model dataset
(2) Construction of regression prediction model dataset
The training dataset of the regression prediction model includes two columns: an amino acid sequence of length 8 (the sequence column) and the label of the sequence (the log(ratio) column). The values in the log(ratio) column are calculated as follows:
1) First, for the SH2 mutant sequence corresponding to each amino acid sequence, calculate the ratio of its copy number to the total copy number in Round4 and in Round2, respectively;
2) Calculate the ratio of the sequence's fraction in Round4 to its fraction in Round2, and take log10. A value greater than 0 indicates that the sequence's fraction grew from Round2 to Round4; a value less than 0 indicates that it shrank. To avoid a denominator of 0 when calculating the ratio, the following treatment is applied: when the fraction in the denominator is 0, the copy number of the sequence whose copy number is 0 is set to 10 and the fraction is recalculated.
3) Dividing the label value into two types according to a log10 operation result by taking 0 as a boundary, wherein the number of sequences of the small number of types is n, the number of sequences of the large number of types is m, and the sequences of the small number of types are incorporated into a regression prediction model data set; selecting the same number of sequences from the multiple number of classes, and also incorporating the sequences into the regression prediction model dataset; the selection method comprises the following steps: the m sequences are arranged according to the size of the tag value, the m sequences are equally divided into n parts, one sequence is randomly taken from each part, and finally n sequences are obtained and are included in the regression prediction model data set.
As calculated, between Round2 and Round4 there are 2,253 sequences with log(ratio) greater than 0, while the log(ratio) of the remaining 100,493 sequences is less than 0. The regression prediction model dataset was then constructed as follows:
(1) incorporating all sequences of log (ratio) >0 into the regression prediction model dataset;
(2) the sequences with log(ratio) < 0 were sorted in descending order of log(ratio) and divided into 2,253 equal parts; one sequence was taken at random from each part, and these 2,253 sequences were included in the regression prediction model dataset. The dataset of the regression prediction model was thus obtained.
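The log(ratio) calculation and the class-balancing draw above can be sketched in Python. The zero-denominator pseudo-count of 10 follows the text's description; treating it as a replacement copy number is an interpretation.

```python
import math
import random

def log_ratio(c2, c4, total2, total4, zero_pseudocount=10):
    """log10 of (Round4 fraction / Round2 fraction); a zero Round2 copy
    number is replaced by the pseudo-count described in the text."""
    if c2 == 0:
        c2 = zero_pseudocount
    return math.log10((c4 / total4) / (c2 / total2))

def balance(majority_items, n, seed=0):
    """Sort the majority class by label value, split into n equal parts,
    and draw one (sequence, label) pair at random from each part."""
    rng = random.Random(seed)
    ordered = sorted(majority_items, key=lambda kv: kv[1], reverse=True)
    step = len(ordered) / n
    return [rng.choice(ordered[int(i * step):int((i + 1) * step)])
            for i in range(n)]
```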
The regression model dataset was divided into a training set and an independent test set at a ratio of 4:1, as follows: the dataset was sorted in descending order by the log(ratio) column and divided into groups of 5; from each group, 1 sequence was taken for the independent test set, and the remainder formed the training set (Table 5).
TABLE 5 class antibody regression prediction model dataset statistics
Training set Independent test set
3606 904
Example 2 construction of class 2 antibody predictive model and performance evaluation thereof
Deep learning is a branch of machine learning. Based on a deep learning algorithm, the invention searches for pY antibody-like bodies capable of binding pY peptide fragments specifically and with high affinity, predicting them from the changes of the various mutants during biopanning.
Deep learning is often in the form of a multi-layer neural network. Most neural networks consist essentially of the following core principles:
(1) Linear units and nonlinear units are used alternately, which are often referred to as "layers".
(2) The parameters of the network are updated using the chain rule (i.e., back-propagation).
The convolutional neural network (Convolutional Neural Network, CNN) is a commonly used deep learning algorithm: a feed-forward neural network composed of convolutional and pooling layers. The basic structure of a CNN consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The input layer receives the training data, which are usually converted into a numeric matrix. The convolutional layers extract features present in the input data. The pooling layers downsample the features extracted by the convolutional layers, greatly reducing data redundancy and mitigating model overfitting. The fully connected layers summarize the information processed by the preceding network layers, and the result is finally emitted through the output layer. The output layer usually contains an activation function, which maps the output to a specified range.
1. Construction of antibody-like predictive model
The antibody-like prediction model comprises a classification prediction model and a regression prediction model, which are constructed based on CNN, and CNN network structures of the two models are completely consistent. Specifically, the invention uses specific sequence feature codes to complete the construction of a prediction model and perform performance test in combination with a CNN network architecture, and the network architecture diagram used by the invention is shown in FIG. 5.
The CNN network architecture used is described as follows:
(1) Input layer: and an input layer of the feature vector generated after feature encoding.
(2) Convolutional layers: four convolution sub-layers and two max-pooling layers; each convolution sub-layer has 128 convolution kernels, of sizes 1, 3, 9 and 10, respectively. The two max-pooling layers are placed after the third and fourth convolution sub-layers, respectively.
(3) Fully connected layers: this section consists of four fully connected layers, two max-pooling layers and one global average pooling layer, combined as follows: each of the first two fully connected layers is followed by a max-pooling layer; the third fully connected layer is followed by a global average pooling layer; the last fully connected layer comes last. The numbers of units of the fully connected layers are, in order, 64, 32, 8 and 1.
(4) Output layer: this layer contains one neuron and outputs a score. It is activated by the sigmoid activation function and outputs a value in the range 0 to 1.
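A Keras sketch approximating the architecture described above is given below. It is an illustration, not the patent's actual implementation: 'same' padding is assumed so the length-8 axis survives the large kernels, Dense layers act position-wise on the 3-D tensor so pooling between them remains possible, and only one max-pool fits in the dense section because the sequence axis is exhausted after it.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(seq_len=8, n_aa=20):
    inp = layers.Input(shape=(seq_len, n_aa))
    # Four convolution sub-layers, 128 kernels each, kernel sizes 1/3/9/10.
    x = layers.Conv1D(128, 1, padding="same", activation="relu")(inp)
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = layers.Conv1D(128, 9, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)                      # after 3rd conv: 8 -> 4
    x = layers.Conv1D(128, 10, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)                      # after 4th conv: 4 -> 2
    # Dense section with 64/32/8/1 units, applied position-wise.
    x = layers.Dense(64, activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)                      # 2 -> 1 (axis exhausted)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(8, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)     # score in [0, 1]
    return tf.keras.Model(inp, out)
```

The same `build_model` skeleton serves both the classification and the regression model, since the text states their network structures are identical; only the loss function would differ at compile time.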
The present invention uses One-Hot encoding to convert the amino acid sequences in the dataset into a numeric matrix recognizable by the CNN model. In One-Hot encoding, each of the 20 amino acids is encoded as a 20-dimensional binary vector: in the vector corresponding to an amino acid, the element associated with that amino acid is set to 1 and all other elements to 0. For example, "A" is represented by "10000000000000000000" and "C" by "01000000000000000000".
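A minimal One-Hot encoder is sketched below; the alphabetical one-letter ordering is an assumption, but it reproduces the text's examples ("A" maps to index 0, "C" to index 1).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical one-letter order

def one_hot(seq):
    """Encode an amino acid sequence as a list of 20-dimensional 0/1 vectors."""
    matrix = []
    for aa in seq.upper():
        vec = [0] * len(AMINO_ACIDS)
        vec[AMINO_ACIDS.index(aa)] = 1  # single 1 at the residue's position
        matrix.append(vec)
    return matrix
```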
2. Predictive model performance assessment
(1) Performance evaluation index of two-class prediction model
In the binary classification task, sequences can be classified according to the combination of the true class and the predicted class: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). From these, performance indices such as Sensitivity (SN), Specificity (SP), overall Accuracy (ACC) and the Matthews correlation coefficient (MCC) can be calculated. The invention also calculates the Area Under the ROC Curve (AUC) to evaluate the overall performance of the binary classification prediction model.
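The confusion-matrix-based indices can be computed from the TP/FP/TN/FN counts as sketched below (AUC requires ranked scores rather than hard labels and is omitted here).

```python
import math

def classification_metrics(y_true, y_pred):
    """SN, SP, ACC and MCC from binary true/predicted labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}
```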
(2) Regression prediction model performance evaluation index
The common evaluation indices of a regression model mainly include: Mean Absolute Error (MAE), Mean Square Error (MSE) and goodness of fit (R-squared, R2).
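These three regression indices follow their standard definitions, sketched here in plain Python.

```python
def regression_metrics(y_true, y_pred):
    """MAE, MSE and R-squared under their usual definitions."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)   # total variance
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "R2": r2}
```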
The invention uses a convolutional neural network as the prediction algorithm and One-Hot encoding for the input data to construct the binary classification prediction model, whose function is to predict whether a given sequence binds pY specifically. The prediction model is trained by ten-fold cross-validation on the cross-validation set, and its performance after training is then tested on the independent dataset. The performance statistics of the prediction model under ten-fold cross-validation and independent testing are shown in Table 6 below.
TABLE 6 Performance statistics of the binary classification prediction model
SN SP ACC MCC AUC
Ten-fold cross-validation 0.899±0.015 0.957±0.0127 0.929±0.007 0.859±0.014 0.966±0.005
Independent testing 0.891±0.007 0.963±0.004 0.927±0.002 0.857±0.004 0.968±0.002
The regression model was constructed using a network architecture identical to that of the binary classification model, trained on the training set, and then tested on the independent dataset; the results are shown in Table 7.
TABLE 7 regression model Performance statistics
R2 MAE MSE
0.7704 0.1911 0.0786
To further examine the accuracy of the trained regression model, the invention calculated the correlation between the prediction score (Pre_log) and the true value (Log) on the independent test set; the scatter plot of their relationship is shown in FIG. 6. As can be seen from FIG. 6, the goodness of fit (R2) is 0.7704, indicating excellent overall fitting ability.
Example 3A method of predicting antibody-like binding to biomacromolecule modification based on protein Domain
After the antibody-like prediction model has been constructed and trained, it can be used to explore sequence features and to predict antibody-like bodies binding biomacromolecule modifications. After prediction, the corresponding antibody-like sequences are obtained from the sequence features, and the predicted antibody-like bodies are evaluated by protein enzyme-linked immunosorbent assay (protein ELISA) and affinity measurement.
The process of exploring sequence features and predicting antibody-like bodies based on sequence features is as follows:
During biopanning of the SH2 mutant library with pY peptide fragments, mutants more prone to binding pY specifically tend to show an increasing fraction in each panning round. To find the sequence features of pY antibody-like bodies, the amino acids showing an enrichment trend at each mutation position during panning are first identified; the enriched amino acids are then permuted in the order of the mutation positions to generate candidate amino acid sequences of length 8 to be predicted.
The determination method of the enriched amino acid is as follows:
(1) scoring all denoised sequencing sequences with the binary classification model and the regression model;
(2) selecting sequences with scores greater than a threshold value; threshold value: the score of the two-classification prediction model is greater than 0.5; the score of the regression prediction model is greater than that of the Fyn SH2 wild type;
(3) counting the ratio of amino acids at each mutation position;
(4) amino acids with a ratio greater than 0.05 at a position (high-frequency amino acids) and amino acids enriched in the TSL plot (TSL-enriched amino acids) are defined as enriched amino acids.
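Steps (3)-(4) can be sketched as follows; `tsl_enriched` is an assumed precomputed mapping from position to the TSL-enriched amino acids of FIG. 4.

```python
from collections import Counter

def enriched_amino_acids(high_scoring_seqs, tsl_enriched, freq_cutoff=0.05):
    """
    high_scoring_seqs: length-8 sequences passing both model-score thresholds.
    tsl_enriched: dict position -> set of TSL-enriched amino acids.
    Returns, per mutation position, the union of high-frequency amino acids
    (ratio > freq_cutoff) and TSL-enriched amino acids.
    """
    enriched = {}
    for pos in range(len(high_scoring_seqs[0])):
        counts = Counter(seq[pos] for seq in high_scoring_seqs)
        total = sum(counts.values())
        high_freq = {aa for aa, c in counts.items() if c / total > freq_cutoff}
        enriched[pos] = high_freq | tsl_enriched.get(pos, set())
    return enriched
```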
Generating the sequences to be predicted: after the enriched amino acids of each mutation position are determined, the sequences to be predicted are generated by permutation in the order of the mutation positions. The specific rule is: one of the 8 positions is allowed to be any of the 20 common amino acids, while the remaining 7 positions must carry enriched amino acids. On this basis, 5,664,001 sequences were generated. Because an antibody-like body's model scores should be higher than those of the Fyn SH2 wild-type sequence, the generated sequences were scored again with the trained antibody-like prediction models. 7,466 sequences scored higher than the wild type (wild-type classification model score: 0.0012; regression model score: 0.2787), among which 4,971 sequences to be predicted were not present in the sequencing data.
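The permutation rule can be sketched with `itertools.product`; duplicates arising when the free position happens to hold an enriched amino acid are collapsed by the set.

```python
from itertools import product

ALL_AA = "ACDEFGHIKLMNPQRSTVWY"

def candidate_sequences(enriched_per_pos):
    """enriched_per_pos: list of sets, one per mutation position.
    Enumerates sequences where exactly one position ranges over all 20
    amino acids and the rest carry enriched amino acids."""
    seen = set()
    n = len(enriched_per_pos)
    for free_pos in range(n):
        pools = [list(ALL_AA) if i == free_pos else sorted(enriched_per_pos[i])
                 for i in range(n)]
        for combo in product(*pools):
            seen.add("".join(combo))
    return seen
```

With a toy two-position input of one enriched amino acid each, the enumeration yields 20 + 20 - 1 = 39 distinct sequences, since the fully enriched sequence is generated twice.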
To select high-affinity pY antibody-like bodies (the higher the two model scores, the higher the expected affinity), the 7,466 sequences were further filtered. Filtering criteria: binary classification model prediction score > 0.9 and regression model prediction score > 1. 293 sequences were selected and designated candidate antibody-like bodies. Notably, the superbinders reported in the literature (Kaneko, T., et al. (2012). "Superbinder SH2 domains act as antagonists of cell signaling." Science Signaling 5(243): ra68; Li, S., et al. (2021). "Revisiting the phosphotyrosine binding pocket of Fyn SH2 domain led to the identification of novel SH2 superbinders." Protein Science 30(3): 558-570), such as the SH2 superbinders V3, V10, V13 and V24, are present among the potential pY antibody-like bodies. In addition, 108 of the candidate antibody-like bodies were not present in the sequencing data.
Since the number of candidate antibody-like bodies was still large, Sequence Logo conservation analysis was further performed on them; the result is shown in FIG. 7. As can be seen from FIG. 7, conservation is lowest at position P4 and relatively high at the remaining positions. Because the amino acid preference at P4 could not be determined, the 293 candidate antibody-like bodies were split into 9 groups according to the amino acid at P4, and Sequence Logo analysis was performed again for each group; the result is shown in FIG. 8. As can be seen from FIG. 8, the amino acids at each position within each group are well conserved, and sequences whose amino acids are conserved at every position are considered more likely to be antibody-like bodies. Conserved amino acids are defined as the one or two most frequent amino acids at a mutation position. The most conserved amino acid at each mutation position of each group was determined, and combining these in order gives the predicted antibody-like sequence.
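The final consensus step — taking the most conserved amino acid at each position within a group — reduces to a per-position majority vote, sketched below.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent amino acid at each position; joining them gives the
    group's predicted antibody-like sequence."""
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0]
        for i in range(len(seqs[0]))
    )
```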
Based on this method, the invention lists 18 conserved amino acid sequences and randomly selected 4 candidate antibody-like bodies from the 293 as comparisons; the corresponding antibody-like sequences were obtained from the sequence features, and ELISA and affinity measurement experiments were performed. The 18 conserved sequences and their affinity measurements are summarized in Table 8; these sequences were finally predicted as SH2 antibody-like bodies and designated M_4X (where 4 denotes the 4th mutation position and X the amino acid at that position). The predicted SH2 antibody-like bodies include 4 previously reported SH2 antibody-like bodies (Trim, V13, V50 and V3). The 4 comparison mutants randomly selected from the 293 candidates were designated C_4X; their sequences and affinity measurements are shown in Table 9.
Table 8 SH2 predictive antibody sequence Listing
1 Naming principle: M denotes mutant, 4 denotes the 4th mutation position, the letter after 4 denotes the amino acid at that position, and the trailing number is the sequence's index within the group. For example, M_4T1 is the first member of the group of mutants with T at position 4.
2 KD columns represent the KD values of the antibody-like sequences and the polypeptide pYEEI.
"Not measured (low ELISA value)" in the table means that the sequence gave a low ELISA signal and no further affinity measurement was performed. Some sequences were not measured because antibody-like sequences sharing the same sequence features have comparable affinities, so it is unnecessary to test every predicted antibody-like sequence with the same features; omitting a few such sequences does not affect the conclusions.
TABLE 9 randomly selected antibody-like sequence listing
1 Naming principle: C denotes a mutant used for comparison, 4 denotes the 4th mutation position, the letter after 4 denotes the amino acid at that position, and the trailing number is the sequence index.
ELISA experiments were performed on the predicted antibody-like bodies (Table 8) and the comparison antibody-like bodies (Table 9) with the phosphorylated peptide fragment pYEEI, with the Wild Type (WT) and the mutant Trim as controls. The ELISA results of the predicted and comparison antibody-like bodies with pYEEI are shown in FIGS. 9 and 10, respectively. For mutants with ELISA signals higher than WT, affinity was further measured by Biacore against the two peptide fragments pYEEI and GGpYGG; the affinity measurements of the predicted antibody-like bodies with pYEEI and GGpYGG are shown in FIGS. 11 and 12, respectively. From the results in Tables 8 and 9 and FIGS. 11 and 12, the affinity of the predicted antibody-like bodies is much higher than that of the comparison mutants and similar to that of the reported antibody-like bodies. Furthermore, the affinity of the comparison antibody-like bodies is mostly higher than that of WT. These results show that the artificial-intelligence-based antibody-like prediction algorithm is effective and reliable, and that the method for predicting antibody-like bodies binding biomacromolecule modifications can be used to obtain high-affinity antibody-like bodies.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (10)

1. A method for predicting antibody-like molecules that bind a biomacromolecule modification based on a protein structural domain, comprising the steps of:
S1, based on the protein structural domain, analyzing the amino acid sites that may bind the biomacromolecule modification, mutating these sites to obtain a mutation library, and performing panning;
S2, performing high-throughput sequencing on the mutation library and on the libraries obtained by panning, processing the sequencing data to obtain mutant sequences, and constructing the antibody-like prediction model datasets; the antibody-like prediction model datasets comprise a classification model dataset and a regression model dataset;
S3, constructing a classification prediction model and a regression prediction model for antibody-like molecules based on a convolutional neural network, and training the constructed classification prediction model and regression prediction model with the classification model dataset and the regression model dataset, respectively;
S4, scoring the mutant sequences obtained in step S2 with the trained classification prediction model and regression prediction model, respectively; retaining the mutant sequences whose scores are higher than that of the wild type; counting the enriched amino acids at each mutation position of the retained sequences; and permuting and combining these amino acids in the order of the mutation positions to generate sequences to be predicted;
S5, calculating the conservation of the amino acid at each mutation position in the sequences to be predicted, and grouping the sequences to be predicted according to the amino acid at the least conserved mutation position; recalculating the conservation of the amino acid at each mutation position within each group, determining the most conserved amino acid at each position in each group, and combining the most conserved amino acids in positional order to form the predicted antibody-like sequence.
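Step S5 can be illustrated by a minimal Python sketch. The function names, the tie-breaking behaviour, and the use of raw most-common-residue frequency as the conservation score are illustrative assumptions, not part of the claimed method.

```python
from collections import Counter

def conservation(seqs, pos):
    """Fraction of sequences sharing the most common amino acid at pos."""
    counts = Counter(s[pos] for s in seqs)
    return counts.most_common(1)[0][1] / len(seqs)

def predict_consensus(seqs):
    """Group candidate sequences by the residue at the least conserved
    position, then take the most conserved amino acid at every position
    within each group to form one predicted sequence per group."""
    length = len(seqs[0])
    least = min(range(length), key=lambda p: conservation(seqs, p))
    groups = {}
    for s in seqs:
        groups.setdefault(s[least], []).append(s)
    predictions = []
    for members in groups.values():
        consensus = "".join(
            Counter(s[p] for s in members).most_common(1)[0][0]
            for p in range(length)
        )
        predictions.append(consensus)
    return predictions
```

For example, with candidates `["AKD", "AKD", "ARD", "ARD", "ARE"]`, position 2 (K/R) is the least conserved, so the candidates split into a K group and an R group, each collapsed to its own consensus.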
2. The method of claim 1, wherein processing the sequencing data in step S2 comprises preliminary processing and denoising; the preliminary processing is: merging the DNA sequences obtained from the high-throughput sequencing data, performing quality control with a single-base quality threshold of not less than 20, retaining the DNA sequences in which at least 90% of the bases meet the quality threshold, and translating the retained DNA sequences into amino acid sequences; the denoising is: deleting amino acid sequences whose copy number is less than 3, as well as amino acid sequences not present in all panning rounds; the mutant sequences are obtained after this processing.
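A minimal Python sketch of the quality filter and the denoising step described above. The function names, the Phred-style quality list, and the Counter-per-round representation are illustrative assumptions; the copy-number threshold is applied per round here, which is one possible reading of the claim.

```python
from collections import Counter

def passes_qc(quals, min_q=20, min_frac=0.90):
    """Retain a read only if at least 90% of its bases have quality >= 20."""
    return sum(q >= min_q for q in quals) / len(quals) >= min_frac

def denoise(rounds):
    """rounds: one Counter per panning round mapping an amino acid sequence
    to its copy number.  Delete sequences absent from any round, and
    sequences whose copy number falls below 3 (checked in every round)."""
    present_in_all = set(rounds[0]).intersection(*rounds[1:])
    return {s for s in present_in_all if all(r[s] >= 3 for r in rounds)}
```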
3. The method of claim 1, wherein the classification prediction model dataset of step S2 comprises two columns of data: a data column and a label column; the data column contains amino acid sequences consisting of the amino acids at the mutation positions of the mutant sequences, and the value in the label column is defined as 0 or 1; the label value of an amino acid sequence is defined as follows:
(1) calculating the proportion of the copy number of the amino acid sequence in the total copy number of the sequences of each panning round;
(2) calculating the difference between the proportions of the amino acid sequence in the data of the rounds before and after panning, and assigning labels according to this difference:
if the proportion of an amino acid sequence after panning is greater than its proportion before panning, its label value is defined as 1, a positive sample; otherwise 0, a negative sample; when the numbers of positive and negative samples are unequal, samples are selected from the two extremes of the maximum and minimum proportion difference toward the middle, so that the numbers of positive and negative samples become equal;
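The labelling and balancing rule above can be sketched in a few lines of Python. The dict-of-copy-numbers input format and the function name are illustrative assumptions.

```python
def build_classification_set(before, after):
    """before/after: dict mapping an amino acid sequence to its copy number
    in the round before/after panning.  Label 1 if the sequence's proportion
    rises after panning, else 0; balance the classes by taking samples from
    the extremes of the proportion difference toward the middle."""
    tot_b, tot_a = sum(before.values()), sum(after.values())
    rows = []
    for seq in set(before) | set(after):
        diff = after.get(seq, 0) / tot_a - before.get(seq, 0) / tot_b
        rows.append((seq, 1 if diff > 0 else 0, diff))
    pos = sorted((r for r in rows if r[1] == 1), key=lambda r: -r[2])
    neg = sorted((r for r in rows if r[1] == 0), key=lambda r: r[2])
    n = min(len(pos), len(neg))  # equal numbers of each class
    return [(s, y) for s, y, _ in pos[:n] + neg[:n]]
```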
the regression model dataset comprises two columns of data: a data column and a label column; the data column contains amino acid sequences consisting of the amino acids at the mutation positions of the mutant sequences; the value in the label column is defined as follows:
(1) calculating the proportion of the copy number of the amino acid sequence in the total copy number of each round of sequences;
(2) calculating the ratio of the proportion after panning to the proportion before panning, and taking log10 of the ratio; when a proportion is 0, the copy number of the sequence whose copy number is 0 is set to 10 and the proportion is recalculated;
(3) dividing the label values into two classes at the boundary 0 according to the log10 result, the number of sequences in the minority class being n and in the majority class being m; all minority-class sequences are included in the regression prediction model dataset, and the same number of sequences is selected from the majority class and also included; the selection method is: sort the m sequences by label value, divide them equally into n parts, randomly take one sequence from each part, and include the resulting n sequences in the regression prediction model dataset.
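The label computation and the stratified balancing of the regression dataset can be sketched as follows. The input format, function names, and fixed random seed are illustrative assumptions; the zero-copy floor of 10 is taken directly from the claim.

```python
import math
import random

def regression_labels(before, after, floor=10):
    """label = log10(proportion_after / proportion_before); a copy number of
    0 is replaced by `floor` (10, per the claim) before proportions are
    computed."""
    tot_b, tot_a = sum(before.values()), sum(after.values())
    labels = {}
    for seq in set(before) | set(after):
        cb = before.get(seq, 0) or floor
        ca = after.get(seq, 0) or floor
        labels[seq] = math.log10((ca / tot_a) / (cb / tot_b))
    return labels

def balance_regression(labels, seed=0):
    """Split at label 0; keep all n minority-class sequences, sort the
    majority class by label, cut it into n equal slices, and draw one
    sequence at random from each slice."""
    pos = [s for s, v in labels.items() if v > 0]
    neg = [s for s, v in labels.items() if v <= 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    majority = sorted(majority, key=labels.get)
    n, rng = len(minority), random.Random(seed)
    step = len(majority) / n
    picked = [majority[rng.randrange(int(i * step), int((i + 1) * step))]
              for i in range(n)]
    return minority + picked
```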
4. The method according to claim 1, wherein the classification prediction model and the regression prediction model in step S3 are constructed as follows:
(1) determining the feature encoding: the feature vectors generated after feature encoding are fed into the input layer; common protein encoding schemes are tested with the algorithm, the performance of the different feature encodings is evaluated by cross-validation, and the best-performing encoding is selected as the feature encoding of the dataset;
(2) constructing models comprising an input layer, convolutional layers, fully connected layers, and an output layer: the models differ only in the number of convolutional layers, which ranges from 2 to 6;
the kernel size of the first convolutional layer is fixed at 1 and that of the remaining convolutional layers at 3; the number of kernels in every convolutional layer is fixed at 64; a dropout layer with its parameter fixed at 0.5 is added after every convolutional layer;
the number of fully connected sublayers is fixed at 3; the last fully connected sublayer has 1 unit, the remaining fully connected sublayers have 32 units each and are each followed by a dropout layer with its parameter fixed at 0.5, and a global average pooling layer is placed before the last fully connected sublayer;
the output layer comprises one neuron and uses sigmoid as the activation function; the patience of the early-stopping strategy is fixed at 100, the batch_size value is fixed at 256, and the epoch parameter is fixed at 1000.
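The fixed architecture above can be written down as a plain-Python layer specification (framework-agnostic rather than, say, Keras code, so the sketch stays self-contained and the fixed hyperparameters are easy to check). The dict keys mirror common deep-learning layer names and are illustrative; the reading assumed here is that a dropout layer follows every convolutional layer and every hidden fully connected sublayer.

```python
def cnn_spec(n_conv):
    """Layer list for one candidate model; the model family differs only in
    the number of convolutional layers (2-6)."""
    assert 2 <= n_conv <= 6
    layers = []
    for i in range(n_conv):
        layers.append({"type": "Conv1D",
                       "filters": 64,                       # fixed at 64
                       "kernel_size": 1 if i == 0 else 3})  # first kernel 1, rest 3
        layers.append({"type": "Dropout", "rate": 0.5})
    # three fully connected sublayers: 32, 32, then the single-unit output,
    # with global average pooling placed before the last sublayer
    for _ in range(2):
        layers.append({"type": "Dense", "units": 32})
        layers.append({"type": "Dropout", "rate": 0.5})
    layers.append({"type": "GlobalAveragePooling1D"})
    layers.append({"type": "Dense", "units": 1, "activation": "sigmoid"})
    return layers

# Fixed training hyperparameters from the claim
TRAINING = {"early_stopping_patience": 100, "batch_size": 256, "epochs": 1000}
```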
5. The method according to claim 1, wherein in step S4, amino acids whose proportion at a position exceeds 0.05, together with the amino acids shown as enriched in the drawn Two-Sample-Logo plot, are defined as the enriched amino acids.
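The 5% frequency rule above, and the permute-and-combine step of S4 that consumes its output, can be sketched as follows. Function names are illustrative, and the additional Two-Sample-Logo criterion is not modelled.

```python
from collections import Counter
from itertools import product

def enriched_amino_acids(seqs, min_frac=0.05):
    """Per mutation position, the set of amino acids whose frequency among
    the retained sequences exceeds 5%."""
    n = len(seqs)
    return [{aa for aa, c in Counter(s[p] for s in seqs).items()
             if c / n > min_frac}
            for p in range(len(seqs[0]))]

def candidate_sequences(enriched):
    """Permute and combine the enriched amino acids position by position
    to generate the sequences to be predicted."""
    return {"".join(t) for t in product(*enriched)}
```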
6. A system for predicting antibody-like molecules that bind a biomacromolecule modification based on a protein structural domain, characterized by comprising a high-throughput sequencing data processing module, an antibody-like prediction model evaluation module, and a prediction sequence output module;
the high-throughput sequencing data processing module processes the high-throughput sequencing data, constructs the antibody-like classification model and regression model datasets, and submits the datasets to the antibody-like prediction model evaluation module;
the antibody-like prediction model evaluation module comprises the antibody-like classification prediction model and regression prediction model; it first trains the classification prediction model and the regression prediction model with the input classification model and regression model datasets, then feeds the sequences to be evaluated into both models and outputs the classification prediction model score and the regression prediction model score;
the prediction sequence output module screens the sequences according to the input classification and regression prediction model scores, retains the sequences whose scores are higher than that of the wild type, counts the enriched amino acids at each mutation position of these sequences, permutes and combines the amino acids in the order of the mutation positions to generate sequences to be predicted, calculates the conservation of the amino acid at each mutation position in the sequences to be predicted, and groups the sequences to be predicted according to the amino acid at the least conserved mutation position; the conservation of the amino acid at each mutation position is recalculated within each group, the most conserved amino acid at each position in each group is determined, and these amino acids are combined in positional order to form and output the predicted antibody-like sequence.
7. An antibody-like molecule that binds a peptide comprising a tyrosine phosphorylation modification, wherein the antibody-like molecule is predicted by the method of claim 1; the amino acid sequence of the antibody-like molecule is shown in any one of SEQ ID NO. 2-11.
8. A DNA sequence encoding the antibody-like molecule of claim 7.
9. A vector comprising the DNA sequence of claim 8.
10. Use of the antibody-like molecule of claim 7 in the preparation of a product for detecting a peptide comprising a tyrosine phosphorylation modification.
CN202311419536.2A 2023-10-30 2023-10-30 Method for predicting antibody-like body combined with biomacromolecule modification based on protein structural domain Pending CN117594114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311419536.2A CN117594114A (en) 2023-10-30 2023-10-30 Method for predicting antibody-like body combined with biomacromolecule modification based on protein structural domain


Publications (1)

Publication Number Publication Date
CN117594114A true CN117594114A (en) 2024-02-23

Family

ID=89908972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311419536.2A Pending CN117594114A (en) 2023-10-30 2023-10-30 Method for predicting antibody-like body combined with biomacromolecule modification based on protein structural domain

Country Status (1)

Country Link
CN (1) CN117594114A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination