CN115132278A - Method, device, equipment and storage medium for modifying antibody species

Info

Publication number: CN115132278A
Application number: CN202210599920.4A
Authority: CN
Inventors: 蒋彪彬; 许振雷; 刘伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2022-09-30
Anticipated expiration: 2042-05-27
Also published as: CN115132278B

Abstract

The embodiments of the present application disclose a method, a device, equipment and a storage medium for modifying antibody species. The embodiments can be applied to various scenarios such as artificial intelligence and improve the efficiency of antibody species modification. The method comprises: when the human-like index corresponding to an antibody sequence to be modified is smaller than an index threshold, masking the amino acid at each position in the antibody sequence to be modified in turn; obtaining, through a human antibody pre-training model, P first candidate human-like modified amino acids corresponding to each masking position; replacing the amino acid at each masking position with each of the P first candidate human-like modified amino acids corresponding to that position, to obtain P first candidate human-like modified antibody sequences; outputting, through a species discrimination model, the human-like index corresponding to each first candidate human-like modified antibody sequence; and, if the maximum value of the human-like index is greater than or equal to the index threshold, determining the first candidate human-like modified antibody sequence corresponding to that maximum value as the target modified antibody sequence.

Description

Method, device, equipment and storage medium for modifying antibody species

Technical Field

The embodiment of the application relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for modifying antibody species.

Background

With the development of modern biophysics, computational structural biology and information technology, humanized antibody design has emerged as a new trend and has become an important step in the development of therapeutic antibodies.

Mouse-derived antibodies are the most commonly used in the current development of therapeutic antibodies. However, when an antibody generated in a mouse is injected into the human body as a drug, it triggers a strong immune rejection reaction: the human body produces anti-drug antibodies (ADA), the mouse-derived antibody is cleared from the body, and the antibody drug cannot take effect. The humanized design of antibodies therefore plays an important role.

At present, the antibody species is generally predicted with a random forest model, and the antibody sequence is then searched and modified position by position with a greedy algorithm. However, the greedy algorithm used in the modification process must exhaustively try the amino acid substitutions at every position one by one, so humanizing an antibody takes a large amount of time and the efficiency of antibody species humanization is low.

Disclosure of Invention

The embodiments of the present application provide a method, a device, equipment and a storage medium for modifying antibody species. A human antibody pre-training model is used to quickly and accurately locate amino acid positions suitable for modification and to supply suitable candidate human-like modified amino acids for those positions. The amino acid substitutions at each position need not be exhaustively enumerated one by one at great time cost, so the efficiency and rationality of humanizing antibody species can be improved.

In one aspect, the embodiments of the present application provide a method for modifying antibody species, including:

inputting the antibody sequence to be modified into a species discrimination model, and outputting a human-like index corresponding to the antibody sequence to be modified through the species discrimination model;

when the human-like index corresponding to the antibody sequence to be modified is smaller than an index threshold, masking the amino acid at each position in the antibody sequence to be modified in turn, to obtain a first masked antibody sequence corresponding to each masking position;

inputting the first masked antibody sequence corresponding to each masking position into a human antibody pre-training model, and obtaining, through the human antibody pre-training model, P first candidate human-like modified amino acids corresponding to each masking position, wherein P is an integer greater than 0;

replacing the amino acid at each masking position with each of the P first candidate human-like modified amino acids corresponding to that masking position, to obtain P first candidate human-like modified antibody sequences corresponding to each masking position;

inputting the P first candidate human-like modified antibody sequences corresponding to each masking position into the species discrimination model, and outputting, through the species discrimination model, a human-like index corresponding to each first candidate human-like modified antibody sequence;

if the maximum value of the human-like index is greater than or equal to the index threshold, determining the first candidate human-like modified antibody sequence corresponding to the maximum value as a target modified antibody sequence, and marking the masking position corresponding to the target modified antibody sequence as a modified position;

and if the maximum value of the human-like index is smaller than the index threshold, returning to the step of masking the amino acid at each position in the antibody sequence to be modified in turn to obtain a first masked antibody sequence corresponding to each masking position, and repeating until the maximum value of the human-like index is greater than or equal to the index threshold.
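The steps above can be sketched as a greedy loop. This is a minimal illustration, not the applicant's implementation: `score` stands in for the species discrimination model, `propose` stands in for the human antibody pre-training model, and the mask symbol `X`, the `max_rounds` cap, the skipping of no-op substitutions and the toy models are all assumptions introduced here. The loop carries the best candidate forward and skips already modified positions, following the possible design described further on for the second round of masking.

```python
from typing import Callable, List, Set, Tuple

MASK = "X"  # hypothetical mask symbol, not specified in the source


def humanize(
    seq: str,
    score: Callable[[str], float],
    propose: Callable[[str, int, int], List[str]],
    threshold: float,
    p: int = 2,
    max_rounds: int = 50,
) -> Tuple[str, Set[int]]:
    """Greedy masking loop; returns the final sequence and modified positions."""
    modified: Set[int] = set()
    current = seq
    if score(current) >= threshold:
        return current, modified  # already human-like enough
    for _ in range(max_rounds):  # safety cap (assumption)
        best = None  # (human-like index, candidate sequence, position)
        for pos in range(len(current)):
            if pos in modified:
                continue  # already modified positions are not revisited
            masked = current[:pos] + MASK + current[pos + 1:]
            for aa in propose(masked, pos, p):
                if aa == current[pos]:
                    continue  # skip no-op substitutions (assumption)
                cand = current[:pos] + aa + current[pos + 1:]
                s = score(cand)
                if best is None or s > best[0]:
                    best = (s, cand, pos)
        if best is None:
            break  # nothing left to try
        best_score, current, best_pos = best
        modified.add(best_pos)
        if best_score >= threshold:
            break
    return current, modified


# Toy stand-ins for the two models (purely illustrative).
REF = "ACDE"


def toy_score(s: str) -> float:
    # Fraction of positions that match a made-up "human" reference.
    return sum(a == b for a, b in zip(s, REF)) / len(REF)


def toy_propose(masked: str, pos: int, p: int) -> List[str]:
    # Proposes the reference residue first, then a filler residue.
    return [REF[pos], "G"][:p]


result = humanize("AAAA", toy_score, toy_propose, threshold=0.75)
```

Each round scores at most P candidates per unmodified position, rather than every possible substitution at every position, which is where the claimed time saving over exhaustive search would come from.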

Another aspect of the present application provides a device for modifying antibody species, comprising:

the acquisition unit is used for inputting the antibody sequence to be modified into the species discrimination model and outputting the human-like index corresponding to the antibody sequence to be modified through the species discrimination model;

the processing unit is used for masking the amino acid at each position in the antibody sequence to be modified in turn when the human-like index corresponding to the antibody sequence to be modified is smaller than an index threshold, to obtain a first masked antibody sequence corresponding to each masking position;

the acquisition unit is further used for inputting the first masked antibody sequence corresponding to each masking position into a human antibody pre-training model, and obtaining, through the human antibody pre-training model, P first candidate human-like modified amino acids corresponding to each masking position, wherein P is an integer greater than 0;

the processing unit is further used for replacing the amino acid at each masking position with each of the P first candidate human-like modified amino acids corresponding to that masking position, to obtain P first candidate human-like modified antibody sequences corresponding to each masking position;

the acquisition unit is further used for inputting the P first candidate human-like modified antibody sequences corresponding to each masking position into the species discrimination model, and outputting, through the species discrimination model, a human-like index corresponding to each first candidate human-like modified antibody sequence;

a determining unit, used for determining, if the maximum value of the human-like index is greater than or equal to the index threshold, the first candidate human-like modified antibody sequence corresponding to the maximum value as a target modified antibody sequence, and marking the masking position corresponding to the target modified antibody sequence as a modified position;

and the determining unit is further used for returning, if the maximum value of the human-like index is smaller than the index threshold, to the step of masking the amino acid at each position in the antibody sequence to be modified in turn to obtain a first masked antibody sequence corresponding to each masking position, and repeating until the maximum value of the human-like index is greater than or equal to the index threshold.

In one possible design, in one implementation of another aspect of an embodiment of the present application,

the processing unit is further used for, if the maximum value of the human-like index is smaller than the index threshold, marking the masking position of the first candidate human-like modified antibody sequence corresponding to the maximum value as a modified position, and masking the amino acid at each position of that first candidate human-like modified antibody sequence in turn, to obtain a second masked antibody sequence corresponding to each masking position;

the acquisition unit is further used for inputting the second masked antibody sequence corresponding to each masking position into the human antibody pre-training model, and obtaining, through the human antibody pre-training model, P second candidate human-like modified amino acids corresponding to each masking position;

the processing unit is further used for, for each masking position other than the modified position, replacing the amino acid at the masking position with each of the P second candidate human-like modified amino acids, to obtain P second candidate human-like modified antibody sequences corresponding to each masking position;

the acquisition unit is further used for, for each masking position other than the modified position, inputting the P second candidate human-like modified antibody sequences into the species discrimination model, and outputting, through the species discrimination model, a human-like index corresponding to each second candidate human-like modified antibody sequence;

and the determining unit is further used for determining, if the maximum value of the human-like index is greater than or equal to the index threshold, the second candidate human-like modified antibody sequence corresponding to the maximum value as the target modified antibody sequence, and marking the masking position corresponding to the target modified antibody sequence as a modified position.

In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the acquisition unit may be specifically configured to:

inputting the first masked antibody sequence corresponding to each masking position into the human antibody pre-training model, and outputting, through the human antibody pre-training model, a first human-like modified amino acid set corresponding to each masking position;

and selecting P first candidate human-like modified amino acids from each first human-like modified amino acid set based on the modification probability value corresponding to each first human-like modified amino acid.

inputting the second masked antibody sequence corresponding to each masking position into the human antibody pre-training model, and outputting, through the human antibody pre-training model, a second human-like modified amino acid set corresponding to each masking position;

and selecting P second candidate human-like modified amino acids from each second human-like modified amino acid set based on the modification probability value corresponding to each second human-like modified amino acid.
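The selection step can be illustrated as follows. This is a minimal sketch that assumes "based on the modification probability value" means keeping the P amino acids with the highest probability; the function name `select_candidates` and the example probabilities are hypothetical.

```python
from typing import Dict, List


def select_candidates(probs: Dict[str, float], p: int) -> List[str]:
    """Return the p amino acids with the highest probability values."""
    # Assumption: candidates are ranked by descending probability.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [aa for aa, _ in ranked[:p]]


# Made-up probabilities for one masking position.
example = select_candidates({"A": 0.1, "S": 0.6, "T": 0.3}, p=2)
```

A small P keeps the number of candidate sequences per masking position low, at the cost of possibly missing a substitution that the species discrimination model would have scored highly.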

In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:

acquiring a human antibody sequence set;

inputting the human antibody sequence set into a basic human antibody pre-training model, and outputting, through the basic human antibody pre-training model, a human antibody vector corresponding to each human antibody sequence in the human antibody sequence set, an amino acid vector corresponding to each amino acid in each human antibody sequence, and a prediction probability value;

calculating a first predicted loss value based on the humanized antibody vector, the amino acid vector and the predicted probability value;

and adjusting parameters of the basic human antibody pre-training model based on the first predicted loss value to obtain a human antibody pre-training model.

inputting the sequence of the antibody to be modified into a species discrimination model, and acquiring a vector of the antibody to be modified corresponding to the sequence of the antibody to be modified through a language identification network;

and transmitting the antibody vector to be modified to a species discriminator, and outputting a human-like index through the species discriminator.

acquiring a candidate species tag antibody sequence set, wherein the candidate species tag antibody sequence set comprises candidate species tag antibody sequences and species tags, and the number of the candidate species tag antibody sequences corresponding to each species tag is the same;

inputting the candidate species tag antibody sequence set into the language identification network, and outputting, through the language identification network, a candidate species tag antibody vector corresponding to each candidate species tag antibody sequence, an amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence, and a species category probability value;

calculating a species classification loss value based on the candidate species tag antibody vector, the amino acid vector, and the species class probability value;

and adjusting parameters of the basic classifier based on the species classification loss value and the species label to obtain the species discriminator.

the acquisition unit is also used for acquiring a candidate unlabeled antibody sequence set;

and the processing unit is also used for pre-training the basic language identification network based on the candidate unlabeled antibody sequence set and the mask strategy to obtain the language identification network.

respectively performing high-frequency mask learning on the variable region of each candidate unlabeled antibody sequence based on a first mask learning rate and performing low-frequency mask learning on the non-variable region of each candidate unlabeled antibody sequence based on a second mask learning rate through a basic language identification network, wherein the first mask learning rate is greater than the second mask learning rate;

calculating the mask learning error rate of the basic language identification network to the candidate unlabeled antibody sequence set;

when the mask learning error rate is less than or equal to the learning threshold, determining the current underlying language identification network as the language identification network.
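The region-dependent masking and the stopping criterion can be sketched as below. This assumes that the "mask learning rate" is the probability of masking a position, that variable-region positions are supplied as a set of indices, and that the error rate is the fraction of masked residues predicted incorrectly; `mask_by_region`, `error_rate` and the mask symbol `X` are hypothetical names.

```python
import random
from typing import List, Set, Tuple

MASK = "X"  # hypothetical mask symbol


def mask_by_region(
    seq: str,
    variable_positions: Set[int],
    high_rate: float,
    low_rate: float,
    rng: random.Random,
) -> Tuple[str, List[int]]:
    """Mask variable-region positions more often than non-variable ones."""
    if not high_rate > low_rate:
        raise ValueError("high_rate must be greater than low_rate")
    masked = list(seq)
    targets: List[int] = []
    for i in range(len(seq)):
        rate = high_rate if i in variable_positions else low_rate
        if rng.random() < rate:
            masked[i] = MASK
            targets.append(i)
    return "".join(masked), targets


def error_rate(predicted: str, actual: str) -> float:
    """Fraction of masked residues that were predicted incorrectly."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual) if actual else 0.0


# Extreme rates make the behaviour easy to see: only positions 2 and 3 are masked.
masked_seq, masked_idx = mask_by_region("ACDEFG", {2, 3}, 1.0, 0.0, random.Random(0))
```

Training would then stop once `error_rate` over the candidate unlabeled antibody sequence set falls to or below the learning threshold.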

downloading a set of unlabeled antibody sequences from an antibody database;

and screening candidate unlabeled antibody sequences from the unlabeled antibody sequence set based on the occurrence frequency of each unlabeled antibody sequence in the unlabeled antibody sequence set, to obtain the candidate unlabeled antibody sequence set.
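Frequency-based screening can be sketched with a simple counter. The source does not state the screening rule, so keeping sequences that occur at least `min_count` times is an assumption, as are the function name and the default value.

```python
from collections import Counter
from typing import Iterable, List


def screen_by_frequency(seqs: Iterable[str], min_count: int = 2) -> List[str]:
    """Keep each distinct sequence that occurs at least min_count times."""
    counts = Counter(seqs)
    # Assumption: rarely observed sequences are dropped.
    return [s for s, n in counts.items() if n >= min_count]


kept = screen_by_frequency(["AB", "CD", "AB"])
```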

Another aspect of the application provides a computer device, including: a memory, a processor, and a bus system;

wherein, the memory is used for storing programs;

the processor, when executing the program in the memory, implements the methods as described above;

the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.

Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.

According to the technical scheme, the embodiment of the application has the following beneficial effects:

The species discrimination model outputs a human-like index corresponding to the antibody sequence to be modified. When the human-like index is smaller than the index threshold, the amino acid at each position in the antibody sequence to be modified is masked in turn to obtain a first masked antibody sequence corresponding to each masking position. P first candidate human-like modified amino acids corresponding to each masking position are then obtained through the human antibody pre-training model, and the amino acid at each masking position is replaced with each of them to obtain P first candidate human-like modified antibody sequences corresponding to each masking position. The species discrimination model then outputs the human-like index corresponding to each first candidate human-like modified antibody sequence. When the maximum value of the human-like index is greater than or equal to the index threshold, the first candidate human-like modified antibody sequence corresponding to the maximum value is determined as the target modified antibody sequence, and the masking position corresponding to the target modified antibody sequence is marked as a modified position. When the maximum value of the human-like index is smaller than the index threshold, the method returns to the masking step and repeats until the maximum value of the human-like index is greater than or equal to the index threshold.
In this way, combining the P first candidate human-like modified amino acids obtained for each masking position through the human antibody pre-training model with the human-like index of each first candidate human-like modified antibody sequence allows amino acid positions suitable for modification to be located quickly and accurately and modified with suitable candidate human-like modified amino acids. The amino acid substitutions at each position need not be exhaustively enumerated one by one at great time cost, so the efficiency and rationality of humanizing antibody species can be improved.

Drawings

FIG. 1 is a schematic diagram of an architecture of an antibody species control system in an embodiment of the present application;

FIG. 2 is a flow chart of one embodiment of a method for engineering antibody species in the examples of the present application;

FIG. 3 is a flow chart of another embodiment of a method of engineering antibody species in the examples of the present application;

FIG. 4 is a flow chart of another embodiment of a method of engineering an antibody species in the examples of the present application;

FIG. 5 is a flow chart of another embodiment of a method of engineering an antibody species in the examples of the present application;

FIG. 6 is a flow chart of another embodiment of a method of engineering an antibody species in the examples of the present application;

FIG. 7 is a flow chart of another embodiment of a method of engineering an antibody species in the examples of the present application;

FIG. 8 is a flow chart of another example of a method for engineering an antibody species in the examples of the present application;

FIG. 9 is a flow chart of another example of a method for engineering an antibody species in the examples of the present application;

FIG. 10 is a flow chart of another embodiment of a method of engineering antibody species in the examples of the present application;

FIG. 11 is a flow chart of another embodiment of a method of engineering an antibody species in the examples of the present application;

FIG. 12 is a schematic flow chart of one embodiment of the method for engineering antibody species in the present application;

FIG. 13 is a schematic diagram of the heavy chain and light chain of an antibody Fv domain in the method for modifying antibody species in the embodiments of the present application;

FIG. 14(a) is a schematic diagram showing the difference in antibody species in the method for engineering the antibody species in the examples of the present application;

FIG. 14(b) is a schematic diagram of the engineering of antibody species for the methods of engineering antibody species in the examples of the present application;

FIG. 15 is a schematic representation of an antibody species prediction interface of a method of engineering antibody species in the examples of the present application;

FIG. 16 is a schematic representation of an antibody species engineering interface of a method of engineering antibody species in the examples of the present application;

FIG. 17 is a schematic diagram of a prediction of antibody drug species for a method of engineering antibody species in the examples of the present application;

FIG. 18(a) is a schematic representation of the correlation evaluation of a human-like index with the immunogenicity of an antibody drug in a method for engineering antibody species in the examples of the present application;

FIG. 18(b) is another schematic diagram of the correlation evaluation of the human-like index with the immunogenicity of antibody drugs in the method for modifying antibody species in the embodiments of the present application;

FIG. 19 is a schematic diagram of antibody species modification in a low-dimensional space in the method for modifying antibody species in the embodiments of the present application;

FIG. 20 is a schematic view of one embodiment of the apparatus for engineering antibody species in the examples of the present application;

FIG. 21 is a schematic diagram of an embodiment of a computer device in the embodiment of the present application.

Detailed Description

The terms "first," "second," "third," "fourth," and the like in the description and in the claims and drawings of the present application, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Some concepts related to the embodiments of the present application are described below.

1. Antigens

Antigens are molecules that can induce an immune response in an organism. An antigen may come from outside the organism, such as a protein on the novel coronavirus, or may be a mutated protein produced in vivo, for example by a tumor cell.

2. Antibodies

Antibodies are protein molecules that are produced by B cells in an organism upon stimulation by an antigen and that bind to the antigen to trigger an immune response that eliminates it. As shown in FIG. 13, the sequence of an antibody protein is built from the 20 kinds of amino acids and folds into a macromolecule with a 3D structure and biological activity. The most common antibodies consist of two identical heavy chains and two identical light chains. The domain responsible for antigen binding is called the Fv region; the corresponding heavy chain fragment is labeled VH and is about 120 amino acids long, and the corresponding light chain fragment is labeled VL and is about 110 amino acids long.

3. Variable region

The variable region, also called the complementarity-determining region (CDR), is shown in FIG. 13. It is the region where the antibody potentially binds the antigen and has a highly flexible structure. B cells producing different antibodies can change the amino acids of this region through VDJ gene recombination and somatic hypermutation, thereby enhancing the binding capacity to the antigen. The heavy and light chains each have three CDR regions, designated CDR-H1 (or CDR-L1), CDR-H2 (or CDR-L2) and CDR-H3 (or CDR-L3).

4. Non-variable region

The non-variable region, also called the framework region (FWR), is shown in FIG. 13. It is the structural framework of the antibody; it is structurally stable and supports the whole molecule. The non-variable regions of different antibodies are highly similar, and their sequences and structures are highly conserved in evolution and do not change easily, hence the name. The heavy and light chains each have four FWR regions, labeled FW-H1 (or FW-L1), FW-H2 (or FW-L2), FW-H3 (or FW-L3) and FW-H4 (or FW-L4).

5. Antibody-encoding gene

The heavy chain Fv region of an antibody is encoded by V, D and J genes, whereas the light chain is encoded only by V and J genes.

6. Natural language model

A natural language model uses a statistical model to convert large amounts of human language text into a machine representation so that the text can be recognized, understood and generated. Specific applications include machine translation, automatic question answering and the like.

7. Pre-training

Pre-training means training a language model on a large amount of unlabeled text to obtain a set of model parameters, initializing the model with these parameters to give it a warm start, and then fine-tuning the parameters on the existing language model architecture for a specific task so as to fit the labeled data provided by that task. Pre-training has been proven to work well in both classification and labeling tasks in natural language processing.

8. Species of antibody origin

Antibodies are macromolecular proteins produced in an organism by antigenic stimulation. Antibodies are usually produced by most higher organisms, and the sequences of different species are different, and as shown in fig. 14(a), common antibodies are derived from humans, mice (mouse), rats (rat), monkeys, alpacas, chickens, rabbits, and sharks.

9. Antibody immunogenicity

In the development of antibody drugs, animals are usually immunized and the antibodies are then extracted and isolated. Mouse-derived antibodies are the most commonly used. However, when an antibody generated in a mouse is injected into the human body as a drug, it triggers a strong immune rejection reaction: the human body produces anti-drug antibodies (ADA), the mouse-derived antibody is cleared from the body, and the antibody drug cannot take effect.

10. Antibody humanization

Antibody humanization is a process, shown in FIG. 14(b), that reduces the immune rejection of an antibody in the human body by modifying amino acids in the sequence of a non-human antibody, so that the human immune system no longer recognizes the antibody as foreign.

11. Antibody human-like index (humanness)

The antibody human-like index is the probability that an antibody sequence is of human origin; it ranges from 0 to 1.
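One common way to obtain such a probability from a classifier is a softmax over per-species scores. The sketch below is only an illustration of that general idea; the source does not state that the species discrimination model uses a softmax, and the species labels and logits are made up.

```python
import math
from typing import Dict


def human_like_index(logits: Dict[str, float], human_label: str = "human") -> float:
    """Softmax probability of the human class, a value between 0 and 1."""
    m = max(logits.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    return exps[human_label] / sum(exps.values())


# Equal scores for two species give an index of 0.5.
idx = human_like_index({"human": 0.0, "mouse": 0.0})
```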

It will be appreciated that the embodiments of the present application involve data such as antibody sequences and antibody sequence sets from various species. When the above embodiments are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

It is understood that the methods for engineering antibody species as disclosed herein involve Artificial Intelligence (AI) techniques, which are further described below. Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning and the like.

Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.

Machine Learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.

With the research and development of artificial intelligence technology, the artificial intelligence technology is developed and researched in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service and the like.

It should be understood that the antibody species modification method provided herein can be applied to various scenarios, including but not limited to cloud technology and artificial intelligence. It is used to obtain a humanized antibody by modifying the antibody species, for use in scenarios such as antibody drug experiments or development.

In order to solve the above problems, the present application provides a method for modifying an antibody species, which is applied to the antibody species control system shown in fig. 1. Referring to fig. 1, which is a schematic structural diagram of the antibody species control system in the embodiment of the present application, the server outputs, through a species discrimination model, a human-like index corresponding to an antibody sequence to be modified that is provided by a terminal device. When the human-like index is smaller than an index threshold, the server sequentially masks the amino acid at each position in the antibody sequence to be modified to obtain a first masking antibody sequence corresponding to each masking position. The server may then obtain P first candidate human-like modified amino acids corresponding to each masking position through a human antibody pre-training model, and replace the amino acid at each masking position with each of the P first candidate human-like modified amino acids to obtain P first candidate human-like modified antibody sequences, whose human-like indexes are output through the species discrimination model. When the maximum value of the human-like index is greater than or equal to the index threshold, the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index is determined as the target modified antibody sequence, and the masking position corresponding to the target modified antibody sequence is marked as a modified position. In this way, P first candidate human-like modified amino acids corresponding to each masking position can be obtained through the human antibody pre-training model and, combined with the human-like index corresponding to each first candidate human-like modified antibody sequence, the amino acid positions suitable for modification can be located quickly and accurately and modified with suitable candidate amino acids. There is no need to spend a large amount of time exhaustively trying amino acid substitutions at every position, so the efficiency and rationality of antibody species humanization can be improved.

It can be understood that fig. 1 only shows one terminal device, and in an actual scene, a greater variety of terminal devices may participate in the data processing process, where the terminal devices include, but are not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, and the specific number and variety depend on the actual scene, and are not limited herein. In addition, fig. 1 shows one server, but in an actual scenario, a plurality of servers may participate, and particularly in a scenario of multi-model training interaction, the number of servers depends on the actual scenario, and is not limited herein.

It should be noted that in this embodiment, the server may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and an artificial intelligence platform, and the like. The terminal device and the server may be directly or indirectly connected through a wired or wireless communication manner, and the terminal device and the server may be connected to form a block chain network, which is not limited herein.

In conjunction with the above description, the method for modifying antibody species in the present application will be described below, and referring to fig. 2, one embodiment of the method for modifying antibody species in the present application comprises:

in step S101, the antibody sequence to be modified is input to a species discrimination model, and a human-like index corresponding to the antibody sequence to be modified is output through the species discrimination model;

in this embodiment, as shown in fig. 12, a non-human antibody sequence or a human antibody sequence may be collected by a target object as the antibody sequence to be modified. The obtained antibody sequence to be modified may then be input into a pre-trained species discrimination model for species prediction, and a human-like index corresponding to the antibody sequence to be modified is output through the species discrimination model.

The target object may be, for example, a pharmaceutical factory or a worker in an antibody research and development laboratory, and is not particularly limited herein. The antibody sequence to be modified may be a non-human antibody sequence obtained by immunizing an animal, or may be a human antibody sequence, which is not specifically limited herein. The species discrimination model may be a model obtained by pre-training based on the natural language model BERT as illustrated in fig. 12, or may be another model, which is not specifically limited herein; it is used to perform classification prediction on the antibody sequence to be modified. The human-like index is the predicted probability that the antibody sequence to be modified is derived from a human, and ranges from 0 to 1.

In particular, a target object (e.g., a pharmaceutical factory) can collect and download a series of antibody sequences to be engineered from a natural antibody library, and submit an antibody sequence to be modified to the antibody species prediction interface shown in fig. 15. By clicking the submit button, the target object triggers classification prediction on the submitted sequence, such as six-class prediction over human (human), camel (camel), rat (rat), rhesus monkey (rhesus), mouse (mouse) and non-antibody (not antibodies). The prediction result for each species is then output. As shown in fig. 15, the probability value displayed beside each species is the probability that the antibody sequence to be modified belongs to that species; for example, the probability beside human is the human-like index. The species source of the antibody sequence to be modified can thus be determined based on its human-like index.
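The relationship between the six-class prediction and the human-like index can be sketched as follows. This is a minimal illustration only: it assumes the classifier ends in a softmax layer (the text does not say so), and the class order, the `human_like_index` helper and the example logits are hypothetical.

```python
import math

# Class labels as listed in the text; their order here is an assumption.
SPECIES = ["human", "camel", "rat", "rhesus", "mouse", "not_antibody"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def human_like_index(logits):
    """Return the probability assigned to the 'human' class."""
    return softmax(logits)[SPECIES.index("human")]

# Hypothetical logits for one sequence; a real model would produce these.
example_logits = [0.2, 0.1, 0.4, 0.0, 2.5, -1.0]
species_probs = dict(zip(SPECIES, softmax(example_logits)))
predicted_species = max(species_probs, key=species_probs.get)
```

With these made-up logits the highest probability falls on the mouse class, so the human-like index is low and the sequence would be a candidate for humanization.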

In step S102, when the human-like index corresponding to the antibody sequence to be modified is smaller than the index threshold, masking the amino acids at each position in the antibody sequence to be modified in sequence to obtain a first masked antibody sequence corresponding to each masked position;

in this embodiment, after the human-like index corresponding to the antibody sequence to be modified is obtained, when the human-like index is smaller than an index threshold, it may be understood that the species source of the antibody sequence to be modified is a non-human antibody sequence, amino acids at each position in the antibody sequence to be modified may be sequentially masked, so as to obtain a first masked antibody sequence corresponding to each masked position, so that the amino acid position suitable for modification may be further predicted based on the first masked antibody sequence, thereby improving the accuracy of antibody species humanization modification to a certain extent.

The index threshold is a parameter set according to actual application requirements and is used for judging whether the antibody sequence to be modified is a human antibody sequence. Its specific value is not limited here; for example, it can be set to 0.999.

Specifically, after the human-like index corresponding to the antibody sequence to be modified is obtained, whether the antibody sequence needs humanized modification can be determined by judging whether the human-like index meets the humanization condition. Specifically, the antibody sequence to be modified can be submitted to the antibody species modification interface shown in fig. 16, and its human-like index is compared with a preset index threshold. When the human-like index is smaller than the index threshold, it can be understood that the antibody sequence to be modified is of non-human origin, and the amino acids at each position in the antibody sequence to be modified can then be sequentially masked to obtain a first masked antibody sequence corresponding to each masked position.

For example, suppose the antibody sequence to be engineered is AAA and its human-like index is 0.3. The amino acids at each position may be masked in turn, starting from the first A: masking the first A yields the first masked antibody sequence XAA, and masking the other positions in turn yields the first masked antibody sequences AXA and AAX.
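The masking step in this example can be sketched as below. The function names and the use of 0-based positions are illustrative assumptions; the mask symbol X and the threshold 0.999 follow the examples in the text.

```python
INDEX_THRESHOLD = 0.999  # example value given in the text
MASK = "X"               # placeholder symbol used in the text's example

def needs_engineering(human_like_index, threshold=INDEX_THRESHOLD):
    """A sequence is engineered only when its index is below the threshold."""
    return human_like_index < threshold

def mask_each_position(sequence, mask=MASK):
    """Return (position, masked_sequence) pairs, one per residue, left to right."""
    return [
        (i, sequence[:i] + mask + sequence[i + 1:])
        for i in range(len(sequence))
    ]

# Toy input from the text: sequence AAA with a human-like index of 0.3.
masked = mask_each_position("AAA") if needs_engineering(0.3) else []
```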

Further, when the human-like index is greater than or equal to the index threshold, it can be understood that the species source of the antibody sequence to be engineered is a human antibody sequence, and the antibody sequence to be engineered does not need to be engineered.

In step S103, inputting the first masking antibody sequence corresponding to each masking position into a human antibody pre-training model, and obtaining P first candidate human-like modified amino acids corresponding to each masking position through the human antibody pre-training model, where P is an integer greater than 0;

in this embodiment, after the first masking antibody sequence corresponding to each masking position is obtained, the first masking antibody sequence corresponding to each masking position may be respectively input into a human antibody pre-training model, and the human amino acid prediction is performed on the first masking antibody sequence by the human antibody pre-training model to predict amino acids suitable for modification at each masking position, that is, P first candidate human-like modified amino acids.

The human antibody pre-training model may be a model obtained by pre-training based on the natural language model BERT, or may be another model, which is not particularly limited herein.

Specifically, inputting the first masking antibody sequence corresponding to each masking position into the human antibody pre-training model and obtaining the P first candidate human-like modified amino acids corresponding to each masking position may be done as follows: a first human-like modified amino acid set corresponding to each masking position is output through the human antibody pre-training model, and P first candidate human-like modified amino acids are then selected from each first human-like modified amino acid set based on the modification probability value corresponding to each first human-like modified amino acid in the set.

For example, assuming that a first masking antibody sequence is XAA, a plurality of human amino acids that can be substituted at the masking position X can be predicted by predicting the XAA sequence with a human antibody pre-training model, and then P (e.g., 3) first candidate human modified amino acids such as B, C and D having a larger modification probability value can be selected from the plurality of human amino acids based on the modification probability value corresponding to each first human modified amino acid.
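Selecting the P candidates with the largest modification probability values can be sketched as a simple ranking. The probability values below are invented for illustration, and `top_candidates` is a hypothetical helper, not a function named in the application.

```python
def top_candidates(probabilities, p):
    """Return the p amino acids with the highest modification probability."""
    if p <= 0:
        raise ValueError("P must be an integer greater than 0")
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [amino_acid for amino_acid, _ in ranked[:p]]

# Hypothetical model output for the masked position X in XAA.
predicted = {"B": 0.40, "C": 0.30, "D": 0.15, "E": 0.10, "F": 0.05}
candidates = top_candidates(predicted, 3)
```

Here P is a count of candidates to keep, so the selection is a plain top-P-by-probability cut rather than any cumulative-probability scheme.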

In step S104, the P first candidate human engineered amino acids corresponding to each masking position are used to replace the amino acids at the masking position, respectively, so as to obtain P first candidate human engineered antibody sequences corresponding to each masking position;

in this embodiment, after obtaining P first candidate human-like engineered amino acids corresponding to each mask position, the amino acids at the mask positions may be replaced by the P first candidate human-like engineered amino acids corresponding to each mask position, respectively, to obtain P first candidate human-like engineered antibody sequences corresponding to each mask position after replacement.

Specifically, after the P first candidate human engineered amino acids corresponding to each mask position are obtained, for example the P (e.g., 3) amino acids B, C and D with the highest engineering probability values, the 3 first candidate human engineered amino acids can each be used to replace the mask X in the first mask antibody sequence XAA. This yields the 3 first candidate human engineered antibody sequences corresponding to that mask position: BAA, CAA and DAA.
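The replacement step can be sketched as follows, using the toy sequence from the text. The `substitute` helper and the 0-based position argument are assumptions made for illustration.

```python
def substitute(masked_sequence, position, candidate_amino_acids):
    """Replace the masked residue at `position` with each candidate in turn."""
    return [
        masked_sequence[:position] + amino_acid + masked_sequence[position + 1:]
        for amino_acid in candidate_amino_acids
    ]

# Masked sequence XAA with the three candidates B, C and D from the text.
first_candidates = substitute("XAA", 0, ["B", "C", "D"])
```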

In step S105, the P first candidate human-like engineered antibody sequences corresponding to each masking position are respectively input to a species discrimination model, and a human-like index corresponding to each first candidate human-like engineered antibody sequence is output through the species discrimination model;

in this embodiment, after the P first candidate human-like engineered antibody sequences corresponding to each masking position are obtained, the P first candidate human-like engineered antibody sequences corresponding to each masking position may be respectively input to a species discrimination model for species prediction, and a human-like index corresponding to each first candidate human-like engineered antibody sequence is output through the species discrimination model.

Specifically, after the P first candidate human-like modified antibody sequences corresponding to each masking position are obtained, they may be respectively input to the species discrimination model for species prediction to obtain a human-like index corresponding to each first candidate human-like modified antibody sequence. This is done in a manner similar to step S101, in which the antibody sequence to be modified is input to the species discrimination model and its human-like index is output, and is not described again here.

For example, assume that the 3 first candidate human-like engineered antibody sequences BAA, CAA and DAA corresponding to the masking position X are input to the species discrimination model, which outputs human-like indexes of 0.9001, 0.9991 and 0.8005 respectively. Similarly, assume that the 3 first candidate human-like engineered antibody sequences ABA, AEA and AFA corresponding to another masking position receive human-like indexes of 0.8771, 0.9553 and 0.8664 respectively, and that the 3 first candidate human-like engineered antibody sequences AAC, AAR and AAD corresponding to the remaining masking position receive human-like indexes of 0.8121, 0.9442 and 0.8999 respectively.

In step S106, if the maximum value of the human-like index is greater than or equal to the index threshold, the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index is determined as the target engineered antibody sequence, and the mask position corresponding to the target engineered antibody sequence is marked as the engineered position.

In this embodiment, after the human-like index corresponding to each first candidate human-like modified antibody sequence is obtained, when the maximum value of the human-like index is greater than or equal to the index threshold, it may be understood that the species source of the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index is a human antibody sequence, and no modification is needed, the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index may be determined as the target modified antibody sequence, and the masked position corresponding to the target modified antibody sequence is marked as a modified position, so that the target object may accurately sense the amino acid position where modification occurs.

Specifically, after the human-like index corresponding to each first candidate human-like engineered antibody sequence is obtained, whether further humanization is needed can be determined by judging whether the maximum value of these human-like indexes meets the humanization condition. Specifically, the maximum value can be obtained by comparing the human-like indexes corresponding to the first candidate human-like engineered antibody sequences in pairs, and the obtained maximum value can then be compared with the index threshold. When the maximum value of the human-like index is greater than or equal to the index threshold, it can be understood that the corresponding first candidate human-like engineered antibody sequence is of human origin; this sequence can then be determined as the target engineered antibody sequence, and the masking position corresponding to the target engineered antibody sequence is marked as an engineered position. The target engineered antibody sequence, the engineered position and the corresponding human-like index can be displayed in real time through the antibody species modification interface shown in fig. 16, which completes the humanization of the antibody sequence to be modified. Because the interface reflects the modification status in real time, the target object can see the effect of the modification promptly, which improves user experience.

For example, assume that the human-like indexes corresponding to the first candidate human-like engineered antibody sequences BAA, CAA, DAA, ABA, AEA, AFA, AAC, AAR and AAD are 0.9001, 0.9991, 0.8005, 0.8771, 0.9553, 0.8664, 0.8121, 0.9442 and 0.8999 respectively. Pairwise comparison shows that the maximum value is 0.9991. Since 0.9991 is greater than the preset index threshold of 0.999, the first candidate human-like engineered antibody sequence CAA can be determined as the target engineered antibody sequence, and the masking position corresponding to it, i.e. the first position, now occupied by C, is marked as the engineered position.
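The selection in this example can be reproduced with the numbers from the text. The tuple layout, the 0-based positions and the `select_target` helper are illustrative assumptions.

```python
INDEX_THRESHOLD = 0.999  # example value given in the text

# (candidate sequence, 0-based masking position, human-like index)
scored = [
    ("BAA", 0, 0.9001), ("CAA", 0, 0.9991), ("DAA", 0, 0.8005),
    ("ABA", 1, 0.8771), ("AEA", 1, 0.9553), ("AFA", 1, 0.8664),
    ("AAC", 2, 0.8121), ("AAR", 2, 0.9442), ("AAD", 2, 0.8999),
]

def select_target(scored_candidates, threshold=INDEX_THRESHOLD):
    """Return (sequence, position, index, accepted) for the best candidate."""
    sequence, position, index = max(scored_candidates, key=lambda item: item[2])
    return sequence, position, index, index >= threshold

target = select_target(scored)
```

The best candidate is CAA at the first position with an index of 0.9991, which clears the threshold, so it is accepted as the target engineered antibody sequence.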

In step S107, if the maximum value of the human-like index is smaller than the index threshold, the step of sequentially masking the amino acids at each position in the antibody sequence to be modified is repeated to obtain the first masked antibody sequence corresponding to each masked position, and the step is continuously performed until the maximum value of the human-like index is greater than or equal to the index threshold.

In this embodiment, after the human-like index corresponding to each first candidate human-like engineered antibody sequence is obtained, when the maximum value of the human-like index is smaller than the index threshold, it may be understood that the species source of the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index is still a non-human antibody sequence, and the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index still needs to be engineered, and then the step of sequentially masking the amino acids at each position in the antibody sequence to be engineered may be repeated, and the step of obtaining the first masked antibody sequence corresponding to each masked position is continuously performed until the maximum value of the human-like index is greater than or equal to the index threshold.

Specifically, after the human-like index corresponding to each first candidate human-like modified antibody sequence is obtained, whether the first candidate human-like modified antibody sequence needs to be subjected to humanized modification or not can be determined by judging whether the maximum numerical value of the human-like index corresponding to each first candidate human-like modified antibody sequence meets the humanized condition or not, and when the maximum numerical value of the human-like index is smaller than an index threshold value, it can be understood that the first candidate human-like modified antibody sequence corresponding to the maximum numerical value of the human-like index needs to be further modified.

Further, the step of sequentially masking the amino acid at each position in the antibody sequence to be modified to obtain the first masked antibody sequence corresponding to each masked position may be repeated, so that antibody species humanization is further performed on the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index, until the maximum value of the human-like index is greater than or equal to the index threshold. Specifically, the masked position of the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index may be marked as a modified position, and the amino acid at each position of that sequence may be sequentially masked to obtain the second masked antibody sequence corresponding to each masked position. The second masked antibody sequence corresponding to each masked position may then be input to the human antibody pre-training model to obtain P second candidate human-like modified amino acids corresponding to each masked position. For each masked position other than the modified positions, the amino acid at the masked position is replaced with each of the P second candidate human-like modified amino acids to obtain P second candidate human-like modified antibody sequences corresponding to that masked position; these sequences are respectively input to the species discrimination model, which outputs a human-like index corresponding to each second candidate human-like modified antibody sequence. When the maximum human-like index among the second candidate human-like modified antibody sequences is greater than or equal to the index threshold, the second candidate human-like modified antibody sequence corresponding to that maximum is determined as the target modified antibody sequence, and the masked position corresponding to the target modified antibody sequence is marked as a modified position.

Conversely, if the maximum value of the human-like index corresponding to the second candidate human-like engineered antibody sequences is less than the index threshold, antibody species humanization may be further performed on the second candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index, until the maximum value of the human-like index is greater than or equal to the index threshold.
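The repeated procedure of steps S101 to S107 can be summarized as a greedy loop. The sketch below is an illustration under stated assumptions, not the application's implementation: both models are replaced by toy stand-ins (`toy_score` and `toy_propose`), the `max_rounds` safeguard is an addition not mentioned in the text, and positions that are already engineered are skipped before masking rather than after, which yields the same candidates.

```python
from typing import Callable, Dict, Set, Tuple

def humanize(
    sequence: str,
    score: Callable[[str], float],
    propose: Callable[[str, int], Dict[str, float]],
    p: int = 3,
    threshold: float = 0.999,
    max_rounds: int = 50,  # safeguard; an assumption, not part of the text
) -> Tuple[str, Set[int]]:
    """Apply one substitution per round until the human-like index >= threshold."""
    engineered: Set[int] = set()
    current = sequence
    if score(current) >= threshold:
        return current, engineered  # already human-like, nothing to do
    for _ in range(max_rounds):
        best = None  # (index, candidate sequence, position)
        for pos in range(len(current)):
            if pos in engineered:
                continue  # engineered positions are not replaced again
            masked = current[:pos] + "X" + current[pos + 1:]
            probs = propose(masked, pos)
            for amino_acid in sorted(probs, key=probs.get, reverse=True)[:p]:
                candidate = current[:pos] + amino_acid + current[pos + 1:]
                index = score(candidate)
                if best is None or index > best[0]:
                    best = (index, candidate, pos)
        if best is None:
            break  # every position has already been engineered
        best_index, current, best_pos = best
        engineered.add(best_pos)
        if best_index >= threshold:
            break
    return current, engineered

# Toy stand-ins: the score is the fraction of residues equal to "H", and the
# proposal model always suggests the same three amino acids.
def toy_score(seq: str) -> float:
    return seq.count("H") / len(seq)

def toy_propose(masked: str, position: int) -> Dict[str, float]:
    return {"H": 0.7, "K": 0.2, "R": 0.1}

result = humanize("AAA", toy_score, toy_propose)
```

Each round scores at most P candidates per remaining position, instead of every possible amino acid at every position, which is the source of the efficiency gain the text describes.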

The embodiment of the application provides a method for modifying the antibody species. In the above manner, P first candidate human-like modified amino acids corresponding to each masking position can be obtained through the human antibody pre-training model and, combined with the human-like index corresponding to each first candidate human-like modified antibody sequence, the amino acid positions suitable for modification can be located quickly and accurately and modified with suitable candidate amino acids. There is no need to spend a large amount of time exhaustively trying amino acid substitutions at every position, so the efficiency and rationality of antibody species humanization can be improved.

Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for modifying an antibody species provided in the embodiment of the present application, as shown in fig. 3, step S107, in which, if the maximum value of the human-like index is smaller than the index threshold, the step of sequentially masking the amino acids at each position in the antibody sequence to be modified to obtain the first masked antibody sequence corresponding to each masked position is repeated until the maximum value of the human-like index is greater than or equal to the index threshold, includes:

in step S301, if the maximum value of the human-like index is smaller than the index threshold, the masking position of the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index is marked as an engineered position, and amino acids at each position of the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index are sequentially masked to obtain a second masking antibody sequence corresponding to each masking position;

in this embodiment, after the human-like index corresponding to each first candidate human-like modified antibody sequence is obtained, when the maximum value of the human-like index is smaller than the index threshold, it may be understood that the species source of the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index is still the non-human antibody sequence, and the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index needs to be further modified, then the amino acids at each position of the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index may be sequentially masked, so as to obtain the second masked antibody sequence corresponding to each masked position.

Specifically, after the human-like index corresponding to each first candidate human-like engineered antibody sequence is obtained, whether further humanization is needed may be determined by judging whether the maximum value of these human-like indexes meets the humanization condition. Specifically, the human-like indexes are compared in pairs to obtain the maximum value, which is then compared with the index threshold. When the maximum value of the human-like index is less than the index threshold, it may be understood that the corresponding first candidate human-like engineered antibody sequence is still of non-human origin and needs to be further engineered. Meanwhile, this sequence can be taken as the most suitable engineered antibody sequence of the current step; that is, its masking position is taken as the most suitable amino acid position to engineer in the current step and is marked as an engineered position.

Further, the first candidate human-like engineered antibody sequence corresponding to the current maximum value of the human-like index, the engineered position and the corresponding human-like index can be displayed in real time through the antibody species engineering interface shown in fig. 16. Antibody species humanization is then further performed on this sequence, i.e. the amino acid at each position of the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index is masked in turn to obtain a second masked antibody sequence corresponding to each masked position. This is done in a manner similar to step S102, in which the amino acids at each position in the antibody sequence to be modified are sequentially masked when the human-like index is smaller than the index threshold, and is not described again here.

For example, assume that the species discrimination model outputs human-like indexes of 0.9001, 0.9777 and 0.8005 for the 3 first candidate human-like engineered antibody sequences BAA, CAA and DAA corresponding to the masking position X. Similarly, assume that the 3 first candidate human-like engineered antibody sequences ABA, AEA and AFA corresponding to another masking position have human-like indexes of 0.8771, 0.9553 and 0.8664 respectively, and that the 3 first candidate human-like engineered antibody sequences corresponding to the remaining masking position have human-like indexes of 0.8121, 0.9442 and 0.8999 respectively. Pairwise comparison shows that the maximum value is 0.9777. Since 0.9777 is less than the preset index threshold of 0.999, the first candidate human-like engineered antibody sequence CAA is determined as an antibody sequence that needs further engineering. Meanwhile, the masking position corresponding to CAA, i.e. the first position, now occupied by C, is marked as an engineered position, and the amino acids at each position of CAA are masked in turn, giving, for example, the second masking antibody sequences ZAA, CZA and CAZ.
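The below-threshold branch in this example can be sketched as follows. The `next_round` helper, the 0-based positions and the return convention (an empty list once the threshold is met) are assumptions for illustration; the mask symbol Z and the numbers come from the example.

```python
INDEX_THRESHOLD = 0.999  # example value given in the text

def next_round(best_sequence, best_position, best_index, engineered,
               threshold=INDEX_THRESHOLD, mask="Z"):
    """Record the engineered position; if the index is still below the
    threshold, return the masked sequences for the next round."""
    engineered = engineered | {best_position}
    if best_index >= threshold:
        return engineered, []  # target reached, no further masking needed
    masked = [
        best_sequence[:i] + mask + best_sequence[i + 1:]
        for i in range(len(best_sequence))
    ]
    return engineered, masked

# Best first-round candidate from the example: CAA with an index of 0.9777.
engineered, second_masked = next_round("CAA", 0, 0.9777, set())
```

All positions are masked again, including the engineered one, which matches the example's ZAA; the engineered position is only excluded later, when candidates are substituted.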

In step S302, the second masking antibody sequence corresponding to each masking position is input to the human antibody pre-training model, and P second candidate human modified amino acids corresponding to each masking position are obtained through the human antibody pre-training model;

in this embodiment, after the second masking antibody sequence corresponding to each masking position is obtained, the second masking antibody sequence corresponding to each masking position may be respectively input into the human antibody pre-training model, and the human amino acid prediction is performed on the second masking antibody sequence by the human antibody pre-training model, so as to predict amino acids suitable for modification at each masking position, that is, P second candidate human-like modified amino acids.

Specifically, the second masking antibody sequence corresponding to each masking position is input into a human antibody pre-training model, the P second candidate human modified amino acids corresponding to each masking position are obtained through the human antibody pre-training model, specifically, a second human modified amino acid set corresponding to each masking position is output through the human antibody pre-training model, and the P second candidate human modified amino acids are respectively selected from each second human modified amino acid set based on the modification probability value corresponding to each second human modified amino acid.

For example, assuming that a second masked antibody sequence is CZA, the human antibody pre-training model performs human amino acid prediction on the CZA sequence and predicts a plurality of human amino acids that can be substituted at the masking position Z. Then, based on the modification probability value corresponding to each second human modified amino acid, the P (e.g., 3) second candidate human modified amino acids with the larger modification probability values, such as F, G and H, can be selected from the plurality of human amino acids.

In step S303, for each of the other masked positions except the modified position, replacing the amino acid at the masked position with P second candidate human-like modified amino acids, respectively, to obtain P second candidate human-like modified antibody sequences corresponding to each masked position;

in this embodiment, after obtaining P second candidate human-like engineered amino acids corresponding to each masking position, since the modified position is marked in the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index, that is, the amino acid position does not need to be modified again, for each masking position in other masking positions except the modified position, the amino acid in the masking position may be replaced by the P second candidate human-like engineered amino acids corresponding to each masking position, so as to obtain P second candidate human-like engineered antibody sequences corresponding to each masking position after replacement.

Specifically, after the P second candidate human engineered amino acids are obtained for each masking position, consider a masking position other than the engineered position C, for example the masking position Z in the sequence CZA. If the P (e.g., 3) second candidate human engineered amino acids selected for Z from the plurality of predicted human amino acids, i.e. those with the higher modification probability values, are F, G and H, these 3 amino acids can each be substituted at the masking position Z of the second masking antibody sequence CZA, yielding the 3 second candidate human engineered antibody sequences CFA, CGA and CHA corresponding to the masking position Z.
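The substitution of step S303, which skips positions already marked as engineered, might look like the sketch below. The function name, the 0-based position indices and the candidate residues listed for position 0 are assumptions made for illustration; only CFA, CGA, CHA and CAT, CAG, CAE come from the examples in this description.

```python
def substitute_candidates(sequence, candidates_by_position, engineered_positions):
    """Build candidate sequences for every position that is not yet engineered.

    `candidates_by_position` maps a 0-based position to its P candidate residues.
    """
    result = {}
    for pos, residues in candidates_by_position.items():
        if pos in engineered_positions:
            continue  # an engineered position is not modified again
        result[pos] = [sequence[:pos] + aa + sequence[pos + 1:] for aa in residues]
    return result


candidates = substitute_candidates(
    "CAA",
    {0: ["B", "D", "E"], 1: ["F", "G", "H"], 2: ["T", "G", "E"]},
    engineered_positions={0},
)
print(candidates)  # {1: ['CFA', 'CGA', 'CHA'], 2: ['CAT', 'CAG', 'CAE']}
```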

In step S304, for each masking position in the other masking positions except the modified position, the P second candidate human-like modified antibody sequences are respectively input to the species discrimination model, and a human-like index corresponding to each second candidate human-like modified antibody sequence is output through the species discrimination model;

in this embodiment, after the P second candidate human-like engineered antibody sequences corresponding to each masking position are obtained, for each masking position in the other masking positions except the engineered position, the P second candidate human-like engineered antibody sequences may be respectively input to the species discrimination model, and the human-like index corresponding to each second candidate human-like engineered antibody sequence may be output through the species discrimination model.

Specifically, after the P second candidate human-like engineered antibody sequences corresponding to each masking position in the other masking positions except the engineered position are obtained, the P second candidate human-like engineered antibody sequences corresponding to each masking position may be respectively input into the species discrimination model for species prediction, the method may be similar to the method in which the antibody sequence to be engineered is input into the species discrimination model in step S101, and the human-like index corresponding to the antibody sequence to be engineered is output through the species discrimination model, and is not described here again, so as to obtain the human-like index corresponding to each second candidate human-like engineered antibody sequence.

For example, for a masking position other than the engineered position C, such as the masking position Z of the sequence CZA, the 3 second candidate human engineered antibody sequences CFA, CGA and CHA are input to the species discrimination model, which outputs human-like indices of 0.9401, 0.9998 and 0.8665 for them, respectively. Similarly, assume that for the 3 second candidate human engineered antibody sequences CAT, CAG and CAE corresponding to another masking position, the species discrimination model outputs human-like indices of 0.8891, 0.9663 and 0.8994, respectively.

In step S305, if the maximum value of the human-like index is greater than or equal to the index threshold, the second candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index is determined as the target engineered antibody sequence, and the masked position corresponding to the target engineered antibody sequence is marked as the engineered position.

In this embodiment, after the human-like index corresponding to each second candidate human-like modified antibody sequence is obtained, when the maximum value of the human-like index is greater than or equal to the index threshold, it may be understood that the species source of the second candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index is a human antibody sequence, and no modification is needed, the second candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index may be determined as the target modified antibody sequence, and the masked position corresponding to the target modified antibody sequence is marked as a modified position, so that the target object may accurately sense the amino acid position where modification occurs.

Specifically, after the human-like index corresponding to each second candidate human-like engineered antibody sequence is obtained, whether further humanization is needed may be determined by checking whether the maximum of these human-like indexes meets the humanization condition. The human-like indexes corresponding to the second candidate human-like engineered antibody sequences are compared pairwise to obtain the maximum value, which is then compared with the index threshold. When the maximum human-like index is greater than or equal to the index threshold, the species source of the corresponding second candidate human-like engineered antibody sequence can be understood to be human, so that sequence may be determined as the target engineered antibody sequence, and the masked position corresponding to the target engineered antibody sequence is marked as an engineered position. The target engineered antibody sequence of the current iteration, the engineered position and the corresponding human-like index can be displayed in real time through the antibody species engineering interface shown in fig. 16, which completes the humanization of the antibody sequence to be modified. Because the interface shown in fig. 16 reflects the modification status of the antibody sequence in real time, the target object can learn the effect of the modification promptly, which improves the user experience.

For example, assuming that the human-like indices corresponding to each of the second candidate human engineered antibody sequences, such as CFA, CGA, CHA, CAT, CAG, and CAE, are 0.9401, 0.9998, 0.8665, 0.8891, 0.9663, and 0.8994, respectively, the maximum value is 0.9998 by comparing the human-like indices two by two, and then the maximum value of 0.9998 can be compared with the preset index threshold value of 0.999, and the maximum value of 0.9998 is greater than the preset index threshold value of 0.999, the second candidate human engineered antibody sequence CGA can be determined as the target engineered antibody sequence, and the masked position G corresponding to the target engineered antibody sequence is labeled as the engineered position.

Further, assume instead that the human-like indexes corresponding to the second candidate human-like engineered antibody sequences CFA, CGA, CHA, CAT, CAG and CAE are 0.9401, 0.9888, 0.8665, 0.8891, 0.9663 and 0.8994, respectively. The maximum value 0.9888 is less than the preset index threshold of 0.999, so the second candidate human-like engineered antibody sequence CGA corresponding to the maximum value 0.9888 may be determined as the optimal modification of this iteration, and the masked position G is marked as an engineered position. Steps S301 to S304 are then repeated until the human-like index of the candidate human-like engineered antibody sequence of the current iteration is greater than or equal to the index threshold, which completes the humanization of the antibody sequence to be engineered.
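The comparison against the index threshold in the two examples above can be sketched as follows. The helper name `select_best` is hypothetical; the scores and the threshold 0.999 are the values used in the examples.

```python
def select_best(scores, threshold=0.999):
    """Return (best sequence, its human-like index, whether the threshold is met)."""
    best_sequence = max(scores, key=scores.get)
    best_index = scores[best_sequence]
    return best_sequence, best_index, best_index >= threshold


round_a = {"CFA": 0.9401, "CGA": 0.9998, "CHA": 0.8665,
           "CAT": 0.8891, "CAG": 0.9663, "CAE": 0.8994}
print(select_best(round_a))  # ('CGA', 0.9998, True): CGA is the target sequence

round_b = dict(round_a, CGA=0.9888)
print(select_best(round_b))  # ('CGA', 0.9888, False): another iteration is needed
```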

It is understood that the word "second" in the second masking antibody sequence, the second candidate human engineered amino acid and the second candidate human engineered antibody sequence in this example is used broadly and may refer to the second, third, fourth, ..., or M-th iteration, M being an integer greater than 1.

For example, in this embodiment, each position on the antibody sequence to be modified (assume 100 positions) is first masked one by one, and each masked sequence is input into the human antibody pre-training model to obtain a ranked set of possible modified amino acids, from which the P first candidate human-like engineered amino acids corresponding to each masking position are obtained. If the top five are taken, 5 × 100 = 500 first candidate human-like engineered sequences are obtained. These 500 sequences are then input into the species discrimination model to obtain their respective human-like indexes, and the sequence with the highest human-like index is taken as the optimal modification of this iteration; that is, each iteration modifies the amino acid at one position. The above operations are repeated until the human-like index of the candidate human-like engineered sequence exceeds 0.999, which gives the target modified sequence.
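The overall iteration can be sketched as a greedy loop. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: `propose` and `score` are hypothetical stand-ins for the human antibody pre-training model and the species discrimination model, the mask symbol Z is assumed, and the `max_iters` guard is an addition for safety.

```python
MASK = "Z"  # assumed mask symbol


def humanize(sequence, propose, score, threshold=0.999, top_p=5, max_iters=50):
    """Greedy humanization: change one position per iteration.

    propose(masked_sequence, position) -> residues ranked by probability (stand-in).
    score(sequence) -> human-like index in [0, 1] (stand-in).
    """
    current, current_index = sequence, score(sequence)
    engineered = set()
    for _ in range(max_iters):
        if current_index >= threshold or len(engineered) == len(current):
            break
        best = None  # (human-like index, position, candidate sequence)
        for pos in range(len(current)):
            if pos in engineered:
                continue  # engineered positions are not modified again
            masked = current[:pos] + MASK + current[pos + 1:]
            for aa in propose(masked, pos)[:top_p]:
                candidate = current[:pos] + aa + current[pos + 1:]
                index = score(candidate)
                if best is None or index > best[0]:
                    best = (index, pos, candidate)
        if best is None:
            break
        current_index, pos, current = best
        engineered.add(pos)
    return current, current_index, sorted(engineered)


# Toy stand-ins: always propose "H"; the index is the fraction of "H" residues.
result = humanize("MMM", lambda masked, pos: ["H"], lambda s: s.count("H") / len(s))
print(result)  # ('HHH', 1.0, [0, 1, 2])
```

With 100 positions and `top_p=5`, the inner loops score 500 candidates in the first iteration, matching the count in the example above.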

Alternatively, on the basis of the above embodiment corresponding to fig. 2, in another optional embodiment of the method for modifying an antibody species provided in the embodiment of the present application, as shown in fig. 4, step S103 inputs the first masking antibody sequence corresponding to each masking position into a human antibody pre-training model, and obtains P first candidate human-like modified amino acids corresponding to each masking position through the human antibody pre-training model, including:

in step S401, a first masking antibody sequence corresponding to each masking position is input to a human antibody pre-training model, and a first human engineered amino acid set corresponding to each masking position is output through the human antibody pre-training model;

in step S402, P first candidate human engineered amino acids are selected from each first human engineered amino acid set based on the engineered probability value corresponding to each first human engineered amino acid.

In this example, after the first masking antibody sequence corresponding to each masking position is obtained, each first masking antibody sequence can be input into the human antibody pre-training model, which performs human amino acid prediction on it to predict the amino acids suitable for modification at each masking position, giving a first human modified amino acid set corresponding to each masking position. Then, P first candidate human modified amino acids can be selected from each first human modified amino acid set based on the modification probability value corresponding to each first human modified amino acid. In this way, the first candidate human modified amino acids best suited to replace the amino acid at the masking position are screened from the set of possible modifications, which can improve the accuracy of antibody species humanization to a certain extent.

Specifically, after the first masking antibody sequence corresponding to each masking position is obtained, the first masking antibody sequence corresponding to each masking position may be respectively input into a human antibody pre-training model, human amino acid prediction is performed on the first masking antibody sequence through the human antibody pre-training model to predict amino acids suitable for modification at each masking position, so as to obtain a first human modified amino acid set corresponding to each masking position, then, the amino acids in the first human modified amino acid set may be subjected to descending order sorting based on the modification probability value corresponding to each first human modified amino acid, and based on the descending order sorting, P first candidate human modified amino acids may be respectively selected from each first human modified amino acid set.

For example, assuming that a first masking antibody sequence is XAA, when a human amino acid prediction is performed on the XAA sequence through a human antibody pre-training model, Q, B, C, F and D can be predicted as a plurality of human amino acids that can be replaced at the masking position X, wherein the modification probability values corresponding to each first type of human modified amino acid are 0.3, 0.6, 0.7, 0.4 and 0.5, respectively, and the first P (e.g., 3) first candidate human modified amino acids with larger modification probability values, such as B, C and D, can be selected from the plurality of human amino acids.
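The descending sort and top-P selection of steps S401 and S402 can be sketched with the probabilities from the example above. The function name `top_p_candidates` is hypothetical.

```python
def top_p_candidates(probabilities: dict[str, float], p: int = 3) -> list[str]:
    """Sort residues by modification probability (descending) and keep the first P."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:p]


position_x = {"Q": 0.3, "B": 0.6, "C": 0.7, "F": 0.4, "D": 0.5}
print(top_p_candidates(position_x))  # ['C', 'B', 'D']
```

The selected set is B, C and D as in the example; the sketch simply returns them in descending order of probability.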

Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment of the method for modifying an antibody species provided in the embodiment of the present application, as shown in fig. 5, the step of inputting the second masked antibody sequence corresponding to each masked position into a human antibody pre-training model, and obtaining P second candidate human-modified amino acids corresponding to each masked position through the human antibody pre-training model includes:

in step S501, a second masking antibody sequence corresponding to each masking position is input to the human antibody pre-training model, and a second human modified amino acid set corresponding to each masking position is output through the human antibody pre-training model;

in step S502, P second candidate human-like engineered amino acids are selected from each second human-like engineered amino acid set based on the engineered probability value corresponding to each second human-like engineered amino acid.

In this example, after the second masking antibody sequence corresponding to each masking position is obtained, each second masking antibody sequence can be input into the human antibody pre-training model, which performs human amino acid prediction on it to predict the amino acids suitable for modification at each masking position, giving a second human modified amino acid set corresponding to each masking position. Then, P second candidate human modified amino acids can be selected from each second human modified amino acid set based on the modification probability value corresponding to each second human modified amino acid. In this way, the second candidate human modified amino acids best suited to replace the amino acid at the masking position are screened from the set of possible modifications, which can improve the accuracy of antibody species humanization to a certain extent.

Specifically, after the second masking antibody sequence corresponding to each masking position is obtained, the second masking antibody sequence corresponding to each masking position may be respectively input into a human antibody pre-training model, the human antibody pre-training model is used to predict the human amino acids of the second masking antibody sequence, so as to predict the amino acids suitable for modification at each masking position, so as to obtain a second human modified amino acid set corresponding to each masking position, then, the amino acids in the second human modified amino acid set may be subjected to descending order sorting based on the modification probability value corresponding to each second human modified amino acid, and based on the descending order sorting, P second candidate human modified amino acids may be respectively selected from each second human modified amino acid set.

For example, assuming that a second masking antibody sequence is CZA, when a human amino acid prediction is performed on the CZA sequence by a human antibody pre-training model, F, G, V, W and H can be predicted as a plurality of human amino acids that can be replaced at the masking position Z, wherein the modification probability values corresponding to each second type of human modified amino acid are 0.8, 0.6, 0.4, 0.3 and 0.7, respectively, and the first P (e.g. 3) second candidate human modified amino acids with higher modification probability values, such as F, G and H, can be selected from the plurality of human amino acids.

Alternatively, on the basis of the embodiment corresponding to fig. 2, in another alternative embodiment of the method for modifying antibody species provided in the embodiment of the present application, as shown in fig. 6, before step S103, the training of the human antibody pre-training model includes the following steps:

in step S601, a human antibody sequence set is obtained;

in step S602, the human antibody sequence set is input to the basic human antibody pre-training model, and a human antibody vector corresponding to each human antibody sequence in the human antibody sequence set, an amino acid vector corresponding to each amino acid in each human antibody sequence, and a prediction probability value are output through the human antibody pre-training model;

in step S603, calculating a first prediction loss value based on the humanized antibody vector, the amino acid vector, and the prediction probability value;

in step S604, a parameter of the basic human antibody pre-training model is adjusted based on the first predicted loss value, so as to obtain a human antibody pre-training model.

In this embodiment, before the P first candidate human-like modified amino acids corresponding to each masking position are obtained through the human antibody pre-training model, the human antibody pre-training model may be trained so that these candidates are predicted more reliably. The obtained human antibody sequence set is input to the basic human antibody pre-training model, which outputs a human antibody vector corresponding to each human antibody sequence in the set, an amino acid vector corresponding to each amino acid in each human antibody sequence, and a prediction probability value. A first prediction loss value is then calculated based on the human antibody vector, the amino acid vector and the prediction probability value, and the parameters of the basic human antibody pre-training model are adjusted based on the first prediction loss value until the basic human antibody pre-training model converges, yielding a human antibody pre-training model that can be used to predict human amino acids.

The human antibody sequence set is specifically a set of human antibody sequences of which the species sources are all human and which are downloaded from a natural antibody library. The basic human antibody pre-training model may be embodied as a natural language model BERT, and may also be embodied as other model frameworks, which are not specifically limited herein. The first loss function is used for reflecting the similarity between the predicted value and the actual value of the human amino acid predicted by the human antibody pre-training model, and the first loss function may be specifically expressed as an exponential loss function or a square error loss function, and may also be expressed as other functional expressions, which are not specifically limited here.
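As one possible reading of the square error loss mentioned above, the sketch below compares the probability distribution predicted for a single masked position with a one-hot target for the true amino acid. The one-hot target, the function name and the example probabilities are assumptions for illustration; the description does not fix the exact form of the loss.

```python
def squared_error_loss(predicted: dict[str, float], true_residue: str) -> float:
    """Sum of squared differences between predicted probabilities and a one-hot target."""
    residues = set(predicted) | {true_residue}
    return sum(
        (predicted.get(aa, 0.0) - (1.0 if aa == true_residue else 0.0)) ** 2
        for aa in residues
    )


# Hypothetical prediction for one masked position whose true residue is "A".
loss = squared_error_loss({"A": 0.7, "G": 0.2, "S": 0.1}, "A")
print(round(loss, 4))  # 0.14
```

The loss is 0 when all probability mass sits on the true residue and grows as mass shifts to other residues.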

Specifically, all human antibodies, namely the human antibody sequence set, are extracted from a natural antibody library, and masked pre-training is performed on the human antibodies alone. The obtained human antibody sequence set is input into the basic human antibody pre-training model, which outputs a human antibody vector corresponding to each human antibody sequence in the set, an amino acid vector corresponding to each amino acid in each human antibody sequence, and a prediction probability value. Loss calculation is then performed on the human antibody vector, the amino acid vector and the prediction probability value based on the prediction loss function to obtain a first prediction loss value, and the parameters of the basic human antibody pre-training model are adjusted based on the first prediction loss value, for example by back propagation, until the model parameters tend to be stable and the basic human antibody pre-training model converges, giving the trained human antibody pre-training model. Finally, the human antibody pre-training model can be assembled into a modifier and used in combination with the antibody species engineering interface shown in fig. 16 to achieve precise modification of an antibody sequence.

Further, experiments show that the sequence representations provided by the human antibody pre-training model can be used to analyze whether a modification is reasonable. For example, as shown in fig. 19, the high-dimensional sequence representations learned by the human antibody pre-training model may be reduced to a two-dimensional space using the UMAP algorithm; the low-dimensional coordinates of the sequence before and after modification are marked as "squares" at the two ends of the arrow shown in fig. 19, and the arrow indicates the direction of modification. Several antibody sequences known to be human or murine are projected together with the engineered sequence as a reference background. As shown in fig. 19, the modified sequence moves from the murine center (indicated by "x" in fig. 19) toward the human sequences (indicated by "dots" in fig. 19), which verifies the rationality of the modification.

Further, this embodiment was compared experimentally with a traditional engineering technique in terms of engineering efficiency. On the same antibody engineering task, the human antibody pre-training model took about 3 minutes, whereas a traditional engineering technique such as Hu-mAb took 48 minutes.

Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for modifying an antibody species provided in the embodiment of the present application, as shown in fig. 7, the species discrimination model includes a language identification network and a species discriminator; step S101, in which the antibody sequence to be modified is input into the species discrimination model and the human-like index corresponding to the antibody sequence to be modified is output through the species discrimination model, comprises the following steps:

in step S701, the antibody sequence to be modified is input to the species discrimination model, and a vector of the antibody to be modified corresponding to the antibody sequence to be modified is obtained through the language identification network;

in step S702, the antibody vector to be modified is transmitted to the species discriminator, and the human-like index is output by the species discriminator.

In this embodiment, as shown in fig. 12, the collected antibody sequence to be modified may be input into a species discrimination model trained in advance, an antibody vector to be modified corresponding to the antibody sequence to be modified may be obtained first through a language identification network, then the antibody vector to be modified is transmitted to a species discriminator, and the species discrimination is performed on the antibody vector to be modified through the species discriminator to obtain the human-like index.

The language recognition network may be specifically represented as a model obtained by pretraining based on a natural language model BERT as illustrated in fig. 12, and may also be represented as another model, where the model is not specifically limited, and the language recognition network may be used to convert an antibody sequence to be modified into a vector representation convenient for computer recognition, and may improve the efficiency of modifying a corresponding antibody species to a certain extent. The species discriminator can be embodied as a full connection layer added after the language identification network, and can carry out classification task learning on the species labels for species classification.
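A species discriminator realized as a single fully connected layer over the antibody vector might be sketched as below. Treating the human-like index as the softmax probability of the human class is an assumption made here for illustration, as are the function name, the label names, and the toy weights and vector.

```python
import math


def human_like_index(antibody_vector, weights, biases, labels, human_label="human"):
    """Fully connected layer followed by softmax; returns the human-class probability."""
    logits = [
        sum(w * x for w, x in zip(row, antibody_vector)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probabilities = {label: e / total for label, e in zip(labels, exps)}
    return probabilities[human_label]


# Toy 2-dimensional antibody vector and a 2-class layer (hypothetical values).
index = human_like_index([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]], [0.0, 0.0], ["human", "mouse"])
print(round(index, 4))  # 0.8808
```

In the embodiment the antibody vector would come from the language identification network; here it is supplied by hand.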

Specifically, a target object (e.g., a pharmaceutical factory) may collect and download a series of antibody sequences to be modified from a natural antibody library, and may submit the antibody sequences to be modified to an antibody species prediction interface shown in fig. 15, the target object may perform a classification prediction on the acquired antibody sequences to be modified by clicking a submit button, may acquire an antibody vector to be modified corresponding to the antibody sequence to be modified through a language identification network, then transmit the antibody vector to be modified to a species discriminator, and perform a species discrimination on the antibody vector to be modified through the species discriminator to acquire a human-like index, for example, the human-like index corresponding to the heavy chain encoding "qvql.

Alternatively, on the basis of the embodiment corresponding to fig. 7, in another alternative embodiment of the method for modifying an antibody species provided in the embodiment of the present application, as shown in fig. 8, before step S701, the training of the species discrimination model includes the following steps:

in step S801, a candidate species tag antibody sequence set is obtained, where the candidate species tag antibody sequence set includes candidate species tag antibody sequences and species tags, and the number of candidate species tag antibody sequences corresponding to each species tag is the same;

in step S802, the candidate species tag antibody sequence set is input to a language identification network, and a candidate species tag antibody vector corresponding to each candidate species tag antibody sequence, an amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence, and a species category probability value are output through the language identification network;

in step S803, a species classification loss value is calculated based on the candidate species tag antibody vector, the amino acid vector, and the species class probability value;

in step S804, based on the species classification loss value and the species label, a parameter of the basic classifier is adjusted to obtain a species discriminator.

In this embodiment, before the antibody vector to be modified corresponding to the antibody sequence to be modified is obtained through the language identification network, the species discrimination model may be trained so that species prediction on the antibody sequence to be modified is more accurate. A candidate species tag antibody sequence set is first obtained and input to the language identification network, which outputs the candidate species tag antibody vector corresponding to each candidate species tag antibody sequence, the amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence, and the species class probability value. A species classification loss value is then calculated based on the candidate species tag antibody vector, the amino acid vector and the species class probability value, and the parameters of the basic classifier are adjusted based on the species classification loss value and the species tag until the basic classifier converges, yielding a species discriminator capable of accurately predicting species.

The candidate species tag antibody sequence set comprises candidate species tag antibody sequences and species tags, and the number of the candidate species tag antibody sequences corresponding to each species tag is the same, namely the same number of antibody sequences are randomly extracted from a natural antibody library for each species. The species classification loss value is used to reflect the similarity between the probability value and the true value of the species class predicted by the species discriminator, and the species classification loss function may be specifically expressed as a cross entropy loss function or a logarithmic loss function, and may also be expressed as other functional expressions, which are not specifically limited herein.
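Drawing the same number of sequences for each species tag can be sketched as follows. The function name, the fixed seed and the placeholder sequence strings are assumptions for illustration; a real library would hold actual antibody sequences.

```python
import random


def sample_balanced(library: dict[str, list[str]], n_per_species: int, seed: int = 0):
    """Randomly draw `n_per_species` sequences for each species tag.

    Raises ValueError if a species has fewer than `n_per_species` sequences.
    """
    rng = random.Random(seed)
    dataset = []
    for species, sequences in library.items():
        for sequence in rng.sample(sequences, n_per_species):
            dataset.append((sequence, species))
    return dataset


# Placeholder library; the sequence strings are not real antibodies.
library = {"human": ["h1", "h2", "h3", "h4"], "mouse": ["m1", "m2", "m3"]}
print(sample_balanced(library, n_per_species=2))
```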

Specifically, the training of the species discrimination model may be divided into a pre-training stage of the language identification network and a fine-tuning stage of the species discriminator. After the pre-trained language identification network is obtained, vector conversion may be performed on each candidate species tag antibody sequence through the language identification network to obtain the candidate species tag antibody vector, the amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence, and the species class probability value. Classification task learning is then performed on the species tags: a loss is calculated over the candidate species tag antibody vector, the amino acid vector, and the species class probability value using the cross entropy loss function to obtain the species classification loss value. Based on the species classification loss value and the species tag, the error is back-propagated to the fully connected layer parameters, which are updated until they become stable; at that point the basic classifier has converged and the species discriminator is obtained.
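By way of non-limiting illustration, the fine-tuning stage can be sketched as training a single fully connected layer on fixed antibody vectors with a cross entropy loss. This is a minimal sketch under stated assumptions: the input vectors stand in for the output of the pre-trained language identification network, plain stochastic gradient descent is used, and the names `train_classifier` and `predict_probs`, the learning rate, and the epoch count are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_probs(weights, bias, x):
    """Species class probabilities from one fully connected layer."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


def train_classifier(vectors, labels, num_classes, lr=0.5, epochs=200):
    """Fit the fully connected layer on fixed antibody vectors (encoder not updated)."""
    dim = len(vectors[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            probs = predict_probs(weights, bias, x)
            for c in range(num_classes):
                # Gradient of the cross entropy w.r.t. logit c is p_c - 1[c == y].
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
                bias[c] -= lr * grad
    return weights, bias
```

In practice the stopping rule would track when the parameters stabilize rather than use a fixed number of epochs; the fixed count here only keeps the sketch short.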

Further, as shown in fig. 17, this embodiment collects 481 antibody drugs that have passed phase I clinical trials and extracts the species origin of each drug from its name according to antibody nomenclature: a name containing "-u-" denotes a human antibody; "-zu-" denotes a humanized antibody (5% non-human, 95% human); "-xi-" denotes a chimeric antibody (25% non-human, 75% human); "-xizu-" denotes an antibody intermediate between a humanized antibody and a chimeric antibody; and "-o-" denotes a murine antibody. The collected antibody drugs are then classified by the trained species discriminator, and the 481 antibody drugs are placed in fig. 17 according to their predicted distribution over the six species classes. The results show that the trained species discriminator successfully identified the 176 human antibodies as human, and only one murine antibody was misclassified (as shown in the sixth row and third column of fig. 17). The third column of the third to fifth rows of fig. 17 shows cases of successful humanization, that is, drugs in which the heterologous antibody was successfully modified to human by grafting. The fourth column of the third to fifth rows, the fifth column of the third to fifth rows, the sixth column of the third row, and the seventh column of the fifth row of fig. 17 are examples of insufficiently engineered antibody drugs.
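By way of non-limiting illustration, extracting the species origin from a drug name can be sketched as a lookup on the infix that precedes the name stem. The sketch assumes that each name ends in the stem "mab" with the source infix immediately before it; the function name `origin_from_name` and the origin strings are hypothetical, and names that do not follow this pattern are returned as unknown.

```python
# Longest infix first, so that "-xizu-" is not mistaken for "-zu-" or "-u-".
INFIX_TO_ORIGIN = [
    ("xizu", "chimeric/humanized"),
    ("zu", "humanized"),
    ("xi", "chimeric"),
    ("u", "human"),
    ("o", "murine"),
]


def origin_from_name(name):
    """Return the species origin implied by an antibody drug name, or None."""
    lowered = name.lower()
    if not lowered.endswith("mab"):
        return None  # assumption: only names ending in the stem "mab" are handled
    prefix = lowered[:-3]
    for infix, origin in INFIX_TO_ORIGIN:
        if prefix.endswith(infix):
            return origin
    return None
```

For example, `origin_from_name("trastuzumab")` yields `"humanized"` and `origin_from_name("rituximab")` yields `"chimeric"`.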

Further, in the experiments of this embodiment, 218 phase I clinical antibodies are collected together with the proportion of patients who produce ADA after taking the drug; the ADA can be used to determine whether an antibody drug causes an immune rejection reaction in the human body. A high ADA, shown on the vertical axes of fig. 18(a) and fig. 18(b), indicates that most patients develop immune rejection. The lower the ADA, the better; with the human-like index shown on the horizontal axes of fig. 18(a) and fig. 18(b), the ADA is expected to be negatively correlated with the human-like index. The results show that, in predicting immune rejection, the trained species discriminator illustrated in fig. 18(b) has a correlation coefficient of -0.5389, whereas a conventional species discrimination technique (e.g., Hu-mAb) illustrated in fig. 18(a) has a correlation coefficient of -0.3513, that is, a weaker negative correlation.
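By way of non-limiting illustration, a correlation coefficient between the human-like index and the ADA can be computed as sketched below. The application does not state which correlation coefficient is used; the Pearson coefficient is assumed here, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value closer to -1 means that a higher human-like index goes more consistently with a lower ADA.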

Optionally, on the basis of the embodiment corresponding to fig. 8, in another optional embodiment of the method for modifying an antibody species provided in the embodiment of the present application, as shown in fig. 9, before step S802 of inputting the candidate species tag antibody sequence set into the language identification network and outputting, through the language identification network, a candidate species tag antibody vector corresponding to each candidate species tag antibody sequence, an amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence, and a species class probability value, the method further includes:

in step S901, a candidate unlabeled antibody sequence set is obtained;

in step S902, the basic language identification network is pre-trained based on the candidate unlabeled antibody sequence set and the mask strategy, so as to obtain a language identification network.

In this embodiment, in the pre-training stage of the language identification network, a candidate unlabeled antibody sequence set may be obtained, and the basic language identification network is pre-trained based on the candidate unlabeled antibody sequence set and a mask strategy to obtain the language identification network. By using unlabeled data together with the mask strategy, the language identification network can better learn the semantic and structural relationships between the amino acids at each position from the "context" of the antibody sequence, so that a language identification network with higher prediction accuracy can be obtained.

The candidate unlabeled antibody sequence set can be specifically represented by a set of unlabeled antibody sequences collected from a natural antibody library and screened.

Specifically, in the pre-training stage of the language identification network, a candidate unlabeled antibody sequence set may be obtained, and a mask strategy is adopted in training the language identification network. Specifically, one amino acid in an unlabeled antibody sequence may be randomly masked, and the basic language identification network is made to predict which amino acid was masked from the "context" of that amino acid in the unlabeled antibody sequence, so as to obtain the language identification network.
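By way of non-limiting illustration, randomly masking one amino acid to build a training example can be sketched as follows. The token string `[MASK]` and the function name `mask_one` are assumptions made for this sketch.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_one(sequence, rng):
    """Mask one randomly chosen amino acid.

    Returns the token list with one masked position, the masked position,
    and the original amino acid that the network should predict.
    """
    position = rng.randrange(len(sequence))
    tokens = list(sequence)
    target = tokens[position]
    tokens[position] = MASK
    return tokens, position, target
```

The network sees the returned tokens and is trained to recover `target` at `position` from the surrounding residues.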

Alternatively, on the basis of the embodiment corresponding to fig. 9, in another optional embodiment of the method for engineering an antibody species provided in the embodiment of the present application, as shown in fig. 10, each candidate unlabeled antibody sequence in the candidate unlabeled antibody sequence set includes a variable region and an invariant region; step S902, pre-training the basic language identification network based on the candidate unlabeled antibody sequence set and the mask strategy to obtain the language identification network, includes:

in step S1001, respectively performing high-frequency mask learning on a variable region of each candidate unlabeled antibody sequence based on a first mask learning rate and respectively performing low-frequency mask learning on an invariant region of each candidate unlabeled antibody sequence based on a second mask learning rate through a basic language identification network, where the first mask learning rate is greater than the second mask learning rate;

in step S1002, calculating a mask learning error rate of the basic language identification network on the candidate unlabeled antibody sequence set;

in step S1003, when the mask learning error rate is less than or equal to the learning threshold, the current base language identification network is determined as the language identification network.

In this embodiment, in the pre-training stage of the language identification network, high-frequency mask learning is performed on the variable region of each candidate unlabeled antibody sequence based on the first mask learning rate, and low-frequency mask learning is performed on the invariant region of each candidate unlabeled antibody sequence based on the second mask learning rate, so that the language identification network can learn the semantic and structural relationships between the amino acids at each position in a more targeted way. Further, the mask learning error rate of the basic language identification network on the candidate unlabeled antibody sequence set is calculated, and when the mask learning error rate is less than or equal to the learning threshold, the current basic language identification network is determined as the language identification network, so as to obtain a language identification network with a better effect.

Wherein the first mask learning rate is greater than the second mask learning rate. The first mask learning rate may be expressed as a high frequency learning rate, for example, 20% mask rate. The second mask learning rate may be expressed as a low frequency learning rate, such as a mask rate of 5%.
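By way of non-limiting illustration, masking the variable region more often than the invariant region can be sketched as follows. The per-residue flag list `is_variable` is a hypothetical input (how the regions are annotated is not specified here), and the function name `mask_by_region` and the `[MASK]` token are assumptions.

```python
import random


def mask_by_region(sequence, is_variable, rng, variable_rate=0.20, invariant_rate=0.05):
    """Mask each residue independently, with a higher rate in the variable region.

    is_variable[i] is True when residue i lies in the variable region.
    Returns the masked token list and a dict {position: original amino acid}.
    """
    if len(sequence) != len(is_variable):
        raise ValueError("sequence and region flags must have the same length")
    masked, targets = [], {}
    for i, (residue, variable) in enumerate(zip(sequence, is_variable)):
        rate = variable_rate if variable else invariant_rate
        if rng.random() < rate:
            masked.append("[MASK]")
            targets[i] = residue
        else:
            masked.append(residue)
    return masked, targets
```

With the default rates, roughly one variable-region residue in five and one invariant-region residue in twenty is masked per pass.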

Specifically, in addition to the basic mask strategy used in training the language identification network, this embodiment adopts a novel mask strategy: high-frequency mask learning is performed on the variable region of each candidate unlabeled antibody sequence based on the first mask learning rate, and low-frequency mask learning is performed on the invariant region of each candidate unlabeled antibody sequence based on the second mask learning rate. For example, the variable region is learned at high frequency (mask rate 20%), while the learning frequency is reduced for the invariant region (mask rate 5%). Then, when the model reaches a certain accuracy on the mask prediction task, the pre-trained model can be extracted as the language identification network for predicting and modifying antibody sequences. That is, the mask learning error rate of the basic language identification network on the candidate unlabeled antibody sequence set is computed, and when the mask learning error rate falls and becomes stable (for example, falls below 25%), training can be stopped and the language identification network corresponding to the current model parameters is obtained.
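By way of non-limiting illustration, the mask learning error rate and the stopping check can be sketched as follows. The function names are hypothetical, and the default threshold of 0.25 simply mirrors the 25% example given above.

```python
def mask_error_rate(predictions, targets):
    """Fraction of masked positions where the predicted amino acid is wrong."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need equally long, non-empty prediction and target lists")
    wrong = sum(1 for predicted, true in zip(predictions, targets) if predicted != true)
    return wrong / len(targets)


def should_stop(error_rate, learning_threshold=0.25):
    """Training stops once the error rate is at or below the learning threshold."""
    return error_rate <= learning_threshold
```

A training loop would evaluate `mask_error_rate` on the candidate unlabeled antibody sequence set after each pass and keep the current parameters as soon as `should_stop` returns true.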

Alternatively, on the basis of the above embodiment corresponding to fig. 9, in another optional embodiment of the method for engineering an antibody species provided in the embodiment of the present application, as shown in fig. 11, the step S901 obtains a set of candidate unlabeled antibody sequences, including:

in step S1101, a set of unlabeled antibody sequences is downloaded from an antibody database;

in step S1102, based on the number of occurrences of each unlabeled antibody sequence in the unlabeled antibody sequence set, a candidate unlabeled antibody sequence is screened from the unlabeled antibody sequence set to obtain a candidate unlabeled antibody sequence set.

In this embodiment, in the pre-training stage of the language identification network, a set of unlabeled antibody sequences may be downloaded from an antibody database. Then, based on the number of occurrences of each unlabeled antibody sequence in the unlabeled antibody sequence set, candidate unlabeled antibody sequences are screened from the unlabeled antibody sequence set to obtain a candidate unlabeled antibody sequence set, so that the subsequent unsupervised pre-training of the language identification network can be performed better on the screened candidate unlabeled antibody sequence set.

Specifically, a large number of antibody sequences, for example 1.8 billion antibody sequences, are first downloaded from a natural antibody database to form the unlabeled antibody sequence set. Then, based on the number of occurrences of each unlabeled antibody sequence in the unlabeled antibody sequence set, antibody sequences with data redundancy greater than 1 (redundancy > 1), that is, antibody sequences that occur at least twice in the unlabeled antibody sequence set, are selected in order to eliminate errors introduced by sequencing and assembly of the antibody sequences. This yields the candidate unlabeled antibody sequence set (for example, about 130 million unlabeled antibody sequences), which can subsequently be fed to a BERT language model for unsupervised pre-training.
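By way of non-limiting illustration, the redundancy screening can be sketched with a simple occurrence count. The function name `filter_by_redundancy` is hypothetical, and returning each retained sequence once (rather than keeping every copy) is an assumption of this sketch.

```python
from collections import Counter


def filter_by_redundancy(sequences, min_count=2):
    """Keep sequences observed at least min_count times (redundancy > 1 by default)."""
    counts = Counter(sequences)
    # Each retained sequence is returned once, in order of first occurrence.
    return [sequence for sequence, count in counts.items() if count >= min_count]
```

A sequence seen only once is treated as a possible sequencing or assembly error and is dropped. At the scale mentioned above, the counting would need to be done out of core or in a distributed fashion rather than in a single in-memory counter.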

Referring to fig. 20, fig. 20 is a schematic view of an embodiment of the apparatus for modifying an antibody species according to the present invention, wherein the apparatus 20 for modifying an antibody species includes:

the obtaining unit 201 is configured to input an antibody sequence to be modified into a species discrimination model, and output a human-like index corresponding to the antibody sequence to be modified through the species discrimination model;

the processing unit 202 is configured to, when the human-like index is smaller than an index threshold, sequentially mask amino acids at each position in an antibody sequence to be modified to obtain a first masked antibody sequence corresponding to each masked position;

the obtaining unit 201 is further configured to input the first masking antibody sequence corresponding to each masking position into the human antibody pre-training model, and obtain P first candidate human modified amino acids corresponding to each masking position through the human antibody pre-training model, where P is an integer greater than 0;

a processing unit 202, further configured to replace the amino acids at the masking positions with P first candidate human engineered amino acids corresponding to each masking position, respectively, to obtain P first candidate human engineered antibody sequences corresponding to each masking position;

the obtaining unit 201 is further configured to input the P first candidate human-like modified antibody sequences corresponding to each masking position into the species discrimination model, and output a human-like index corresponding to each first candidate human-like modified antibody sequence through the species discrimination model;

a determining unit 203, configured to determine, if the maximum value of the human-like index is greater than or equal to the index threshold, a first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index as a target modified antibody sequence, and mark a masking position corresponding to the target modified antibody sequence as a modified position;

the determining unit 203 is further configured to, if the maximum value of the human-like index is smaller than the index threshold, repeat the step of sequentially masking the amino acids at each position in the antibody sequence to be modified to obtain the first masked antibody sequence corresponding to each masked position, until the maximum value of the human-like index is greater than or equal to the index threshold.

Alternatively, on the basis of the embodiment corresponding to FIG. 20, in another embodiment of the apparatus for engineering antibody species provided in the embodiment of the present application,

the processing unit 202 is further configured to mark, if the maximum value of the human-like index is smaller than the index threshold, a masking position of the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index as an engineered position, and sequentially mask amino acids at each position of the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index to obtain a second masking antibody sequence corresponding to each masking position;

the obtaining unit 201 is further configured to input the second masking antibody sequence corresponding to each masking position into the human antibody pre-training model, and obtain P second candidate human-like modified amino acids corresponding to each masking position through the human antibody pre-training model;

a processing unit 202, further configured to, for each of the other masked positions except the modified position, replace an amino acid at the masked position with P second candidate human-like modified amino acids, respectively, to obtain P second candidate human-like modified antibody sequences corresponding to each masked position;

the obtaining unit 201 is further configured to, for each masking position in the other masking positions except the modified position, input P second candidate human-like modified antibody sequences to the species discrimination model, and output a human-like index corresponding to each second candidate human-like modified antibody sequence through the species discrimination model;

the determining unit 203 is further configured to determine, if the maximum value of the human-like index is greater than or equal to the index threshold, a second candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index as the target modified antibody sequence, and mark a masking position corresponding to the target modified antibody sequence as a modified position.
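By way of non-limiting illustration, the iterative procedure carried out by the above units can be sketched as a greedy loop. The callables `propose` and `score` are hypothetical stand-ins for the human antibody pre-training model and the species discrimination model; the toy versions below use a fixed reference sequence only so that the sketch runs. Skipping substitutions that leave the residue unchanged and capping the number of rounds are assumptions of this sketch.

```python
def humanize(sequence, propose, score, threshold, max_rounds=None):
    """Greedy modification: accept the single best substitution per round.

    propose(seq, pos) -> list of P candidate amino acids for the masked position
    score(seq)        -> human-like index of a sequence
    Returns (sequence, modified positions, whether the threshold was reached).
    """
    current = sequence
    modified = set()  # positions already marked as modified
    if score(current) >= threshold:
        return current, modified, True
    rounds = len(sequence) if max_rounds is None else max_rounds
    for _ in range(rounds):
        best = None  # (human-like index, position, candidate sequence)
        for pos in range(len(current)):
            if pos in modified:
                continue  # modified positions are not substituted again
            for amino_acid in propose(current, pos):
                if amino_acid == current[pos]:
                    continue  # assumption: skip no-op substitutions
                candidate = current[:pos] + amino_acid + current[pos + 1:]
                index = score(candidate)
                if best is None or index > best[0]:
                    best = (index, pos, candidate)
        if best is None:
            break  # nothing left to substitute
        best_index, best_pos, current = best
        modified.add(best_pos)
        if best_index >= threshold:
            return current, modified, True
    return current, modified, False


# Toy stand-ins (hypothetical): a fixed reference plays the role of both models.
REFERENCE = "EVQLVESGG"


def toy_propose(seq, pos):
    return [REFERENCE[pos]]  # P = 1


def toy_score(seq):
    return sum(a == b for a, b in zip(seq, REFERENCE)) / len(REFERENCE)
```

Calling `humanize("EVKLVQSGA", toy_propose, toy_score, 0.85)` performs two rounds, substituting positions 2 and 5, after which the toy human-like index reaches the threshold.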

Optionally, on the basis of the embodiment corresponding to fig. 20, in another embodiment of the apparatus for modifying antibody species provided in the embodiment of the present application, the obtaining unit 201 may specifically be configured to:

and selecting P second candidate human-like modified amino acids from each second human-like modified amino acid set respectively based on the modification probability value corresponding to each second human-like modified amino acid.
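By way of non-limiting illustration, selecting P candidates from a modified amino acid set based on the modification probability values can be sketched as follows. Taking the P highest-probability amino acids is an assumption of this sketch, as are the function name `top_p_candidates` and the dictionary layout of the probabilities.

```python
def top_p_candidates(probabilities, p):
    """Return the p amino acids with the highest modification probability.

    probabilities: dict mapping amino acid letter -> modification probability value.
    Ties are broken alphabetically so the result is deterministic.
    """
    if p <= 0:
        raise ValueError("P must be an integer greater than 0")
    ranked = sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))
    return [amino_acid for amino_acid, _ in ranked[:p]]
```

For example, with probabilities `{"A": 0.1, "S": 0.6, "T": 0.25, "K": 0.05}` and P = 2, the candidates are `["S", "T"]`.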

Alternatively, on the basis of the embodiment corresponding to fig. 20, in another embodiment of the apparatus for engineering antibody species provided in the embodiment of the present application, the processing unit 202 may be configured to:

acquiring a human antibody sequence set;

inputting an antibody sequence to be modified into a species discrimination model, and acquiring an antibody vector to be modified corresponding to the antibody sequence to be modified through a language identification network;

Alternatively, on the basis of the embodiment corresponding to fig. 20, in another embodiment of the apparatus for engineering antibody species provided in the embodiment of the present application, the processing unit 202 may specifically be configured to:

inputting the candidate species tag antibody sequence set into a language identification network, and outputting a candidate species tag antibody vector corresponding to each candidate species tag antibody sequence, an amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence and a species class probability value through the language identification network;

an obtaining unit 201, configured to obtain a candidate unlabeled antibody sequence set;

the processing unit 202 is further configured to pre-train the basic language identification network based on the candidate unlabeled antibody sequence set and the mask strategy, so as to obtain the language identification network.

downloading a set of unlabeled antibody sequences from an antibody database;

Another aspect of the present application provides a computer device. As shown in fig. 21, which is a schematic structural diagram of a computer device provided in an embodiment of the present application, the computer device 300 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 310 (e.g., one or more processors), a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) storing an application 331 or data 332. The memory 320 and the storage media 330 may provide transient or persistent storage. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a sequence of instructions for operating on the computer device 300. Still further, the central processing unit 310 may be configured to communicate with the storage medium 330 and to execute, on the computer device 300, the series of instruction operations in the storage medium 330.

The computer device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input-output interfaces 360, and/or one or more operating systems 333, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.

The computer device 300 described above is also used to perform the steps in the embodiments corresponding to fig. 2 to 11.

Another aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the method as described in the embodiments shown in fig. 2 to 11.

Another aspect of the application provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the method as described in the embodiments shown in fig. 2 to 11.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims

1. A method of engineering an antibody species comprising:

inputting an antibody sequence to be modified into a species discrimination model, and outputting a human-like index corresponding to the antibody sequence to be modified through the species discrimination model;

when the human-like index corresponding to the antibody sequence to be modified is smaller than an index threshold, sequentially masking amino acids at each position in the antibody sequence to be modified to obtain a first masking antibody sequence corresponding to each masking position;

inputting a first masking antibody sequence corresponding to each masking position into a human antibody pre-training model, and acquiring P first candidate human modified amino acids corresponding to each masking position through the human antibody pre-training model, wherein P is an integer greater than 0;

respectively replacing the amino acids at the masking positions with the P first candidate human modified amino acids corresponding to each masking position to obtain P first candidate human-like modified antibody sequences corresponding to each masking position;

inputting the P first candidate human-like modified antibody sequences corresponding to each masking position into the species discrimination model respectively, and outputting a human-like index corresponding to each first candidate human-like modified antibody sequence through the species discrimination model;

if the maximum value of the human-like index is greater than or equal to the index threshold, determining the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index as a target engineered antibody sequence, and marking the masked position corresponding to the target engineered antibody sequence as an engineered position;

and if the maximum value of the human-like index is smaller than the index threshold, repeating the step of sequentially masking the amino acid at each position in the antibody sequence to be modified to obtain a first masking antibody sequence corresponding to each masking position, until the maximum value of the human-like index is greater than or equal to the index threshold.

2. The method of claim 1, wherein the repeating, if the maximum value of the human-like index is smaller than the index threshold, of the step of sequentially masking the amino acid at each position in the antibody sequence to be modified to obtain the first masking antibody sequence corresponding to each masking position, until the maximum value of the human-like index is greater than or equal to the index threshold, comprises:

if the maximum value of the human-like index is smaller than the index threshold, marking the masking position of the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index as the modified position, and sequentially masking the amino acid at each position of the first candidate human-like modified antibody sequence corresponding to the maximum value of the human-like index to obtain a second masking antibody sequence corresponding to each masking position;

inputting a second masking antibody sequence corresponding to each masking position into the human antibody pre-training model, and acquiring P second candidate human-like modified amino acids corresponding to each masking position through the human antibody pre-training model;

for each of the masked positions other than the engineered position, replacing the amino acid at the masked position with the P second candidate human engineered amino acids, respectively, resulting in P second candidate human engineered antibody sequences corresponding to each of the masked positions;

for each of the masking positions other than the engineered position, inputting the P second candidate human engineered antibody sequences to the species discrimination model, respectively, and outputting the human-like index corresponding to each of the second candidate human engineered antibody sequences through the species discrimination model;

if the maximum value of the human-like index is greater than or equal to the index threshold, determining the second candidate human engineered antibody sequence corresponding to the maximum value of the human-like index as the engineered target sequence, and labeling the masked position corresponding to the engineered target sequence as the engineered position.

3. The method of claim 1, wherein inputting the first masking antibody sequence corresponding to each of the masking positions into a human antibody pre-training model, and obtaining the P first candidate human engineered amino acids corresponding to each of the masking positions through the human antibody pre-training model comprises:

inputting a first masking antibody sequence corresponding to each masking position into the human antibody pre-training model, and outputting a first type human modified amino acid set corresponding to each masking position through the human antibody pre-training model;

and selecting the P first candidate human modified amino acids from each first human modified amino acid set respectively based on the modification probability value corresponding to each first human modified amino acid.

4. The method of claim 2, wherein the inputting the second mask antibody sequence corresponding to each of the mask positions into the human antibody pre-training model, and obtaining the P second candidate human engineered amino acids corresponding to each of the mask positions through the human antibody pre-training model comprises:

inputting a second masking antibody sequence corresponding to each masking position into the human antibody pre-training model, and outputting a second human modified amino acid set corresponding to each masking position through the human antibody pre-training model;

and selecting the P second candidate human modified amino acids from each second human modified amino acid set respectively based on the modification probability value corresponding to each second human modified amino acid.

5. The method of claim 1, wherein the training of the human antibody pre-training model comprises the steps of:

acquiring a human antibody sequence set;

calculating a first predicted loss value based on the humanized antibody vector, the amino acid vector, and a predicted probability value;

and adjusting parameters of the basic human antibody pre-training model based on the first predicted loss value to obtain the human antibody pre-training model.

6. The method of claim 1, wherein the species discrimination model comprises a language recognition network and a species discriminator;

inputting the antibody sequence to be modified into a species discrimination model, and outputting a human-like index corresponding to the antibody sequence to be modified through the species discrimination model, wherein the human-like index comprises:

inputting the antibody sequence to be modified into a species discrimination model, and acquiring the antibody vector to be modified corresponding to the antibody sequence to be modified through the language identification network;

and transmitting the antibody vector to be modified to the species discriminator, and outputting the human-like index through the species discriminator.

7. The method of claim 6, wherein the training of the species discrimination model comprises the steps of:

obtaining a candidate species tag antibody sequence set, wherein the candidate species tag antibody sequence set comprises candidate species tag antibody sequences and species tags, and the number of the candidate species tag antibody sequences corresponding to each species tag is the same;

inputting the set of candidate species tag antibody sequences into the language identification network, and outputting a candidate species tag antibody vector corresponding to each candidate species tag antibody sequence, an amino acid vector corresponding to each amino acid in each candidate species tag antibody sequence and a species class probability value through the language identification network;

calculating a species classification loss value based on the candidate species tag antibody vector, the amino acid vector, and a species class probability value;

and adjusting parameters of a basic classifier based on the species classification loss value and the species tags to obtain the species discriminator.
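
By way of non-limiting illustration, two elements of claim 7 can be sketched as follows: checking that every species tag has the same number of sequences, and computing a classification loss from species class probability values. The sketch assumes a mean cross-entropy loss over the class probabilities only; how the antibody vector and amino acid vectors enter the loss is not specified in this claim and is not modeled. The names `is_balanced`, `softmax` and `classification_loss` are hypothetical.

```python
import math
from collections import Counter


def is_balanced(tagged_sequences: list[tuple[str, str]]) -> bool:
    """Return True if every species tag has the same number of sequences."""
    counts = Counter(tag for _, tag in tagged_sequences)
    return len(set(counts.values())) <= 1


def softmax(logits: list[float]) -> list[float]:
    """Convert raw class scores into species class probability values."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(prob_rows: list[list[float]], tag_indices: list[int]) -> float:
    """Mean cross-entropy between predicted probabilities and species tags.

    Cross-entropy is an assumed choice of loss, not one stated in the claim.
    """
    losses = [-math.log(max(p[t], 1e-12)) for p, t in zip(prob_rows, tag_indices)]
    return sum(losses) / len(losses)
```

A balanced set keeps the loss from being dominated by whichever species happens to have the most sequences, which is one plausible reading of why the claim requires equal counts per tag.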

8. The method of claim 7, wherein, before the inputting of the set of candidate species tag antibody sequences into the language identification network and the outputting, through the language identification network, of a candidate species tag antibody vector corresponding to each of the candidate species tag antibody sequences, an amino acid vector corresponding to each amino acid in each of the candidate species tag antibody sequences, and a species class probability value, the method further comprises:

acquiring a candidate unlabeled antibody sequence set;

and pre-training a basic language identification network based on the candidate unlabeled antibody sequence set and a mask strategy to obtain the language identification network.

9. The method of claim 8, wherein each candidate unlabeled antibody sequence in the set of candidate unlabeled antibody sequences comprises a variable region and an invariant (constant) region;

the pre-training of a basic language identification network based on the candidate unlabeled antibody sequence set and a mask strategy to obtain the language identification network comprises:

performing, through the basic language identification network, high-frequency mask learning on the variable region of each candidate unlabeled antibody sequence based on a first mask learning rate, and low-frequency mask learning on the invariant region of each candidate unlabeled antibody sequence based on a second mask learning rate, wherein the first mask learning rate is greater than the second mask learning rate;

calculating a mask learning error rate of the basic language identification network for the set of candidate unlabeled antibody sequences;

and determining the current basic language identification network as the language identification network when the mask learning error rate is less than or equal to a learning threshold.
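
By way of non-limiting illustration, the region-dependent masking of claim 9 can be sketched as follows. The sketch assumes that each "mask learning rate" is the per-position probability of masking a residue, that the variable region is given as a set of position indices, and that `#` serves as the mask symbol; these readings, and the names `mask_by_region` and `mask_error_rate`, are assumptions rather than details stated in the claim.

```python
import random

MASK = "#"  # hypothetical mask symbol


def mask_by_region(sequence, variable_positions, variable_rate, invariant_rate, rng):
    """Mask variable-region positions more often than invariant-region ones.

    Returns the masked sequence and a dict {position: original residue}.
    """
    if variable_rate <= invariant_rate:
        raise ValueError("the first (variable-region) rate must exceed the second")
    masked, targets = [], {}
    for i, residue in enumerate(sequence):
        rate = variable_rate if i in variable_positions else invariant_rate
        if rng.random() < rate:
            masked.append(MASK)
            targets[i] = residue
        else:
            masked.append(residue)
    return "".join(masked), targets


def mask_error_rate(predictions, targets):
    """Fraction of masked positions whose residue was predicted incorrectly."""
    if not targets:
        return 0.0
    wrong = sum(1 for i, aa in targets.items() if predictions.get(i) != aa)
    return wrong / len(targets)
```

Under these assumptions, pre-training would stop once `mask_error_rate`, computed over the candidate unlabeled antibody sequence set, falls to or below the learning threshold.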

10. The method of claim 8, wherein obtaining the set of candidate unlabeled antibody sequences comprises:

downloading a set of unlabeled antibody sequences from an antibody database;

and screening the candidate unlabeled antibody sequences from the unlabeled antibody sequence set based on the occurrence frequency of each unlabeled antibody sequence in the unlabeled antibody sequence set so as to obtain the candidate unlabeled antibody sequence set.
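
By way of non-limiting illustration, the frequency-based screening of claim 10 can be sketched as follows. The claim does not state the screening criterion, so the sketch assumes that a sequence is kept when it occurs at least `min_count` times and that duplicates are collapsed; both the criterion and the name `screen_by_frequency` are hypothetical.

```python
from collections import Counter


def screen_by_frequency(sequences: list[str], min_count: int = 2) -> list[str]:
    """Keep each distinct sequence that occurs at least min_count times.

    The threshold direction (keeping frequent sequences) is an assumption.
    """
    counts = Counter(sequences)
    # dict.fromkeys removes duplicates while preserving first-seen order
    return [seq for seq in dict.fromkeys(sequences) if counts[seq] >= min_count]
```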

11. An apparatus for modifying antibody species, comprising:

an obtaining unit, configured to input an antibody sequence to be modified into a species discrimination model and output, through the species discrimination model, a human-like index corresponding to the antibody sequence to be modified;

the obtaining unit is further configured to input a first masked antibody sequence corresponding to each masking position into a human antibody pre-training model, and obtain P first candidate human-like engineered amino acids corresponding to each masking position through the human antibody pre-training model, where P is an integer greater than 0;

the processing unit is further configured to replace the amino acid at each masking position with each of the P first candidate human-like engineered amino acids corresponding to that masking position, to obtain P first candidate human-like engineered antibody sequences corresponding to each masking position;

the obtaining unit is further configured to input the P first candidate human-like engineered antibody sequences corresponding to each of the masking positions into the species discrimination model, and output the human-like index corresponding to each of the first candidate human-like engineered antibody sequences through the species discrimination model;

a determining unit, configured to determine, if the maximum value of the human-like index is greater than or equal to the index threshold, the first candidate human-like engineered antibody sequence corresponding to the maximum value of the human-like index as a target engineered antibody sequence, and mark the masking position corresponding to the target engineered antibody sequence as an engineered position;

and the determining unit is further configured to, if the maximum value of the human-like index is smaller than the index threshold, repeat the step of sequentially masking the amino acid at each position in the antibody sequence to be modified to obtain the first masked antibody sequence corresponding to each masking position, together with the subsequent steps, until the maximum value of the human-like index is greater than or equal to the index threshold.
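
By way of non-limiting illustration, the cooperation of the units of claim 11 can be sketched as a loop that masks each position, obtains candidate amino acids, scores the resulting sequences, and repeats until the index threshold is reached. The callables `propose` and `score` are hypothetical stand-ins for the human antibody pre-training model and the species discrimination model. The sketch further assumes that each new round starts from the best candidate of the previous round, and it adds a `max_rounds` safeguard that the claim does not recite.

```python
from typing import Callable

MASK = "#"  # hypothetical mask symbol


def modify_sequence(
    sequence: str,
    propose: Callable[[str, int], list[str]],  # masked sequence, position -> P candidates
    score: Callable[[str], float],             # sequence -> human-like index
    threshold: float,
    max_rounds: int = 10,                      # safeguard, not part of the claim
) -> tuple[str, list[int]]:
    """Return the modified sequence and the list of engineered positions."""
    current, engineered = sequence, []
    if score(current) >= threshold:
        return current, engineered  # already human-like enough
    for _ in range(max_rounds):
        best_seq, best_pos, best_score = None, None, float("-inf")
        for pos in range(len(current)):
            # Mask one position, then try each candidate amino acid there.
            masked = current[:pos] + MASK + current[pos + 1:]
            for aa in propose(masked, pos):
                candidate = current[:pos] + aa + current[pos + 1:]
                value = score(candidate)
                if value > best_score:
                    best_seq, best_pos, best_score = candidate, pos, value
        if best_seq is None:
            break  # no candidates were proposed
        current = best_seq
        engineered.append(best_pos)
        if best_score >= threshold:
            break
    return current, engineered
```

Each round scores up to P candidates per position, so the cost per round grows with the sequence length multiplied by P.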

12. A computer device comprising a memory, a processor and a bus system, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.

13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 10.

14. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 10 when executed by a processor.