CN117174177A - Training method and device for protein sequence generation model and electronic equipment - Google Patents

Info

Publication number
CN117174177A
Authority
CN
China
Prior art keywords: protein sequence, target, model, training, sample
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202311102676.7A
Other languages
Chinese (zh)
Inventor
陈致远
薛洋
方晓敏
张肖男
何径舟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Publication of CN117174177A publication Critical patent/CN117174177A/en

Abstract

The application discloses a training method and device for a protein sequence generation model, and electronic equipment; it relates to the field of artificial intelligence, and in particular to natural language processing, deep learning, and the like. The specific implementation scheme is as follows: acquiring a first training sample corresponding to a target scene; inputting the first training sample into a generative large language model to obtain a target protein sequence generated by the generative large language model; and performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample, to obtain a protein sequence generation model of the target scene. In this way, the generative large language model is fine-tuned according to training samples of different scenes, so that protein sequence generation models for different scenes can be obtained; protein design for different scenes is unified under the generative large language model, different structural models need not be designed for different scenes, and applicability is high.

Description

Training method and device for protein sequence generation model and electronic equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to natural language processing, deep learning and the like, and specifically relates to a training method and device of a protein sequence generation model and electronic equipment.
Background
In the field of protein design, different structural models usually need to be designed for different task demands or scenes, and the protein sequence is obtained using the corresponding structural model.
Disclosure of Invention
The application provides a training method and device for a protein sequence generation model and electronic equipment. The specific scheme is as follows:
according to an aspect of the present application, there is provided a training method of a protein sequence generation model, comprising:
acquiring a first training sample corresponding to a target scene;
inputting the first training sample into a generative large language model to obtain a target protein sequence generated by the generative large language model;
and performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample, to obtain a protein sequence generation model of the target scene.
According to another aspect of the present application, there is provided a protein sequence generating method comprising:
obtaining model input data corresponding to a target scene;
and inputting the model input data into a protein sequence generation model of the target scene to obtain a target protein sequence generated by the protein sequence generation model, wherein the protein sequence generation model is trained by adopting the method of the embodiment of the aspect.
According to another aspect of the present application, there is provided a training apparatus for protein sequence generation model, comprising:
the first acquisition module is used for acquiring a first training sample corresponding to the target scene;
the second acquisition module is used for inputting the first training sample into the large generative language model so as to acquire a target protein sequence generated by the large generative language model;
and the first training module is used for performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample, to obtain a protein sequence generation model of the target scene.
According to another aspect of the present application, there is provided a protein sequence generating apparatus comprising:
the first acquisition module is used for acquiring model input data corresponding to a target scene;
and the second acquisition module is used for inputting the model input data into a protein sequence generation model of the target scene to acquire a target protein sequence generated by the protein sequence generation model, wherein the protein sequence generation model is obtained by training the device according to the embodiment of the other aspect.
According to another aspect of the present application, there is provided an electronic apparatus including:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of an embodiment of either of the above aspects.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method according to an embodiment of either of the above aspects.
According to a further aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described in an embodiment of either of the above aspects.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a flow chart of a training method of a protein sequence generation model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training process of a protein sequence generation model according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method of a protein sequence generation model according to another embodiment of the present application;
FIG. 4 is a flowchart of a training method of a protein sequence generation model according to another embodiment of the present application;
FIG. 5 is a flowchart of a training method of a protein sequence generation model according to another embodiment of the present application;
FIG. 6 is a schematic flow chart of a method for generating a protein sequence according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a method for generating a protein sequence according to another embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training device for generating a model of a protein sequence according to an embodiment of the present application;
FIG. 9 is a schematic diagram showing a protein sequence generating apparatus according to an embodiment of the present application;
FIG. 10 is a block diagram of an electronic device for implementing a training method of a protein sequence generation model of an embodiment of the application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The training method, the training device, the electronic equipment and the storage medium of the protein sequence generation model according to the embodiment of the application are described below with reference to the accompanying drawings.
Artificial intelligence is the discipline that uses computers to study and simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), spanning both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, and knowledge graph technologies, among others.
Natural language processing is an important direction in the fields of computer science and artificial intelligence, and the content of NLP research includes, but is not limited to, the following branch fields: text classification, information extraction, automatic abstracting, intelligent question and answer, topic recommendation, machine translation, topic word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.
Deep learning is a newer research direction in the field of machine learning. Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning helps interpret data such as text, images, and sounds. Its ultimate goal is to give machines human-like analytical learning ability, capable of recognizing text, image, and sound data.
FIG. 1 is a flowchart of a training method of a protein sequence generation model according to an embodiment of the present application.
The training method of the protein sequence generation model may be executed by the training device of the protein sequence generation model, which may be configured in electronic equipment. Protein sequence generation models for different scenes can be obtained by fine-tuning the generative large language model according to training samples of different scenes, so that protein design for different scenes is unified under the generative large language model, different structural models need not be designed for different scenes, and applicability is high.
The electronic device may be any device with computing capability, for example, may be a personal computer, a mobile terminal, a server, etc., and the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, etc., which have various operating systems, touch screens, and/or display screens.
In some embodiments, the name used for the protein sequence is not limited by the present application.
As shown in fig. 1, the training method of the protein sequence generation model comprises the following steps:
step 101, a first training sample corresponding to a target scene is obtained.
In the present application, a pre-constructed first training sample corresponding to the target scene may be obtained directly; alternatively, the description information of the target scene may be obtained and the first training sample corresponding to the target scene constructed according to that description information; or the first training sample may be obtained in other ways, which the present application does not limit.
In the present application, the first training sample may or may not include a sample protein sequence, and training samples corresponding to different scenes may be different.
For example, the first training sample corresponding to the scene a includes a start character, such as < S >, and the start character may be in other forms, which is not limited thereto.
For another example, the first training sample corresponding to scenario B includes descriptive information about the desired protein, such as descriptive information specifying generation of a horseshoe-shaped protein sequence.
Step 102, inputting the first training sample into the generative large language model to obtain a target protein sequence generated by the generative large language model.
In the application, the generative large language model can be a sequence generation model, and its structure can be any sequence-based generative model.
In the application, the generative large language model can refer to a deep neural network model with millions or billions of parameters; such models can process large-scale data and tasks and have achieved remarkable results in fields such as natural language processing, computer vision, and speech recognition. Compared with traditional small models, the generative large language model has a larger parameter scale and a deeper neural network structure, and exhibits stronger generalization ability and higher precision when processing large-scale data and tasks.
In the application, the encoder of the generative large language model can encode the first training sample to obtain encoding features, and the decoder of the generative large language model can decode the encoding features to obtain the target protein sequence.
For example, the target protein sequence may be a sequence of one or more amino acid symbols, where each amino acid symbol represents a corresponding amino acid and may be written as a character, such as a letter. For example, in the pre-training stage of FIG. 2 (a), the protein sequence "GSHS" is input to the model; each letter may represent an amino acid, < S > is the start character, and < E > is the end character.
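As a minimal, testable sketch of this framing (the `<S>`/`<E>` spellings follow the figure; the helper name `frame_sequence` is an assumption for illustration), a protein sequence can be wrapped with the start and end characters like this:

```python
# Sketch: frame an amino-acid symbol sequence with start/end characters,
# as in the "GSHS" pre-training example. Names here are illustrative.
START, END = "<S>", "<E>"

def frame_sequence(seq):
    """Wrap each amino-acid letter of `seq` between the start and end characters."""
    return [START] + list(seq) + [END]

print(frame_sequence("GSHS"))  # → ['<S>', 'G', 'S', 'H', 'S', '<E>']
```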
Alternatively, the first training sample may be encoded to obtain the encoding features, and the encoding features together with the amino acid symbol currently output by the decoder of the generative large language model may be decoded to obtain the decoder's next output amino acid symbol, until the decoder outputs an end character, yielding the target protein sequence. In this way the decoder decodes the encoding features together with the already-generated amino acid symbols to generate the next amino acid symbol, improving the accuracy of the target protein sequence.
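The decode-until-end-character loop just described can be sketched with a toy stand-in for the decoder. The predictor below is a hypothetical placeholder (it simply replays a fixed sequence), not the patent's trained model; only the loop structure mirrors the description:

```python
# Sketch of symbol-by-symbol decoding until the end character is produced.
START, END = "<S>", "<E>"

def toy_next_symbol(encoded, generated):
    # Hypothetical next-symbol predictor standing in for the real decoder:
    # emits a fixed target sequence, then the end character.
    target = ["G", "S", "H", "S"]
    return target[len(generated)] if len(generated) < len(target) else END

def generate_protein_sequence(encoded, max_len=10):
    """Decode one amino-acid symbol at a time, feeding previously generated
    symbols back in, until the end character (or a length cap) is reached."""
    generated = []
    for _ in range(max_len):
        symbol = toy_next_symbol(encoded, generated)
        if symbol == END:
            break
        generated.append(symbol)
    return "".join(generated)

print(generate_protein_sequence(encoded=None))  # → GSHS
```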
In the application, the generative large language model can be a directly acquired trained model, or can be obtained by training with second training samples of different scenes; the application is not limited in this regard.
As a possible implementation, second training samples corresponding respectively to a plurality of scenes may be obtained, where the second training samples corresponding to at least one of the plurality of scenes include a protein sequence; the second training samples corresponding to the plurality of scenes are then used to perform second training on the initial generative large language model, obtaining the generative large language model. Thus, the generative large language model can be obtained by training with the second training samples corresponding to the plurality of scenes.
For example, there are 20 scenes in total, where 15 scenes correspond to a second training sample comprising a protein sequence.
In the application, the second training samples corresponding to different scenes may be different, and the second training may be pre-training.
When the second training is carried out, an unsupervised mode can be adopted to train the initial generative large language model, thereby obtaining the generative large language model.
In the present application, the manner of obtaining the second training samples corresponding to the multiple scenes is similar to the manner of obtaining the first training samples corresponding to the target scene, so that the description thereof is omitted herein.
And step 103, performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample, to obtain a protein sequence generation model of the target scene.
Wherein the first training may refer to fine tuning.
In the application, the first training sample has a corresponding reference protein sequence, and the reference protein sequence can be used as a label of the first training sample.
According to the application, the parameters of the generative large language model can be adjusted according to the difference between the target protein sequence and the reference protein sequence, and the model after parameter adjustment continues to be trained until the training-end condition is met, thereby obtaining the protein sequence generation model.
The training ending condition may be that the model loss is smaller than a preset threshold, the training times reach a preset number, or other conditions may be set according to actual needs, which is not limited by the present application.
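A schematic of such a training loop with both end conditions (model loss below a preset threshold, or a preset number of training steps) might look like the following. The quadratic toy loss and plain gradient update are illustrative assumptions standing in for the sequence-difference loss and the real optimizer:

```python
def fine_tune(loss_fn, params, lr=0.1, loss_threshold=1e-3, max_steps=100):
    """Adjust parameters repeatedly until a training-end condition is met:
    loss below `loss_threshold`, or `max_steps` updates performed."""
    for step in range(1, max_steps + 1):
        loss, grad = loss_fn(params)
        if loss < loss_threshold:          # end condition 1: small enough loss
            return params, step, loss
        params = params - lr * grad        # hypothetical parameter update
    return params, max_steps, loss         # end condition 2: step budget spent

# Toy scalar loss (p - 2)^2 with its gradient, standing in for the
# target-vs-reference sequence difference.
loss_and_grad = lambda p: ((p - 2.0) ** 2, 2 * (p - 2.0))
params, steps, final_loss = fine_tune(loss_and_grad, params=0.0)
```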
For ease of understanding, the following description refers to fig. 2. As shown in fig. 2 (a), a generative large language model, which is a sequence generation model, may first be obtained through pre-training. In pre-training, protein sequences may be used for training, and the model output is also a protein sequence; for example, for the input protein sequence "GSHS", the model also outputs a protein sequence.
After pre-training is finished, fine-tuning can be performed according to the prompt data set to obtain a protein sequence generation model for the corresponding scene. In the prompt data set, sample 1 is an input of the generative model in the fine-tuning stage and protein sequence 1 is the reference protein sequence corresponding to sample 1; sample n is an input of the generative model in the fine-tuning stage and protein sequence n is the reference protein sequence corresponding to sample n.
In the embodiment of the application, the generative large language model can be subjected to first training according to the first training sample corresponding to the target scene, to obtain the protein sequence generation model of the target scene. Thus, the generative large language model is fine-tuned according to training samples of different scenes, so that protein sequence generation models for different scenes can be obtained; protein design for different scenes is unified under the generative large language model, different structural models need not be designed for different scenes, and applicability is high.
FIG. 3 is a flowchart of a training method of a protein sequence generation model according to another embodiment of the present application.
As shown in fig. 3, the training method of the protein sequence generation model includes:
step 301, a first training sample corresponding to a target scene is obtained.
Step 302, inputting the first training sample into the generative large language model to obtain the target protein sequence generated by the generative large language model.
In the present application, any implementation manner of the embodiments of the present application may be adopted in steps 301 to 302, which are not limited and are not repeated.
Step 303, determining model loss based on differences between the amino acid symbols at each position in the target protein sequence and the true amino acid symbols at the same position in the reference protein sequence.
According to the application, the loss corresponding to each position can be determined according to the difference between the amino acid symbol at each position in the target protein sequence and the real amino acid symbol at the same position in the reference protein sequence, and then the model loss can be obtained according to the sum of the losses corresponding to each position in the target protein sequence.
As one possible approach, the generative large language model can predict the probability that the amino acid symbol at each position in the target protein sequence belongs to each amino acid symbol; the cross entropy can then be calculated from the probability that the amino acid symbol at each position belongs to the true amino acid symbol corresponding to that position, yielding the first sub-loss corresponding to each position; the model loss is then determined from the sum of the first sub-losses over all positions in the target protein sequence. Thus, the loss corresponding to each position in the target protein sequence is obtained by cross-entropy calculation, and the model loss is obtained from the per-position losses, improving the accuracy of the model-loss calculation.
When calculating the cross entropy, for example, the negative logarithm of the probability that the amino acid symbol at each position belongs to the true amino acid symbol corresponding to that position may be taken, giving the first sub-loss for that position.
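As an illustration of this cross-entropy computation (the dictionary-based per-position distributions below are a simplification for clarity, not the patent's data format):

```python
import math

def cross_entropy_model_loss(position_probs, reference_sequence):
    """For each position, take -log of the predicted probability assigned to
    the true amino-acid symbol (the first sub-loss), then sum over positions
    to obtain the model loss."""
    total = 0.0
    for probs, true_symbol in zip(position_probs, reference_sequence):
        total += -math.log(probs[true_symbol])   # first sub-loss for this position
    return total

# Hypothetical predicted distributions for the reference sequence "GS".
probs = [{"G": 0.9, "S": 0.1}, {"G": 0.2, "S": 0.8}]
loss = cross_entropy_model_loss(probs, "GS")   # -ln(0.9) - ln(0.8)
```

Averaging such sums over multiple first training samples, as described above, then yields the batch-level model loss.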
Alternatively, the sums of the first sub-losses corresponding to the plurality of first training samples may be added, and then an average value may be calculated to obtain the model loss.
And step 304, performing first training on the generative large language model according to the model loss, to obtain a protein sequence generation model of the target scene.
According to the method, parameters of the generative large language model can be adjusted according to the model loss, and training of the model after parameter adjustment continues until the training-end condition is met, thereby obtaining the protein sequence generation model.
The explanation of the training ending condition may be referred to the above embodiments, so that the explanation is not repeated here.
According to the embodiment of the application, the model loss can be determined from the difference between the amino acid symbol at each position in the target protein sequence and the true amino acid symbol at the same position in the reference protein sequence, and the generative large language model can then be subjected to first training according to the model loss, to obtain the protein sequence generation model of the target scene. Thus, the model loss is calculated from the per-position differences between the target protein sequence and the reference protein sequence, improving the accuracy of the model-loss calculation and, in turn, the efficiency of model training.
Fig. 4 is a flowchart of a training method of a protein sequence generating model according to another embodiment of the present application.
As shown in fig. 4, the training method of the protein sequence generation model includes:
step 401, a first training sample corresponding to a target scene is obtained.
Step 402, inputting the first training sample into the generative large language model to obtain the target protein sequence generated by the generative large language model.
In the present application, any implementation manner of the embodiments of the present application may be adopted in steps 401 to 402, which are not limited and are not repeated.
Step 403, determining the second sub-loss corresponding to each position according to the square of the difference between the probability that the amino acid symbol at each position in the target protein sequence belongs to the true amino acid symbol corresponding to that position and the first tag value, and the square of the difference between the probability that the amino acid symbol at each position belongs to other amino acid symbols and the second tag value.
Wherein the first tag value may be used to characterize that the amino acid symbol at each position in the target protein sequence belongs to a true amino acid symbol and the second tag value may be used to characterize that the amino acid symbol at each position in the target protein sequence does not belong to a true amino acid symbol. For example, the first tag value may be 1 and the second tag value may be 0.
In the application, the generative large language model can predict the probability that the amino acid symbol at each position in the target protein sequence belongs to each amino acid symbol. The square of the difference between the probability that the amino acid symbol at each position belongs to the true amino acid symbol corresponding to that position and the first tag value, and the square of the difference between the probability that the amino acid symbol at each position belongs to each other amino acid symbol and the second tag value, can be calculated; the sum of these squares gives the second sub-loss corresponding to each position.
Step 404, determining model loss according to the second sub-loss corresponding to each position in the target protein sequence.
In the application, the model loss is determined according to the sum of the second sub-losses corresponding to each position in the target protein sequence. For example, the sum of the second sub-losses corresponding to the respective positions in the target protein sequence may be used as the model loss.
Alternatively, the sums of the second sub-losses corresponding to the plurality of first training samples may be added, and then the average value may be calculated, that is, the model loss may be obtained by calculating the mean square error.
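A sketch of this squared-difference loss, under the assumption of per-position probability dictionaries and tag values 1/0 as described (the data layout and function name are illustrative):

```python
def squared_error_model_loss(samples):
    """Per-position second sub-loss: sum of squared differences between each
    predicted symbol probability and its tag value (first tag value 1 for the
    true symbol, second tag value 0 otherwise). The model loss is the mean
    over samples of the per-sample sums, i.e. a mean-squared-error style loss."""
    per_sample_sums = []
    for position_probs, reference in samples:
        total = 0.0
        for probs, true_symbol in zip(position_probs, reference):
            for symbol, p in probs.items():
                tag = 1.0 if symbol == true_symbol else 0.0  # tag value
                total += (p - tag) ** 2                      # squared difference
        per_sample_sums.append(total)
    return sum(per_sample_sums) / len(per_sample_sums)

# One hypothetical single-position sample with reference symbol "G":
# (0.9 - 1)^2 + (0.1 - 0)^2 = 0.02
loss = squared_error_model_loss([([{"G": 0.9, "S": 0.1}], "G")])
```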
And step 405, performing first training on the generative large language model according to the model loss, to obtain a protein sequence generation model of the target scene.
In the present application, step 405 may be implemented by any one of the embodiments of the present application, which is not limited and will not be described in detail.
In the embodiment of the application, the second sub-loss corresponding to each position can be determined according to the square of the difference between the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to each position and the first tag value, and the square of the difference between the probability that the amino acid symbol at each position belongs to other amino acid symbols and the second tag value, and the model loss is determined according to the second sub-loss corresponding to each position in the target protein sequence. Therefore, the corresponding loss of each position in the target protein sequence can be obtained by calculating the square of the difference between the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol and the corresponding tag value and the square of the difference between the probability that the amino acid symbol at each position does not belong to the real amino acid symbol and the corresponding tag value, and further the model loss is obtained according to the loss of each position in the target protein sequence, so that the accuracy of calculating the model loss is improved.
FIG. 5 is a flowchart of a training method of a protein sequence generation model according to another embodiment of the present application.
As shown in fig. 5, the training method of the protein sequence generation model includes:
step 501, obtaining description information of a target scene.
The description information of the target scene can be text information or voice information; if it is voice information, the voice information can be recognized to obtain corresponding text information. The form of the scene's description information is not limited.
For example, the description information of a certain target scene is to randomly generate a protein sequence.
Step 502, obtaining target characters according to the description information of the target scene.
Wherein the target character may represent a starting position of the generated protein sequence. For example, the target character may be a start character, such as < S > or < start >, etc., and the present application is not limited to the form of the start character.
Step 503, according to the target character, obtaining a first training sample.
According to the application, the target character can be directly used as the first training sample. Wherein the first training sample may be used to prompt the random generation of protein sequences. For example, the first training sample is "< S >".
Alternatively, the first training sample may be obtained according to the target character and the description information of the target scene. For example, the first training sample is the text message "generate random protein sequences from < S >".
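Both ways of forming the first training sample (the target character alone, or the description information combined with the target character) can be sketched as follows; the function name and the exact prompt wording are assumptions for illustration:

```python
START = "<S>"

def build_first_training_sample(description=None, target_char=START):
    """Build the first training sample: either the target character by itself,
    or the scene's description information combined with the target character."""
    if description is None:
        return target_char                       # e.g. just "<S>"
    return f"{description} from {target_char}"   # text plus target character

print(build_first_training_sample())  # → <S>
print(build_first_training_sample("generate random protein sequences"))
```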
Step 504, inputting the first training sample into the generative large language model to obtain the target protein sequence generated by the generative large language model.
Step 505, performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample, to obtain a protein sequence generation model of the target scene.
In the present application, any implementation manner of the embodiments of the present application may be adopted in the steps 504 to 505, which are not limited and are not repeated.
For example, in fig. 2 (b), a randomly generated protein sequence can be obtained by inputting < S > into a protein sequence generation model obtained by training using the first training sample of the present embodiment.
In the embodiment of the application, the target character can be acquired according to the description information of the target scene, and the first training sample can be acquired according to the target character. Thus, a first training sample for prompting the random generation of a protein sequence can be obtained, and the first training sample can be used to train the generative large language model, thereby obtaining a protein sequence generation model capable of randomly generating protein sequences.
When acquiring the first training sample of the target scene, in some embodiments, the description information of the target scene may be acquired, a sample protein sequence of a first target type may be acquired according to the description information of the target scene, and then the first training sample may be acquired according to the sample protein sequence of the first target type.
Wherein the description information of the target scene may be to generate, according to a sample protein sequence of a first target type, a protein sequence of a second target type that binds to the sample protein sequence of the first target type.
Wherein the first training sample may be used to prompt the generation of a second target type of protein sequence that binds to the first target type of sample protein sequence.
In the application, the sample protein sequence of the first target type can be directly used as the first training sample. Illustratively, the description information of the target scene is to generate an active protein sequence that can bind to a receptor target protein sequence, and the first training sample may be a segment of receptor target protein sequence "AHYH".
Therefore, the sample protein sequence of the first target type can be obtained according to the description information of the target scene, and the first training sample can be obtained according to the sample protein sequence of the first target type. Thus, a first training sample for prompting the generation of a protein sequence of a second target type that binds to the sample protein sequence of the first target type can be obtained, and the generative large language model can be trained using the first training sample, so that a protein sequence generation model that can generate a protein sequence of a second target type binding to the sample protein sequence of the first target type can be obtained.
For example, in fig. 2 (b), an active protein sequence can be obtained by inputting a receptor target protein sequence into a protein sequence generation model trained using the first training sample.
In some embodiments, the description information of the target scene may be obtained, and according to the description information of the target scene, the sample protein sequence of the first target type and the sample protein sequence of the second target type are obtained, and then the sample protein sequence of the first target type and the sample protein sequence of the second target type are spliced to obtain the first training sample.
Wherein the description information of the target scene may be to generate a new protein sequence of a second target type according to the sample protein sequence of the first target type and the sample protein sequence of the second target type.
Wherein the first training sample may be used to prompt generation of a new protein sequence of a second target type from the sample protein sequence of the first target type and the sample protein sequence of the second target type.
In the application, the spliced sequence can be directly used as the first training sample. For example, the description information of the target scene may be to generate a new active protein sequence according to the receptor target protein sequence and a known active protein sequence. For example, the receptor target protein sequence "AHYH" may be spliced with the active protein sequence "GGGS" to yield the first training sample "AHYH+GGGS".
Alternatively, when splicing the sample protein sequences of the first target type and the sample protein sequences of the second target type, special symbols may be added to distinguish the sample protein sequences of the first target type from the sample protein sequences of the second target type.
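The splicing described above can be sketched as follows; this is an illustrative simplification, and the separator "+" standing in for the special distinguishing symbol is an assumption:

```python
# Hypothetical sketch: splice a first-target-type sequence (e.g. receptor
# target "AHYH") with a second-target-type sequence (e.g. active protein
# "GGGS"), separated by a special symbol so the two parts stay distinguishable.
def splice_sample_sequences(first_type_seq, second_type_seq, sep="+"):
    """Concatenate the two sample protein sequences with a separator symbol."""
    return f"{first_type_seq}{sep}{second_type_seq}"
```

For example, `splice_sample_sequences("AHYH", "GGGS")` yields the first training sample `"AHYH+GGGS"`.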
In the embodiment of the application, the sample protein sequence of the first target type and the sample protein sequence of the second target type can be obtained according to the description information of the target scene, and the first training sample can be obtained according to the sample protein sequence of the first target type and the sample protein sequence of the second target type. Thus, a first training sample for prompting the generation of a new protein sequence of a second target type from a sample protein sequence of a first target type and a sample protein sequence of a second target type can be obtained, and a protein sequence generation model capable of generating a new protein sequence of the second target type from the two sample protein sequences can be obtained by training the generative large language model using the first training sample.
For example, in fig. 2 (b), the receptor target protein sequence and the known active protein sequence are input into the protein sequence generation model trained using the first training sample, and a new active protein sequence can be obtained.
In some embodiments, description information of the target scene may be obtained, and first text information may be obtained according to the description information of the target scene, where the first text information may be used to describe requirements that the generated protein sequence needs to meet, and then a first training sample is obtained according to the first text information.
The description information of the target scene may be to generate, according to a natural language or protein description, a protein sequence that meets the requirements.
Wherein the first training sample may be used to prompt the generation of a satisfactory protein sequence.
In the application, the first text information can be directly used as the first training sample.
For example, the first text information may be the text information "generate a protein sequence whose result is a horseshoe shape".
Thus, the generative large language model is trained using the text information, and a protein sequence generation model capable of generating protein sequences based on instructions can be obtained. For example, in fig. 2 (b), the text description information is input into the protein sequence generation model, and a protein sequence satisfying the requirements of the text description can be obtained.
When the first training sample is acquired, optionally, a sample protein sequence may be acquired according to the description information of the target scene, and second text information may be generated according to the sample protein sequence and the first text information, and the second text information is used as the first training sample.
Illustratively, the first text information is "generate an active protein sequence that can bind to the receptor target protein sequence", and the obtained sample protein sequence is a segment of receptor target protein sequence "AHYH". Then, according to the receptor target protein sequence and the first text information, the second text information "generate an active protein sequence that can bind to it according to AHYH" can be obtained.
Illustratively, the first text information is "generate a new active protein sequence according to the receptor target protein sequence and the known active protein sequence". For example, the obtained receptor target protein sequence is "AHYH" and the active protein sequence is "GGGS"; then, according to the receptor target protein sequence "AHYH", the active protein sequence "GGGS" and the first text information, the second text information "generate a new active protein sequence according to AHYH+GGGS" can be obtained.
It will be appreciated that if the target character is obtained from the description information of the target scene, the second text information may be obtained from the target character and the first text information.
For example, the first text information is "generate a random protein sequence" and the target character is "< S >"; then the second text information "generate a random protein sequence from < S >" can be obtained according to the first text information and the target character.
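The composition of the second text information described in the examples above can be sketched as follows; the joining wording ("from", "+") is an assumption and the function is hypothetical:

```python
# Hypothetical sketch: compose the second text information from the first
# text information (a natural-language instruction) and the acquired inputs
# (sample protein sequences or a target character).
def build_second_text(first_text, *inputs, joiner="+"):
    """Embed the inputs into the instruction to form the second text info."""
    return f"{first_text} from {joiner.join(inputs)}"
```

For example, `build_second_text("generate a random protein sequence", "<S>")` covers the random-generation scenario, while `build_second_text("generate a new active protein sequence", "AHYH", "GGGS")` covers the splicing scenario.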
In the present application, for different scenes, the second text information can be generated according to the description information of the scene and the first text information, and the second text information is used as the first training sample corresponding to the scene. In this way, for different scenes, natural language or protein descriptions can be used as first training samples to train the generative large language model, so that a protein sequence generation model capable of generating protein sequences based on instructions can be obtained, with strong applicability.
In the embodiment of the application, the first text information can be acquired according to the description information of the target scene, wherein the first text information is used for describing the requirements to be met by the generated protein sequence, and the first training sample is acquired according to the first text information. Thus, a first training sample for prompting the generation of a protein sequence meeting the requirements can be obtained, and the generative large language model can be trained using the first training sample, so that a protein sequence generation model capable of generating protein sequences meeting the requirements can be obtained.
FIG. 6 is a flow chart of a method for generating a protein sequence according to an embodiment of the application.
As shown in FIG. 6, the protein sequence generation method comprises:
step 601, obtaining model input data corresponding to a target scene.
In the present application, the model input data may be target characters, such as start characters, protein sequences, text information describing requirements that the generated protein sequences need to satisfy, etc., or other forms of input data, which is not limited by the present application.
In the application, different scenes can correspond to different model input data.
The model input data of the target scene in the application can be obtained directly or according to the description information of the target scene or by other modes, and the application is not limited to this.
Step 602, inputting the model input data into a protein sequence generation model of the target scene to obtain a target protein sequence generated by the protein sequence generation model.
In the present application, the protein sequence generation model may be trained by the training method of the protein sequence generation model of the above embodiment.
In the application, the encoder of the protein sequence generating model can encode the model input data to obtain the encoding characteristics, and the decoder of the protein sequence generating model can decode the encoding characteristics to obtain the target protein sequence.
Alternatively, the model input data may be encoded to obtain encoding features, and the encoding features together with the amino acid symbol currently output by the decoder may be decoded to obtain the next amino acid symbol output by the decoder, until the decoder outputs an end character, thereby obtaining the target protein sequence. In this way, the decoder decodes the encoding features together with the already generated amino acid symbols to generate the next amino acid symbol, which improves the accuracy of the target protein sequence.
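The encode-then-autoregressive-decode loop described above can be sketched as follows; `encode` and `decode_step` stand in for the model's encoder and decoder and are assumptions, as is the end token "<end>":

```python
# Minimal sketch of autoregressive generation: feed already-generated
# amino-acid symbols back into the decoder until the end character appears.
END_TOKEN = "<end>"  # assumed end character

def generate_target_sequence(encode, decode_step, model_input, max_len=512):
    """Return the target protein sequence produced symbol by symbol."""
    features = encode(model_input)                 # encoding features
    generated = []
    for _ in range(max_len):
        # Decode the features together with the symbols generated so far.
        symbol = decode_step(features, generated)
        if symbol == END_TOKEN:
            break
        generated.append(symbol)
    return "".join(generated)
```

The `max_len` cap is a practical safeguard against a decoder that never emits the end character.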
In the embodiment of the application, the model input data corresponding to the target scene can be obtained, and the model input data is input into the protein sequence generation model to obtain the target protein sequence generated by the protein sequence generation model. Thus, the protein sequence required for different scenes can be generated by using the protein sequence generation model for different scenes.
FIG. 7 is a flow chart of a method for generating a protein sequence according to another embodiment of the present application.
As shown in FIG. 7, the protein sequence generation method comprises:
step 701, obtaining a prompt template, wherein prompt information in the prompt template is used for prompting a protein sequence recognition model to judge whether a target protein sequence meets requirements.
In the present application, the protein sequence recognition model may be a generative large language model, or may be obtained by fine-tuning a generative large language model. For the explanation of the generative large language model, reference may be made to the above embodiments, which is not repeated here.
In the present application, the prompt template corresponding to the target scene can be obtained according to the description information of the target scene and the correspondence between scene description information and prompt templates.
In the application, different scenes may adopt the same prompt template or different prompt templates, which is not limited in the present application.
For example, the description information of the target scene is to generate an active protein sequence that can bind to the receptor target protein sequence, and the prompt template may be "[protein sequence] is it an active protein sequence?", wherein the prompt information is "is it an active protein sequence". As another example, the description information of the target scene may be to generate a new active protein sequence according to the receptor target protein sequence and a known active protein sequence, and the prompt template may also be "[protein sequence] is it an active protein sequence?". For another example, the description information of the target scene is to generate a protein sequence whose result is a horseshoe shape, and the prompt template may be "[protein sequence] is it a horseshoe-shaped protein sequence?".
In some embodiments, when obtaining the prompt template, a sample protein sequence and the type of the sample protein sequence may be obtained, and the prompt template is obtained based on the sample protein sequence and the type of the sample protein sequence. Thus, the variety of acquisition forms of the prompt template is enriched.
For example, protein sequence a is an active protein sequence, protein sequence b is an active protein sequence, and protein sequence c is a receptor target protein sequence. Then, according to these sample protein sequences and their types, the prompt template "protein sequence a/active protein sequence, protein sequence b/active protein sequence, protein sequence c/" may be generated. Thus, based on the prompt template, the protein sequence recognition model can recognize the type of protein sequence c.
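This few-shot style template can be constructed mechanically; the sketch below is hypothetical and assumes the "sequence/type" layout of the example above:

```python
# Hypothetical sketch: build a few-shot prompt template from labelled sample
# protein sequences, leaving the query sequence's type blank for the
# protein sequence recognition model to fill in.
def build_fewshot_prompt(labelled_samples, query_sequence):
    """labelled_samples: list of (sequence, type) pairs used as examples."""
    shots = ", ".join(f"{seq}/{typ}" for seq, typ in labelled_samples)
    return f"{shots}, {query_sequence}/"
```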
Step 702, splicing the target protein sequence and the prompt template to obtain a spliced text.
In the application, the prompt template can include the prompt information, the protein sequence to be recognized, and the like, and the target protein sequence can be inserted into a preset position of the prompt template to obtain the spliced text, wherein the preset position can be the position of the protein sequence to be recognized in the prompt template.
For example, if the target protein sequence is protein sequence d and the prompt template is "[protein sequence] is it an active protein sequence?", protein sequence d may be inserted into the preset position of the prompt template to obtain the spliced text "[protein sequence d] is it an active protein sequence?".
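The insertion at the preset position can be sketched as a simple placeholder substitution; the placeholder text "[protein sequence]" is an assumption taken from the example above:

```python
# Hypothetical sketch: splice the target protein sequence into the prompt
# template at the preset (placeholder) position.
def fill_prompt_template(template, target_sequence,
                         placeholder="[protein sequence]"):
    """Insert the sequence to be recognized into the prompt template."""
    return template.replace(placeholder, f"[{target_sequence}]")
```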
Step 703, inputting the spliced text into the protein sequence recognition model to obtain a recognition result output by the protein sequence recognition model.
In the application, the protein sequence recognition model can encode and decode the spliced text and output the recognition result of the target protein sequence.
For example, the spliced text "[protein sequence d] is it an active protein sequence?" is input into the protein sequence recognition model, and the recognition result is "yes".
Step 704, determining whether the target protein sequence meets the requirement according to the identification result.
According to the recognition result output by the protein sequence recognition model, whether the target protein sequence meets the requirement can be determined. For example, the description information of the target scene is to generate, according to the receptor target protein sequence, an active protein sequence that can bind to it; the target protein sequence is protein sequence d, and the recognition result output by the protein sequence recognition model is "yes", so it can be determined that protein sequence d is an active protein sequence and meets the requirement.
In the present application, the protein sequence recognition model may be used to recognize each target protein sequence every time the protein sequence generation model outputs a target protein sequence; or to recognize the generated target protein sequences after the protein sequence generation model has generated a preset number of target protein sequences; or to recognize each target protein sequence after the protein sequence generation task is completed.
In the embodiment of the application, the prompt template can be obtained, the target protein sequence and the prompt template are spliced to obtain the spliced text, and the spliced text is recognized by using the protein sequence recognition model to determine whether the target protein sequence meets the requirement. Therefore, the protein sequence recognition model can be used to judge whether the protein sequence generated by the protein sequence generation model meets the requirement, and the prompt information is used for prompting, so that the accuracy of the output result of the protein sequence recognition model can be improved.
In order to achieve the above embodiment, the embodiment of the present application further provides a training device for a protein sequence generation model. Fig. 8 is a schematic structural diagram of a training device for generating a model of a protein sequence according to an embodiment of the application.
As shown in fig. 8, the training apparatus 800 for generating a protein sequence generation model includes:
a first obtaining module 810, configured to obtain a first training sample corresponding to a target scene;
a second obtaining module 820, configured to input the first training sample into the generative large language model, so as to obtain a target protein sequence generated by the generative large language model;
the first training module 830 is configured to perform a first training on the generative large language model according to a difference between the target protein sequence and a reference protein sequence corresponding to the first training sample, so as to obtain a protein sequence generative model of the target scene.
In one possible implementation manner of the embodiment of the present application, the first training module 830 is configured to:
determining model loss based on differences between the amino acid symbols at each position in the target protein sequence and the actual amino acid symbols at the same position in the reference protein sequence;
and performing, according to the model loss, a first training on the generative large language model to obtain the protein sequence generation model of the target scene.
In one possible implementation manner of the embodiment of the present application, the first training module 830 is configured to:
according to the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to each position, calculating cross entropy loss, and obtaining a first sub-loss corresponding to each position;
and determining model loss according to the first sub-loss corresponding to each position in the target protein sequence.
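A minimal sketch of this per-position cross-entropy, assuming a simplified dict-of-probabilities representation of the model output (the real implementation would operate on logit tensors):

```python
import math

# Sketch: the first sub-loss at each position is -log p(real symbol), and
# the model loss averages the sub-losses over all positions.
def cross_entropy_model_loss(position_probs, reference_sequence):
    """position_probs: one {amino-acid symbol: probability} dict per position;
    reference_sequence: the real amino-acid symbols at the same positions."""
    first_sub_losses = [
        -math.log(probs[real_symbol])
        for probs, real_symbol in zip(position_probs, reference_sequence)
    ]
    return sum(first_sub_losses) / len(first_sub_losses)
```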
In one possible implementation manner of the embodiment of the present application, the first training module 830 is configured to:
determining a second sub-loss corresponding to each position according to the square of the difference between the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to that position and the first tag value, and the square of the difference between the probability that the amino acid symbol at each position belongs to other amino acid symbols and the second tag value;
and determining model loss according to the second sub-loss corresponding to each position in the target protein sequence.
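A sketch of this squared-error variant under the same simplified representation; the tag values 1 and 0 are assumptions (the patent leaves them unspecified):

```python
# Sketch: at each position, compare the probability of the real symbol to the
# first tag value and the probabilities of all other symbols to the second
# tag value, then average the resulting second sub-losses.
def squared_model_loss(position_probs, reference_sequence,
                       first_tag=1.0, second_tag=0.0):
    """position_probs: one {amino-acid symbol: probability} dict per position."""
    second_sub_losses = []
    for probs, real_symbol in zip(position_probs, reference_sequence):
        loss = (probs[real_symbol] - first_tag) ** 2
        loss += sum((p - second_tag) ** 2
                    for symbol, p in probs.items() if symbol != real_symbol)
        second_sub_losses.append(loss)
    return sum(second_sub_losses) / len(second_sub_losses)
```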
In one possible implementation manner of the embodiment of the present application, the first obtaining module 810 is configured to:
acquiring description information of a target scene;
acquiring target characters according to the description information of the target scene; wherein the target character represents a start position of the generated protein sequence;
acquiring a first training sample according to the target character; wherein the first training sample is used to prompt the random generation of protein sequences.
In one possible implementation manner of the embodiment of the present application, the first obtaining module 810 is configured to:
acquiring description information of a target scene;
acquiring a sample protein sequence of a first target type according to the description information of the target scene;
acquiring a first training sample according to a sample protein sequence of a first target type; wherein the first training sample is used to prompt the generation of a second target type of protein sequence that binds to the first target type of sample protein sequence.
In one possible implementation manner of the embodiment of the present application, the first obtaining module 810 is configured to:
acquiring description information of a target scene;
according to the description information of the target scene, acquiring a sample protein sequence of a first target type and a sample protein sequence of a second target type;
splicing the sample protein sequence of the first target type and the sample protein sequence of the second target type to obtain a first training sample;
wherein the first training sample is used to prompt generation of a new protein sequence of a second target type from the sample protein sequence of the first target type and the sample protein sequence of the second target type.
In one possible implementation manner of the embodiment of the present application, the first obtaining module 810 is configured to:
acquiring description information of a target scene;
acquiring first text information according to the description information of the target scene; the first text information is used for describing requirements which are required to be met by the generated protein sequence;
acquiring a first training sample according to the first text information; wherein the first training sample is used for prompting the generation of a protein sequence meeting the requirements.
In one possible implementation manner of the embodiment of the present application, the first obtaining module 810 is configured to:
acquiring a sample protein sequence according to the description information of the target scene;
generating second text information according to the sample protein sequence and the first text information;
and taking the second text information as a first training sample.
In one possible implementation manner of the embodiment of the present application, the second obtaining module 820 is configured to:
encoding the first training sample to obtain encoding characteristics;
and decoding the encoding features and the amino acid symbol currently output by the decoder of the generative large language model to obtain the next amino acid symbol output by the decoder, until the decoder outputs an end character, thereby obtaining the target protein sequence.
In one possible implementation manner of the embodiment of the present application, the apparatus may further include:
the third acquisition module is used for acquiring second training samples corresponding to a plurality of scenes respectively, wherein the second training samples corresponding to at least one scene in the plurality of scenes comprise protein sequences;
and the second training module is used for performing second training on the initial generative large language model by using the second training samples corresponding to the plurality of scenes respectively, to obtain the generative large language model.
The explanation of the training method embodiment of the protein sequence generation model is also applicable to the training device of the protein sequence generation model of this embodiment, and therefore will not be repeated here.
In the embodiment of the application, the generated large language model can be subjected to first training according to the first training sample corresponding to the target scene to obtain the protein sequence generated model of the target scene. Therefore, the generated large language model is finely adjusted according to training samples of different scenes, so that protein sequence generated models of different scenes can be obtained, protein designs of different scenes are unified under the generated large language model, different structural models are not required to be designed for different scenes, and applicability is high.
In order to achieve the above embodiments, the embodiments of the present application further provide a protein sequence generating device. Fig. 9 is a schematic structural diagram of a protein sequence generating apparatus according to an embodiment of the present application.
As shown in fig. 9, the protein sequence generating apparatus 900 includes:
A first obtaining module 910, configured to obtain model input data corresponding to a target scene;
the second obtaining module 920 is configured to input the model input data to a protein sequence generating model of the target scene to obtain a target protein sequence generated by the protein sequence generating model, where the protein sequence generating model is trained by using the training method of the protein sequence generating model in the above embodiment.
In one possible implementation manner of the embodiment of the present application, the apparatus may include:
the third acquisition module is used for acquiring a prompt template, wherein prompt information in the prompt template is used for prompting the protein sequence identification model to judge whether the target protein sequence meets the requirement;
the splicing module is used for splicing the target protein sequence with the prompt template to obtain a spliced text;
the fourth acquisition module is used for inputting the spliced text into the protein sequence recognition model so as to acquire a recognition result output by the protein sequence recognition model;
and the determining module is used for determining whether the target protein sequence meets the requirement according to the identification result.
In one possible implementation manner of the embodiment of the present application, a third obtaining module is configured to:
Obtaining a sample protein sequence and the type of the sample protein sequence;
and generating the prompt template according to the sample protein sequence and the type of the sample protein sequence.
The explanation of the embodiment of the protein sequence generating method is also applicable to the protein sequence generating apparatus of this embodiment, and therefore, will not be repeated here.
In the embodiment of the application, the model input data corresponding to the target scene can be obtained, and the model input data is input into the protein sequence generation model to obtain the target protein sequence generated by the protein sequence generation model. Thus, the protein sequence required for different scenes can be generated by using the protein sequence generation model for different scenes.
According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product.
FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1002 or a computer program loaded from a storage unit 1008 into a RAM (Random Access Memory ) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An I/O (Input/Output) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a CPU (Central Processing Unit ), GPU (Graphic Processing Units, graphics processing unit), various dedicated AI (Artificial Intelligence ) computing chips, various computing units running machine learning model algorithms, DSP (Digital Signal Processor ), and any suitable processor, controller, microcontroller, etc. The calculation unit 1001 performs the respective methods and processes described above, for example, a training method of a protein sequence generation model. For example, in some embodiments, the training method of the protein sequence generation model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by the computing unit 1001, one or more steps of the training method of the protein sequence generation model described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the training method of the protein sequence generation model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems on Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be noted that the electronic device used in the protein sequence generation method according to the embodiments of the present application has a structure similar to that of the electronic device described above, and thus is not described again here.
According to an embodiment of the present application, the present application further provides a computer program product; when instructions in the computer program product are executed by a processor, the training method of the protein sequence generation model or the protein sequence generation method set forth in the above embodiments of the present application is performed.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (30)

1. A method of training a protein sequence generation model, comprising:
acquiring a first training sample corresponding to a target scene;
inputting the first training sample into a generative large language model to obtain a target protein sequence generated by the generative large language model;
and performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample to obtain a protein sequence generation model of the target scene.
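For illustration only (not part of the claims), the first training of claim 1 can be sketched as a simple fine-tuning loop. The `model`, `loss_fn`, and `optimizer_step` callables are assumed interfaces, not disclosed in the application:

```python
def fine_tune_for_scene(model, scene_samples, loss_fn, optimizer_step):
    """Hedged sketch of claim 1's first training: for each (first training
    sample, reference protein sequence) pair of the target scene, the
    generative large language model produces a target protein sequence, the
    difference to the reference sequence is measured, and the model is
    updated. All three callables are illustrative assumptions."""
    for sample, reference_seq in scene_samples:
        target_seq = model(sample)                 # target protein sequence generated by the model
        loss = loss_fn(target_seq, reference_seq)  # difference to the reference sequence
        optimizer_step(model, loss)                # one update step of the first training
    return model
```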
2. The method of claim 1, wherein the performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample to obtain the protein sequence generation model of the target scene comprises:
determining model loss based on differences between amino acid symbols at each position in the target protein sequence and real amino acid symbols at the same position in the reference protein sequence;
and according to the model loss, performing first training on the generated large language model to obtain a protein sequence generation model of the target scene.
3. The method of claim 2, wherein said determining model loss based on differences between amino acid symbols at each position in the target protein sequence and true amino acid symbols at the same position in the reference protein sequence comprises:
calculating a cross-entropy loss according to the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to each position, so as to obtain a first sub-loss corresponding to each position;
and determining the model loss according to the first sub-loss corresponding to each position in the target protein sequence.
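A minimal numeric sketch of claim 3 (illustrative only): the first sub-loss at each position is the cross-entropy of the probability assigned to the real amino acid symbol at that position, and the model loss aggregates the sub-losses. Averaging is an assumption, since the claim only says the loss is determined "according to" the sub-losses:

```python
import math

def cross_entropy_model_loss(true_symbol_probs):
    """`true_symbol_probs[i]` is the probability that the amino acid symbol
    at position i of the target protein sequence is the real amino acid
    symbol at that position. The per-position first sub-loss is -log(p);
    aggregating by averaging is an assumption."""
    first_sub_losses = [-math.log(p) for p in true_symbol_probs]
    return sum(first_sub_losses) / len(first_sub_losses)
```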
4. The method of claim 1, wherein the performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample to obtain the protein sequence generation model of the target scene comprises:
determining a second sub-loss corresponding to each position according to the square of the difference between the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to that position and a first tag value, and the square of the difference between the probability that the amino acid symbol at each position belongs to other amino acid symbols and a second tag value;
and determining the model loss according to the second sub-loss corresponding to each position in the target protein sequence.
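Claim 4's squared-difference loss can be sketched as follows (illustrative only). Taking the first tag value as 1, the second tag value as 0, and averaging over positions are assumptions not fixed by the claim:

```python
def squared_error_sub_loss(prob_dist, true_index, first_tag=1.0, second_tag=0.0):
    """At one position: square of (probability of the real amino acid symbol
    minus the first tag value) plus squares of (probabilities of the other
    amino acid symbols minus the second tag value)."""
    loss = 0.0
    for i, p in enumerate(prob_dist):
        target = first_tag if i == true_index else second_tag
        loss += (p - target) ** 2
    return loss

def squared_error_model_loss(prob_dists, true_indices):
    # Aggregate the per-position second sub-losses (averaging is an assumption).
    subs = [squared_error_sub_loss(d, t) for d, t in zip(prob_dists, true_indices)]
    return sum(subs) / len(subs)
```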
5. The method of claim 1, wherein the obtaining a first training sample corresponding to a target scene comprises:
acquiring description information of the target scene;
acquiring target characters according to the description information of the target scene; wherein the target character represents a start position of the generated protein sequence;
acquiring the first training sample according to the target character; wherein the first training sample is used to prompt random generation of a protein sequence.
6. The method of claim 1, wherein the obtaining a first training sample corresponding to a target scene comprises:
acquiring description information of the target scene;
acquiring a sample protein sequence of a first target type according to the description information of the target scene;
acquiring the first training sample according to the sample protein sequence of the first target type; wherein the first training sample is used to prompt generation of a protein sequence of a second target type that binds to the sample protein sequence of the first target type.
7. The method of claim 1, wherein the obtaining a first training sample corresponding to a target scene comprises:
acquiring description information of the target scene;
according to the description information of the target scene, a sample protein sequence of a first target type and a sample protein sequence of a second target type are obtained;
splicing the sample protein sequence of the first target type and the sample protein sequence of the second target type to obtain the first training sample;
wherein the first training sample is used to prompt generation of a new protein sequence of the second target type from the sample protein sequence of the first target type and the sample protein sequence of the second target type.
8. The method of claim 1, wherein the obtaining a first training sample corresponding to a target scene comprises:
acquiring description information of the target scene;
acquiring first text information according to the description information of the target scene; the first text information is used for describing requirements to be met by the generated protein sequence;
acquiring the first training sample according to the first text information; the first training sample is used for prompting generation of a protein sequence meeting requirements.
9. The method of claim 8, wherein the obtaining the first training sample from the first text information comprises:
Acquiring a sample protein sequence according to the description information of the target scene;
generating second text information according to the sample protein sequence and the first text information;
and taking the second text information as the first training sample.
10. The method of any of claims 1-9, wherein the inputting the first training sample into a generative large language model to obtain a target protein sequence generated by the generative large language model comprises:
encoding the first training sample to obtain an encoding feature;
and decoding the encoding feature and the amino acid symbol currently output by the decoder of the generative large language model to obtain the next amino acid symbol output by the decoder, until the decoder outputs an end character, so as to obtain the target protein sequence.
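The decoding step of claim 10 amounts to a greedy autoregressive loop, sketched here with an assumed single-step interface `step_fn(encoding, prefix)` standing in for one decoder step; the special symbols and `max_len` cap are also assumptions:

```python
def decode_protein_sequence(step_fn, encoding, start_symbol="<s>", end_symbol="</s>", max_len=512):
    """Feed the encoding feature and the amino acid symbol currently output
    by the decoder back in to obtain the next symbol, until an end character
    is produced (claim 10). `max_len` guards against runaway generation."""
    sequence = []
    current = start_symbol
    for _ in range(max_len):
        current = step_fn(encoding, sequence + [current])  # next amino acid symbol
        if current == end_symbol:
            break
        sequence.append(current)
    return "".join(sequence)
```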
11. The method of any of claims 1-9, wherein the generative large language model is trained by:
acquiring second training samples corresponding to a plurality of scenes respectively, wherein the second training samples corresponding to at least one scene in the plurality of scenes comprise protein sequences;
and performing second training on an initial generative large language model using the second training samples respectively corresponding to the plurality of scenes to obtain the generative large language model.
12. A method of generating a protein sequence, comprising:
obtaining model input data corresponding to a target scene;
inputting the model input data into a protein sequence generation model of the target scene to obtain a target protein sequence generated by the protein sequence generation model, wherein the protein sequence generation model is trained by the method according to any one of claims 1-11.
13. The method of claim 12, further comprising:
acquiring a prompt template, wherein prompt information in the prompt template is used for prompting a protein sequence recognition model to judge whether the target protein sequence meets the requirement;
splicing the target protein sequence with the prompt template to obtain a spliced text;
inputting the spliced text into the protein sequence recognition model to obtain a recognition result output by the protein sequence recognition model;
and determining whether the target protein sequence meets the requirement according to the identification result.
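The splicing-and-recognition flow of claim 13 reduces to string concatenation plus one model call; the template wording and the `recognizer` callable below are illustrative assumptions, not the application's actual prompt:

```python
def check_sequence(protein_seq, recognizer):
    """Splice the target protein sequence with a prompt template, feed the
    spliced text to a protein sequence recognition model, and read whether
    the requirement is met from its output (claim 13 sketch)."""
    prompt_template = "Does the following protein sequence meet the requirement? {seq}"
    spliced_text = prompt_template.format(seq=protein_seq)  # spliced text
    result = recognizer(spliced_text)                       # e.g. "yes" / "no"
    return result.strip().lower() == "yes"
```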
14. The method of claim 13, wherein the obtaining a hint template comprises:
obtaining a sample protein sequence and a type of the sample protein sequence;
generating the hint template according to the sample protein sequence and the type.
15. A training device for protein sequence generation models, comprising:
the first acquisition module is used for acquiring a first training sample corresponding to the target scene;
the second acquisition module is used for inputting the first training sample into a generative large language model so as to acquire a target protein sequence generated by the generative large language model;
and the first training module is used for performing first training on the generative large language model according to the difference between the target protein sequence and the reference protein sequence corresponding to the first training sample to obtain a protein sequence generation model of the target scene.
16. The apparatus of claim 15, wherein the first training module is to:
determining model loss based on differences between amino acid symbols at each position in the target protein sequence and real amino acid symbols at the same position in the reference protein sequence;
and performing first training on the generative large language model according to the model loss to obtain the protein sequence generation model of the target scene.
17. The apparatus of claim 16, wherein the first training module is to:
calculating a cross-entropy loss according to the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to each position, so as to obtain a first sub-loss corresponding to each position;
and determining the model loss according to the first sub-loss corresponding to each position in the target protein sequence.
18. The apparatus of claim 15, wherein the first training module is to:
determining a second sub-loss corresponding to each position according to the square of the difference between the probability that the amino acid symbol at each position in the target protein sequence belongs to the real amino acid symbol corresponding to that position and a first tag value, and the square of the difference between the probability that the amino acid symbol at each position belongs to other amino acid symbols and a second tag value;
and determining the model loss according to the second sub-loss corresponding to each position in the target protein sequence.
19. The apparatus of claim 15, wherein the first acquisition module is configured to:
acquiring description information of the target scene;
acquiring target characters according to the description information of the target scene; wherein the target character represents a start position of the generated protein sequence;
acquiring the first training sample according to the target character; wherein the first training sample is used to prompt random generation of a protein sequence.
20. The apparatus of claim 15, wherein the first acquisition module is configured to:
acquiring description information of the target scene;
acquiring a sample protein sequence of a first target type according to the description information of the target scene;
acquiring the first training sample according to the sample protein sequence of the first target type; wherein the first training sample is used to prompt generation of a protein sequence of a second target type that binds to the sample protein sequence of the first target type.
21. The apparatus of claim 15, wherein the first acquisition module is configured to:
acquiring description information of the target scene;
according to the description information of the target scene, a sample protein sequence of a first target type and a sample protein sequence of a second target type are obtained;
splicing the sample protein sequence of the first target type and the sample protein sequence of the second target type to obtain the first training sample;
wherein the first training sample is used to prompt generation of a new protein sequence of the second target type from the sample protein sequence of the first target type and the sample protein sequence of the second target type.
22. The apparatus of claim 15, wherein the first acquisition module is configured to:
acquiring description information of the target scene;
acquiring first text information according to the description information of the target scene; the first text information is used for describing requirements to be met by the generated protein sequence;
acquiring the first training sample according to the first text information; the first training sample is used for prompting generation of a protein sequence meeting requirements.
23. The apparatus of claim 22, wherein the first acquisition module is configured to:
acquiring a sample protein sequence according to the description information of the target scene;
generating second text information according to the sample protein sequence and the first text information;
and taking the second text information as the first training sample.
24. The apparatus of any of claims 15-23, wherein the second acquisition module is configured to:
encoding the first training sample to obtain an encoding feature;
and decoding the encoding feature and the amino acid symbol currently output by the decoder of the generative large language model to obtain the next amino acid symbol output by the decoder, until the decoder outputs an end character, so as to obtain the target protein sequence.
25. The apparatus of any of claims 15-23, further comprising:
the third acquisition module is used for acquiring second training samples corresponding to a plurality of scenes respectively, wherein the second training samples corresponding to at least one scene in the plurality of scenes comprise protein sequences;
and the second training module is used for performing second training on an initial generative large language model using the second training samples respectively corresponding to the plurality of scenes to obtain the generative large language model.
26. A protein sequence generating apparatus comprising:
the first acquisition module is used for acquiring model input data corresponding to a target scene;
the second acquisition module, configured to input the model input data into the protein sequence generation model of the target scene to obtain the target protein sequence generated by the protein sequence generation model, wherein the protein sequence generation model is trained using the apparatus according to any one of claims 15-25.
27. The apparatus of claim 26, further comprising:
the third acquisition module is used for acquiring a prompt template, wherein prompt information in the prompt template is used for prompting a protein sequence recognition model to judge whether the target protein sequence meets the requirement;
the splicing module is used for splicing the target protein sequence with the prompt template to obtain a spliced text;
the fourth acquisition module is used for inputting the spliced text into the protein sequence recognition model so as to acquire a recognition result output by the protein sequence recognition model;
and the determining module is used for determining whether the target protein sequence meets the requirement according to the identification result.
28. The apparatus of claim 27, wherein the third acquisition module is configured to:
obtaining a sample protein sequence and a type of the sample protein sequence;
Generating the hint template according to the sample protein sequence and the type.
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11 or to perform the method of any one of claims 12-14.
30. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-11 or to perform the method of any one of claims 12-14.
CN202311102676.7A 2023-06-25 2023-08-29 Training method and device for protein sequence generation model and electronic equipment Pending CN117174177A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310753591 2023-06-25
CN2023107535919 2023-06-25

Publications (1)

Publication Number Publication Date
CN117174177A 2023-12-05

Family

ID=88944131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311102676.7A Pending CN117174177A (en) 2023-06-25 2023-08-29 Training method and device for protein sequence generation model and electronic equipment

Country Status (1)

Country Link
CN (1) CN117174177A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614538A (en) * 2020-12-17 2021-04-06 厦门大学 Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN113539374A (en) * 2021-06-29 2021-10-22 深圳先进技术研究院 Method, device, medium and apparatus for generating protein sequence of high-thermal-stability enzyme
CN114036300A (en) * 2021-11-18 2022-02-11 阳光保险集团股份有限公司 Language model training method and device, electronic equipment and storage medium
CN114898811A (en) * 2022-05-26 2022-08-12 清华大学 Training method and device of protein training model, electronic equipment and storage medium
CN115114439A (en) * 2022-08-30 2022-09-27 北京百度网讯科技有限公司 Method and device for multi-task model reasoning and multi-task information processing
CN115280417A (en) * 2019-12-12 2022-11-01 贾斯特-埃沃泰克生物制品有限公司 Generating protein sequences based on template protein sequences using machine learning techniques
CN115512763A (en) * 2022-09-06 2022-12-23 北京百度网讯科技有限公司 Method for generating polypeptide sequence, method and device for training polypeptide generation model
CN115795009A (en) * 2022-11-24 2023-03-14 北京智谱华章科技有限公司 Cross-language question-answering system construction method and device based on generating type multi-language model
CN115994522A (en) * 2023-02-02 2023-04-21 阿里巴巴(中国)有限公司 Text processing method, article generating method and text processing model training method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEXEY STROKACH et al.: "Deep generative modeling for protein design", Curr Opin Struct Biol, 31 December 2022 (2022-12-31), pages 226-236 *
WU Qinglin et al.: "Application of generative models in protein sequence design", Chinese Journal of Applied Chemistry, vol. 39, no. 1, 31 December 2022 (2022-12-31), pages 3-17 *

Similar Documents

Publication Publication Date Title
CN115309877B (en) Dialogue generation method, dialogue model training method and device
EP4131076A1 (en) Serialized data processing method and device, and text processing method and device
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
CN113590776A (en) Text processing method and device based on knowledge graph, electronic equipment and medium
CN116166827B (en) Training of semantic tag extraction model and semantic tag extraction method and device
CN114841274B (en) Language model training method and device, electronic equipment and storage medium
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN112559715B (en) Attitude identification method, device, equipment and storage medium
EP3843090B1 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
CN117038099A (en) Medical term standardization method and device
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN114758649B (en) Voice recognition method, device, equipment and medium
CN114970666B (en) Spoken language processing method and device, electronic equipment and storage medium
CN116049370A (en) Information query method and training method and device of information generation model
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN114141236B (en) Language model updating method and device, electronic equipment and storage medium
CN116187301A (en) Model generation method, entity identification device, electronic equipment and storage medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN117174177A (en) Training method and device for protein sequence generation model and electronic equipment
CN113553413A (en) Dialog state generation method and device, electronic equipment and storage medium
CN114490969A (en) Question and answer method and device based on table and electronic equipment
CN114297380A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination