CN112786108A - Molecular understanding model training method, device, equipment and medium - Google Patents

Molecular understanding model training method, device, equipment and medium

Info

Publication number
CN112786108A
CN112786108A (application CN202110082654.3A)
Authority
CN
China
Prior art keywords
molecular
molecule
output
sequence
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110082654.3A
Other languages
Chinese (zh)
Other versions
CN112786108B (en)
Inventor
李宇琨
张涵
肖东凌
孙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110082654.3A priority Critical patent/CN112786108B/en
Publication of CN112786108A publication Critical patent/CN112786108A/en
Application granted granted Critical
Publication of CN112786108B publication Critical patent/CN112786108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Public Health (AREA)
  • Analytical Chemistry (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure discloses a training method, apparatus, device, and medium for a molecular understanding model, and relates to the field of computer technology, in particular to artificial intelligence technologies such as natural language processing and deep learning. The training method comprises the following steps: obtaining pre-training data, the pre-training data comprising a first molecular representation sequence sample and a second molecular representation sequence sample, the two being different molecular representation sequence samples of the same molecule; processing the first molecular representation sequence sample by using the molecular understanding model to obtain a pre-training output; and calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function. The present disclosure can improve the molecular understanding effect of the molecular understanding model.

Description

Molecular understanding model training method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to artificial intelligence technologies such as natural language processing and deep learning, and more particularly, to a method, an apparatus, a device, and a medium for training a molecular understanding model.
Background
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning); it spans both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
The Simplified Molecular Input Line Entry Specification (SMILES) is a specification that explicitly describes molecules using American Standard Code for Information Interchange (ASCII) strings. Based on SMILES, a molecule can be represented as one or more SMILES sequences. With the development of deep learning, deep learning techniques can be applied to the field of physical chemistry.
In the related art, molecular understanding is performed based on a single SMILES sequence of a molecule: a bidirectional Transformer encoder (BERT) model is used and is trained on a Masked Language Model (MLM) task.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and medium for training a molecular understanding model.
According to an aspect of the present disclosure, there is provided a method for training a molecular understanding model, including: obtaining pre-training data, the pre-training data comprising: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule; processing the first molecular representation sequence sample by using the molecular understanding model to obtain a pre-training output; and calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function.
According to another aspect of the present disclosure, there is provided a molecular processing method based on a molecular model, the molecular model including a molecular understanding model and an output network, the molecular understanding model being obtained by training with two different molecular representation sequence samples of the same molecule, the molecular processing method including: processing a molecular application input by using the molecular understanding model to obtain a hidden layer output, wherein the molecular application input comprises a fixed identifier when the output network is a molecular generation network; and processing the hidden layer output by using the output network to obtain a molecular application output.
According to another aspect of the present disclosure, there is provided a training apparatus for a molecular understanding model, including: an obtaining module, configured to obtain pre-training data, where the pre-training data includes: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule; a processing module, configured to process the first molecular representation sequence sample by using the molecular understanding model to obtain a pre-training output; and an updating module, configured to calculate a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and update the parameters of the molecular understanding model according to the pre-training loss function.
According to another aspect of the present disclosure, there is provided a molecular processing apparatus based on a molecular model including a molecular understanding model and an output network, the molecular processing apparatus including: a first processing module, configured to process a molecular application input by using the molecular understanding model to obtain a hidden layer output, where the molecular application input includes a fixed identifier when the output network is a molecular generation network; and the second processing module is used for processing the hidden layer output by adopting the output network so as to obtain the molecular application output.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical solutions of the present disclosure, the molecular understanding effect of the molecular understanding model can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a ninth embodiment of the present disclosure;
FIG. 10 is a schematic diagram according to a tenth embodiment of the present disclosure;
FIG. 11 is a schematic diagram according to an eleventh embodiment of the present disclosure;
FIG. 12 is a schematic diagram of an electronic device for implementing the molecular understanding model training method or the molecular processing method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
When deep learning is applied to the field of physical chemistry, the molecules of a large number of compounds can be converted into SMILES sequences, and the SMILES sequences are input into a BERT model like text for pre-training to obtain a pre-trained model; the pre-trained model can then be fine-tuned based on downstream molecular tasks.
In the related art, a single SMILES sequence of a molecule is input into a BERT model and pre-trained on an MLM task to obtain a pre-trained model for molecular understanding. Using only a single SMILES sequence does not fully exploit the characteristics of SMILES sequences, so the molecular understanding effect of the pre-trained molecular understanding model is poor.
In order to solve the problem of poor molecular understanding effect of the molecular understanding model existing in the related art, the present disclosure provides some examples as follows.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a training method of a molecular understanding model, which comprises the following steps:
101. Obtain pre-training data, the pre-training data comprising a first molecular representation sequence sample and a second molecular representation sequence sample, where the first molecular representation sequence sample and the second molecular representation sequence sample are two different molecular representation sequence samples of the same molecule.
102. Process the first molecular representation sequence sample using the molecular understanding model to obtain a pre-training output.
103. Calculate a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and update the parameters of the molecular understanding model according to the pre-training loss function.
In physical chemistry, a molecule is the smallest unit of a substance that can exist independently, is relatively stable, and retains the physicochemical properties of the substance. Molecules are composed of atoms, which combine into a molecule in a certain order and arrangement through certain forces; this order and arrangement may be referred to as the molecular structure. A molecule can therefore be characterized by its atoms and its molecular structure, and the physicochemical properties of a molecule depend not only on the type and number of the atoms that make up the molecule, but also on the molecular structure.
Natural Language Understanding (NLU) is an important component of Natural Language Processing (NLP), and the core task of NLU is to convert Natural Language into a formal Language that can be processed by a machine, and establish connection between Natural Language and the machine.
Similar to natural language understanding, molecular understanding means converting a molecular representation sequence into a molecular understanding representation, i.e., a representation that can be processed by a machine. For example, the molecular understanding representation may specifically be the probability distribution vector corresponding to each time step, where the i-th element (i = 1, ..., n) of the probability distribution vector is the probability of the i-th word in a vocabulary and n is the vocabulary size.
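For illustration only (the sizes below are arbitrary assumptions, not part of the disclosed method), the following minimal sketch shows what such a per-time-step probability distribution over an n-word vocabulary looks like:

```python
import torch

# 3 time steps, vocabulary of size n = 5; each row of `probs` is the probability
# distribution vector for one time step, and its i-th element is the probability
# of the i-th word in the vocabulary.
logits = torch.randn(3, 5)
probs = torch.softmax(logits, dim=-1)
print(probs)
print(probs.sum(dim=-1))  # the probabilities for each time step sum to 1
```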
In some embodiments, the molecular representation sequence is a SMILES sequence. Using SMILES sequences makes full use of the fact that the same molecule corresponds to multiple SMILES sequences; compared with understanding a molecule from a single SMILES sequence, different SMILES sequences of the same molecule allow the molecule to be understood better, which improves the molecular understanding effect of the molecular understanding model.
Based on SMILES, different SMILES sequences of the same molecule can be obtained. For example, referring to fig. 2, a plurality of SMILES sequences 202 corresponding to the same molecule 201 can be obtained.
Further, two different SMILES sequences can be randomly selected from the plurality of SMILES sequences. For example, the two SMILES sequences selected for the molecule 201 shown in fig. 2 may be the first and the third, i.e.: CC(Oc1ccccc1C(=O)O)=O and C(c1c(cccc1)OC(=O)C)(=O)O.
To distinguish from the application stage, the data used in the training stage may be referred to as samples; for example, in the application stage the data is called a molecular representation sequence, while in the training stage it is called a molecular representation sequence sample. Therefore, in the training stage, the above manner may be used to obtain two different molecular representation sequence samples corresponding to the same molecule, which may be referred to as the first molecular representation sequence sample and the second molecular representation sequence sample.
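As an illustration of obtaining such a pair, the following sketch uses RDKit's randomized SMILES output; RDKit is an assumption made for the example, since the disclosure does not prescribe any particular toolkit:

```python
from rdkit import Chem

# Build one pre-training pair: two different SMILES strings that encode the same
# molecule. doRandom=True asks RDKit for a randomized atom ordering instead of the
# canonical one, so repeated calls yield different but equivalent SMILES sequences.
def smiles_pair(canonical_smiles, max_tries=20):
    mol = Chem.MolFromSmiles(canonical_smiles)
    first = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    second = first
    for _ in range(max_tries):  # re-draw until the two representations differ
        second = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
        if second != first:
            break
    return first, second  # first and second molecular representation sequence samples

print(smiles_pair("CC(=O)Oc1ccccc1C(=O)O"))  # e.g. aspirin
```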
After the first molecular representation sequence sample and the second molecular representation sequence sample are obtained, the first molecular representation sequence sample can be input into the molecular understanding model. Initially, the molecular understanding model processes the first molecular representation sequence sample with its initial parameters, and the output of the molecular understanding model is called the pre-training output. A pre-training loss function can then be calculated based on the pre-training output and the second molecular representation sequence sample, and the parameters of the molecular understanding model are updated based on the pre-training loss function until the pre-training loss function converges; the parameters at convergence are used as the final parameters of the molecular understanding model. The pre-training loss function is not limited and may be, for example, a negative log-likelihood (NLL) function.
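A minimal sketch of one such update step is given below. It assumes a PyTorch model that maps the tokenized first SMILES sample plus the shifted target tokens to per-step vocabulary logits; the model signature, token ids, and pad id are illustrative assumptions rather than the disclosure's exact implementation:

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, first_ids, second_ids, pad_id=0):
    # first_ids:  tokenized first molecular representation sequence sample
    # second_ids: tokenized second molecular representation sequence sample (expected output)
    logits = model(first_ids, second_ids[:, :-1])          # (batch, steps, vocab)
    loss = F.nll_loss(                                      # negative log-likelihood (NLL)
        F.log_softmax(logits, dim=-1).transpose(1, 2),      # (batch, vocab, steps)
        second_ids[:, 1:],                                  # expected output at each time step
        ignore_index=pad_id,
    )
    optimizer.zero_grad()
    loss.backward()                                         # gradients of the pre-training loss
    optimizer.step()                                        # update the model parameters
    return loss.item()
```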
In some embodiments, as shown in fig. 3, the molecular understanding model may include an input layer and a hidden layer. The input layer may be an embedding layer that converts an input sequence into an input vector, and the hidden layer may specifically include an encoder 301 and a decoder 302. Taking the molecular representation sequence as a SMILES sequence as an example, when training the molecular understanding model, the first SMILES sequence sample is converted into an input vector by the embedding layer and then input into the encoder; the input vector is processed by the encoder and the decoder to obtain the pre-training output. The pre-training output is a probability distribution vector for each time step, and the pre-training loss function can then be calculated based on the pre-training output and the expected output sequence sample corresponding to each time step, namely the second SMILES sequence sample, so that the parameters of the molecular understanding model are updated based on the pre-training loss function.
In some embodiments, the encoder includes a first self-attention (self-attention) layer that employs a bi-directional self-attention mechanism; and/or the decoder comprises a second self-attention layer, wherein the second self-attention layer adopts a unidirectional self-attention mechanism.
With the encoder using a bidirectional self-attention mechanism and the decoder using a unidirectional self-attention mechanism, different self-attention mechanisms can be applied to different inputs, which provides more flexibility and can improve the molecular understanding effect of the molecular understanding model.
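The difference between the two mechanisms can be sketched with attention masks (illustrative only):

```python
import torch

def bidirectional_mask(seq_len):
    # Encoder side: every position may attend to every position.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def unidirectional_mask(seq_len):
    # Decoder side: position i may attend only to positions 0..i (no future positions).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```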
In some embodiments, the encoder further comprises a first shared network, and the decoder further comprises a second shared network, the first shared network and the second shared network having the same network structure and network parameters.
By having the encoder and the decoder share a network, the same features can be better utilized in the encoding and decoding processes, which improves the molecular understanding effect of the molecular understanding model.
For example, referring to fig. 4, the encoder and the decoder may be implemented based on a Transformer network. In fig. 4, the encoder and the decoder are each shown as comprising a plurality of Transformer layers; the structure of each Transformer layer on the encoder side is, for example, the structure of an encoder layer of the Transformer network. The Transformer layers have the same structure, and within each Transformer layer the decoder and the encoder are structurally similar: each may include a self-attention layer and a shared network. For distinction, the networks in the encoder may be called the first self-attention layer and the first shared network, and those in the decoder the second self-attention layer and the second shared network. The difference, as shown in fig. 4, is that the first self-attention layer 401 in the encoder is a bidirectional self-attention layer, while the second self-attention layer 402 in the decoder is a unidirectional self-attention layer; the shared networks of the two, i.e., the first shared network and the second shared network, may both be the feed-forward layer of a Transformer encoder layer.
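A hedged sketch of such parameter sharing follows; the sizes are arbitrary, and the point is only that the encoder branch and the decoder branch call the very same feed-forward module, so its structure and parameters are shared:

```python
import torch.nn as nn

hidden, ffn, heads = 256, 1024, 8

# One shared feed-forward network, reused by both the encoder layer and the decoder layer.
shared_ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.ReLU(), nn.Linear(ffn, hidden))

# Separate self-attention modules for the encoder side and the decoder side; the decoder
# one would be called with a causal (unidirectional) attention mask at run time.
encoder_self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
decoder_self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
```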
A sequence is a combination of multiple sequence units, and what counts as a sequence unit may differ across application scenarios; for example, in the field of Chinese NLP, a sequence unit may refer to each word in Chinese.
In the physicochemical field to which embodiments of the present disclosure relate, sequence units may be the characters that characterize a molecule; for example, for a SMILES sequence, the sequence units are ASCII characters, such as the C and O shown in fig. 2.
When outputting a sequence, the output may be produced sequence unit by sequence unit. For example, for the three characters A, B, and C, A may be output at the first time step, B at the second time step, and C at the third time step. In embodiments of the present disclosure, the characters already output may be used when outputting the current character: for example, character B may be output based on the already output character A, and character C may be output based on characters A and B.
Accordingly, in some embodiments, processing the first molecular representation sequence samples using the molecular understanding model to obtain a pre-training output comprises: performing bi-directional self-attention processing on the first molecular representation sequence samples by using the first self-attention layer of the encoder to obtain a bi-directional self-attention processing result; processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output; performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain a unidirectional self-attention processing result; processing the one-way self-attention processing result using the second shared network portion of the decoder to obtain the pre-training output.
For example, referring to fig. 5, an embedding layer 501 converts the first SMILES sequence sample into a first input vector; the first input vector is processed in turn by the first self-attention layer and the first shared network of the encoder 502, which outputs an encoded vector to the decoder. The other input of the decoder is a second input vector obtained by converting the generated output sequence through the embedding layer 501. The encoded vector and the second input vector are processed in turn by the second self-attention layer and the second shared network of the decoder 503, which outputs the pre-training output; the pre-training output may specifically be a probability distribution vector. The currently generated sequence unit can then be determined based on the probability distribution vector and used as the generated output for the subsequent time step, passed through the embedding layer and input to the decoder, and so on, generating sequence units one by one until a terminator is generated.
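A simplified sketch of this step-by-step generation is shown below, assuming the model returns, for each time step, a probability distribution over the vocabulary; greedy selection is used here purely for illustration:

```python
import torch

def generate(model, encoder_ids, start_id, end_id, max_steps=128):
    generated = [start_id]
    for _ in range(max_steps):
        probs = model(encoder_ids, torch.tensor([generated]))[0, -1]  # distribution for the next unit
        next_id = int(probs.argmax())        # choose the most probable sequence unit
        if next_id == end_id:                # stop once the terminator is generated
            break
        generated.append(next_id)            # feed it back in at the next time step
    return generated[1:]
```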
Through the generation process of the pre-training output, the accuracy of the pre-training output can be improved, and the molecular understanding effect of the molecular understanding model is further improved.
In this embodiment, two different molecular representation sequence samples of the same molecule are used to train the molecular understanding model, so the characteristics of molecular representation sequences can be fully utilized; compared with training the model with a single molecular representation sequence sample, the molecular understanding effect of the molecular understanding model can be improved.
Any of the above embodiments, or combinations of them, describes the pre-training process of the molecular understanding model, so the molecular understanding model can be used as a pre-trained model. The pre-trained model can then be fine-tuned to obtain a fine-tuned model, and the fine-tuned model can be used for downstream molecular processing tasks. The fine-tuned model may be called a molecular model; the training process of the molecular model is explained below.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. The present embodiment provides a training method of a molecular model, as shown in fig. 6, the training method includes:
601. Obtain fine-tuning training data.
602. Fine-tune the molecular understanding model using the fine-tuning training data to obtain the molecular model.
The molecular understanding model can be obtained by training according to any of the above embodiments.
Based on the differences in the molecular processing tasks, the fine-tuning training data can be selected accordingly.
In some embodiments, the molecular processing tasks may include: a molecular prediction task, and/or a molecular generation task. Further, molecular prediction tasks may include: a molecular classification task, and/or a molecular regression task. Further, the molecular generation task may include: generating new molecules, generating new molecules with specific properties, generating optimized molecules.
For molecular prediction tasks:
The corresponding fine-tuning training data may be referred to as first fine-tuning training data, which includes a first input sample and a first output sample. The first input sample is a molecular representation sequence sample, and the first output sample is the label data corresponding to the molecular representation sequence sample: if the prediction is a classification, the label data is a classification label; and/or, if the prediction is a regression, the label data is a regression label.
Classification labels can be annotated manually according to actual requirements, for example by labeling DNA sequences or proteins; labeled proteins may include seed storage proteins, isozymes, allozymes, and the like, where isozymes are different molecular forms of enzymes encoded at multiple gene loci, and allozymes are different molecular forms of enzymes encoded by different alleles at the same gene locus. Regression labels can likewise be annotated manually according to actual requirements.
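A minimal sketch of the prediction-task setup is given below; the encode() interface, hidden size, and use of the first position's representation are assumptions made for illustration, not the disclosure's exact design:

```python
import torch.nn as nn

class MolecularPredictionModel(nn.Module):
    def __init__(self, understanding_model, hidden_size=256, num_classes=2):
        super().__init__()
        self.understanding_model = understanding_model      # pre-trained, then fine-tuned
        self.head = nn.Linear(hidden_size, num_classes)     # use nn.Linear(hidden_size, 1) for regression

    def forward(self, smiles_ids):
        hidden = self.understanding_model.encode(smiles_ids)  # assumed hidden-layer output, (batch, steps, hidden)
        return self.head(hidden[:, 0])                        # predicted classification/regression value
```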
For the molecular generation task:
The corresponding fine-tuning training data may be referred to as second fine-tuning training data, and the second fine-tuning training data differs for each of the three molecular generation tasks.
Corresponding to the task of molecular generation of new molecules:
The second fine-tuning training data includes multiple sets of sample pairs, each set of sample pairs comprising a second input sample and a second output sample, where the second input sample includes a fixed identifier, the second output sample is a molecular representation sequence sample, and the molecular representation sequence samples in each set of sample pairs are molecular representation sequence samples of similar molecules that meet a preset similarity condition. The similarity condition may be set according to actual requirements, for example by treating molecules with similar atomic compositions and similar molecular structures as similar molecules; the criteria for judging similar atomic composition and/or similar molecular structure can likewise be set according to actual requirements.
Corresponding to the task of generating molecules with specific properties:
The second fine-tuning training data includes multiple sets of sample pairs, each set of sample pairs comprising a second input sample and a second output sample, where the second input sample includes a fixed identifier and an attribute sample, the second output sample is a molecular representation sequence sample, and the molecular representation sequence samples in each set of sample pairs are molecular representation sequence samples of similar molecules that have the attribute and meet the preset similarity condition. The similarity condition is as described for the new-molecule generation task above. Unlike the task of generating new molecules, this task also requires the new molecules to have specific attributes; therefore, the input sample also includes an attribute sample, and the molecule corresponding to the selected output sample must have the attribute corresponding to that attribute sample. Attributes are biological, physical, chemical, and other properties that a molecule possesses, such as toxicity or activity. In practice, the attribute values can be configured in advance; one attribute is then selected as the attribute sample, the molecular representation sequences of similar molecules having the selected attribute are used as second output samples, the vector corresponding to the fixed identifier carrying the attribute sample information is used as the input vector of the molecular understanding model, and the molecular model for the corresponding task is trained with this input vector and the second output samples. The vector corresponding to the fixed identifier carrying the attribute sample information can be obtained according to the principle described for the application stage below.
Corresponding to the molecular generation task for generating optimized molecules:
The second fine-tuning training data includes multiple sets of sample pairs, each set of sample pairs comprising a second input sample and a second output sample, where the second input sample includes an input molecular representation sequence sample, the second output sample is an output molecular representation sequence sample, and the molecule corresponding to the output molecular representation sequence sample is an optimized molecule of the molecule corresponding to the input molecular representation sequence sample. The optimized molecule may be selected according to requirements, for example by taking a molecule with a certain property as the optimized molecule of the molecule to be optimized.
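Purely as an illustration of the three kinds of fine-tuning sample pairs described above (all strings, identifier names, and attribute names are placeholder assumptions, not data from the disclosure):

```python
# Generating new molecules: fixed identifier in, similar-molecule SMILES out.
new_molecule_pair = {"input": ["[FIXED]"], "output": "CC(=O)Oc1ccccc1C(=O)O"}

# Generating new molecules with a specific attribute: fixed identifier plus attribute sample in.
attribute_pair = {"input": ["[FIXED]", {"activity": 0.8}], "output": "CC(=O)Oc1ccccc1C(=O)O"}

# Generating optimized molecules: input molecule SMILES in, optimized molecule SMILES out.
optimization_pair = {"input": ["CC(=O)Oc1ccccc1C(=O)O"], "output": "O=C(O)c1ccccc1O"}
```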
In this embodiment, the molecular understanding model is fine-tuned to obtain a molecular model, which can be applied to various downstream molecular tasks, reducing the training workload and improving training efficiency.
The above embodiments illustrate the training process of the molecular model based on which molecules can be processed in the application phase to accomplish various molecular processing tasks.
Fig. 7 is a schematic diagram according to a seventh embodiment of the present disclosure, which provides a molecular processing method based on a molecular model, where the molecular model includes a molecular understanding model and an output network, the molecular understanding model is obtained by using two different molecular representation sequence samples of the same molecule, and the processing method includes:
701. Process a molecular application input using the molecular understanding model to obtain a hidden layer output, where the molecular application input includes a fixed identifier when the output network is a molecular generation network.
702. Process the hidden layer output using the output network to obtain a molecular application output.
The output network differs depending on the molecular processing task.
For example, the output network is a molecular prediction network corresponding to the molecular prediction task, and the output network is a molecular generation network corresponding to the molecular generation task.
Further, the molecular prediction network and/or the molecular generation network may also be different depending on the specific molecular prediction task and/or the specific molecular generation task.
In addition, the molecular application inputs and molecular application outputs are different based on the difference in molecular processing tasks.
For molecular prediction tasks:
Referring to fig. 8, the molecular application input is the molecular representation sequence to be predicted, exemplified by the SMILES sequence in fig. 8, and the molecular application output is a predicted value. The molecular representation sequence to be predicted may be a single molecular representation sequence or a concatenation of multiple molecular representation sequences.
After the SMILES sequence is input into the molecular understanding model 801, a predicted value corresponding to the SMILES sequence is output through the molecular prediction network 802 as an output network, and the predicted value may be a classification value and/or a regression value.
For the molecular generation task:
the output network is a molecular generation network; the molecular application input comprises: a fixed identifier; the molecular application output comprises: the molecules represent sequences.
Corresponding to the task of molecular generation of new molecules:
Referring to the left panel of fig. 9, the molecular application input is a fixed identifier, and the molecular application output is the molecular representation sequence of a new molecule, exemplified by a SMILES sequence in fig. 9. The fixed identifier may be [CLS], or it may be a start identifier, or the like; in addition, there may be one or more fixed identifiers, which may include, for example, a start identifier and a stop identifier.
After the fixed identifier is input into the molecule understanding model 901, the SMILES sequence of the new molecule is output through the molecule generating network 902 as an output network.
Corresponding to the task of generating molecules with specific properties:
Referring to the middle panel of fig. 9, the molecular application input is a fixed identifier and information of the specific attribute, and the molecular application output is the molecular representation sequence of a new molecule having the specific attribute.
After the fixed identifier and the specific attribute information are input into the molecular understanding model 901, the embedding layer may convert them into a vector corresponding to the fixed identifier carrying the specific attribute information, and the SMILES sequence of a new molecule having the specific attribute is then output through the molecular generation network 902 serving as the output network.
In fig. 9, the fixed identifier and the vector corresponding to the fixed identifier carrying the specific attribute information are represented with different fill patterns. The vector corresponding to the fixed identifier carrying the specific attribute information may be obtained by multiplying the attribute value of the specific attribute information by the value corresponding to the fixed identifier and converting the product into a vector with an embedding layer; alternatively, the embedding layer may include a character embedding layer and an attribute embedding layer, where the character embedding layer converts the fixed identifier into a fixed identifier vector, the attribute embedding layer converts the attribute value of the specific attribute information into an attribute vector, and the fixed identifier vector and the attribute vector are added to obtain the desired vector.
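The second alternative can be sketched as follows (vocabulary size, attribute-value binning, and hidden size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class IdentifierWithAttributeEmbedding(nn.Module):
    def __init__(self, vocab_size=64, num_attr_values=16, hidden_size=256):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, hidden_size)       # embeds the fixed identifier
        self.attr_embedding = nn.Embedding(num_attr_values, hidden_size)  # embeds the attribute value

    def forward(self, fixed_id, attr_value_id):
        # The sum of the fixed identifier vector and the attribute vector gives the input
        # vector corresponding to the fixed identifier carrying the attribute information.
        return self.char_embedding(fixed_id) + self.attr_embedding(attr_value_id)
```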
Corresponding to the molecular generation task for generating optimized molecules:
Referring to the right panel of fig. 9, the molecular application input is a fixed identifier and the molecular representation sequence to be optimized, and the molecular application output is the optimized molecular representation sequence.
After the fixed identifier and the SMILES sequence to be optimized are input into the molecular understanding model 901, the optimized SMILES sequence is output through the molecular generation network 902 serving as an output network.
In some embodiments, processing the hidden layer output with the output network to obtain the molecular application output includes: searching for the molecular application output corresponding to the hidden layer output using the output network, where the searching comprises a random sampling search or a beam search.
Further, as shown in the left and middle panels of fig. 9, random sampling search may be used when generating a new molecule or a new molecule with a specific attribute, so that a wider range of new molecules can be obtained; as shown in the right panel of fig. 9, beam search may be used when generating optimized molecules, so that more targeted and accurate optimized molecules can be obtained.
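The two strategies can be contrasted with a minimal sketch; beam search is reduced here to its beam-width-1, greedy form purely for brevity:

```python
import torch

def sample_next(probs):
    # Random sampling search: draw the next sequence unit from the predicted
    # distribution, giving a wider variety of generated molecules.
    return int(torch.multinomial(probs, num_samples=1))

def greedy_next(probs):
    # Beam search with beam width 1 (greedy): keep the most probable continuation,
    # giving more targeted, accurate optimized molecules.
    return int(probs.argmax())
```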
In this embodiment, the molecular model is obtained by fine-tuning the molecular understanding model and is applicable to various downstream molecular tasks; performing the molecular generation task based on the fixed identifier can reduce the complexity of molecular generation. In addition, different molecular tasks, such as molecular prediction tasks and/or molecular generation tasks, can be accomplished through different output networks and different molecular application inputs.
Fig. 10 is a schematic diagram according to a tenth embodiment of the present disclosure. The present embodiment provides a training apparatus for a molecular understanding model, as shown in fig. 10, the apparatus 1000 includes: an acquisition module 1001, a processing module 1002 and an update module 1003.
The obtaining module 1001 is configured to obtain pre-training data, where the pre-training data includes: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule; the processing module 1002 is configured to process the first molecular representation sequence sample by using the molecular understanding model to obtain a pre-training output; and the updating module 1003 is configured to calculate a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and update the parameters of the molecular understanding model according to the pre-training loss function.
In some embodiments, the molecular understanding model comprises an encoder and a decoder; the encoder comprises a first self-attention layer, wherein the first self-attention layer adopts a bidirectional self-attention mechanism; and/or the decoder comprises a second self-attention layer, wherein the second self-attention layer adopts a unidirectional self-attention mechanism.
In some embodiments, the encoder further comprises a first shared network portion, and the decoder further comprises a second shared network portion, the first shared network portion and the second shared network portion having the same network structure and network parameters.
In some embodiments, the processing module 1002 is specifically configured to: performing bi-directional self-attention processing on the first molecular representation sequence samples by using the first self-attention layer of the encoder to obtain a bi-directional self-attention processing result; processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output; performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain a unidirectional self-attention processing result; processing the one-way self-attention processing result using the second shared network portion of the decoder to obtain the pre-training output.
In some embodiments, the first molecular representation sequence sample is a SMILES sequence sample; and/or the second molecular representation sequence sample is a SMILES sequence sample.
In this embodiment, two different molecular representation sequence samples of the same molecule are used to train the molecular understanding model, so the characteristics of molecular representation sequences can be fully utilized and the molecular understanding effect of the molecular understanding model can be improved.
Fig. 11 is a schematic diagram according to an eleventh embodiment of the present disclosure. The present embodiment provides a molecular processing apparatus based on a molecular model, where the molecular model includes a molecular understanding model and an output network, the molecular understanding model is obtained by using two different molecular representation sequence samples of the same molecule, and the molecular processing apparatus 1100 includes: a first processing module 1101 and a second processing module 1102.
The first processing module 1101 is configured to process a molecular application input by using the molecular understanding model to obtain a hidden layer output, where the molecular application input includes a fixed identifier when the output network is a molecular generation network; the second processing module 1102 is configured to process the hidden layer output by using the output network to obtain a molecular application output.
In some embodiments, when the output network is a molecule generating network, the molecule application output comprises a molecule representation sequence, wherein, if the molecule generating network is used to generate a new molecule, the molecule representation sequence is a molecule representation sequence of the new molecule; or, if the molecule generation network is used to generate a new molecule with a specific property, the molecule application input further comprises: information of the specific attribute; the molecule representation sequence is a molecule representation sequence of a new molecule having the specific property; or, if the molecule generation network is used to generate optimized molecules, the molecule application input further comprises: the molecule to be optimized represents the sequence; the molecular representation sequence is an optimized molecular representation sequence.
In some embodiments, the output network is a molecular prediction network; the molecular application input includes a molecular representation sequence to be predicted, and the molecular application output includes a predicted value corresponding to the molecular representation sequence to be predicted.
In this embodiment, the molecular model is obtained by fine-tuning the molecular understanding model and is applicable to various downstream molecular tasks; performing the molecular generation task based on the fixed identifier can reduce the complexity of molecular generation.
It is understood that the same or corresponding contents in different embodiments of the present disclosure may be mutually referred, and the contents not described in detail in the embodiments may be referred to the related contents in other embodiments.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200, which can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the electronic device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the electronic device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the electronic device 1200 are connected to the I/O interface 1205, including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1201 performs the various methods and processes described above, such as the training method of the molecular understanding model or the molecular processing method. For example, in some embodiments, the training method of the molecular understanding model or the molecular processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the molecular understanding model training method or the molecular processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured by any other suitable means (e.g., by means of firmware) to perform the molecular processing method or the training method of the molecular understanding model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of training a molecular understanding model, comprising:
obtaining pre-training data, the pre-training data comprising: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule;
processing the first molecular representation sequence sample by using the molecular understanding model to obtain a pre-training output;
and calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function.
2. The method of claim 1, wherein,
the molecular understanding model comprises an encoder and a decoder;
the encoder comprises a first self-attention layer, wherein the first self-attention layer adopts a bidirectional self-attention mechanism; and/or the decoder comprises a second self-attention layer, wherein the second self-attention layer adopts a unidirectional self-attention mechanism.
3. The method of claim 2, wherein,
the encoder further comprises a first shared network portion and the decoder further comprises a second shared network portion, the first shared network portion and the second shared network portion having the same network structure and network parameters.
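As a minimal illustration of claim 3, sharing a single module instance between the encoder and the decoder guarantees that the two shared network portions have the same network structure and network parameters. The module below is a hypothetical feed-forward block in PyTorch; the claims do not prescribe its internal form.

from torch import nn

class SharedFeedForward(nn.Module):
    # A hypothetical feed-forward block used as the shared network portion.
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

shared_portion = SharedFeedForward()
encoder_shared_part = shared_portion   # first shared network portion
decoder_shared_part = shared_portion   # second shared network portion

# Both names refer to the same instance, so structure and parameters coincide.
assert all(p1 is p2 for p1, p2 in zip(encoder_shared_part.parameters(),
                                      decoder_shared_part.parameters()))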
4. The method of claim 3, wherein said processing the first molecular representation sequence samples using the molecular understanding model to obtain a pre-training output comprises:
performing bi-directional self-attention processing on the first molecular representation sequence samples by using the first self-attention layer of the encoder to obtain a bi-directional self-attention processing result;
processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output;
performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain a unidirectional self-attention processing result;
processing the one-way self-attention processing result using the second shared network portion of the decoder to obtain the pre-training output.
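The processing flow of claim 4 can be illustrated, again only as a non-limiting sketch, by a single-layer PyTorch module; the embedding size, head count and the use of nn.MultiheadAttention are assumptions made for the example.

import torch
from torch import nn

class MolecularUnderstandingSketch(nn.Module):
    def __init__(self, vocab_size=128, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One instance used on both sides: the shared network portion.
        self.shared = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)   # first molecular representation sequence sample
        tgt = self.embed(tgt_ids)   # generated output so far (teacher forced)

        # 1) Bidirectional self-attention: every token attends to all tokens.
        enc, _ = self.enc_attn(src, src, src)
        # 2) First shared network portion -> encoded output.
        enc = self.shared(enc)

        # 3) Unidirectional self-attention over the encoded output and the
        #    generated output: target position i may attend to every encoder
        #    position but only to generated positions <= i.
        mem = torch.cat([enc, tgt], dim=1)
        s, t = src_ids.size(1), tgt_ids.size(1)
        causal = torch.triu(
            torch.ones(t, t, device=src_ids.device), diagonal=1
        ).bool()
        mask = torch.cat(
            [torch.zeros(t, s, dtype=torch.bool, device=src_ids.device), causal],
            dim=1,
        )
        dec, _ = self.dec_attn(tgt, mem, mem, attn_mask=mask)

        # 4) Second shared network portion -> pre-training output logits.
        return self.lm_head(self.shared(dec))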
5. The method of any one of claims 1-4,
the first molecular representation sequence sample is a SMILES sequence sample; and/or,
the second molecular representation sequence sample is a SMILES sequence sample.
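Two different SMILES sequence samples of the same molecule can be obtained, for example, by enumerating non-canonical SMILES. The snippet below uses RDKit, which is an assumption made for illustration; the claims do not name any toolkit.

from rdkit import Chem

# Aspirin as an example molecule: a canonical SMILES and a randomized SMILES
# are two different molecular representation sequences of the same molecule,
# usable as the first and second sequence samples respectively.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

first_sample = Chem.MolToSmiles(mol, canonical=True)
second_sample = Chem.MolToSmiles(mol, canonical=False, doRandom=True)

print(first_sample)   # CC(=O)Oc1ccccc1C(=O)O
print(second_sample)  # one of many equivalent, randomly ordered forms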
6. A molecular processing method based on a molecular model, wherein the molecular model comprises a molecular understanding model and an output network, the molecular understanding model being obtained by training with two different molecular representation sequence samples of the same molecule, the molecular processing method comprising:
processing a molecular application input by using the molecular understanding model to obtain a hidden layer output, wherein the molecular application input comprises a fixed identifier when the output network is a molecular generation network;
and processing the hidden layer output by adopting the output network to obtain the molecular application output.
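An illustrative (non-limiting) composition of the molecular model in claim 6: the pre-trained molecular understanding model produces the hidden layer output, and a task-specific output network consumes it. The class and variable names are hypothetical, and the understanding model is assumed to return hidden states of shape (batch, seq_len, d_model).

import torch
from torch import nn

class MolecularModel(nn.Module):
    def __init__(self, understanding_model, output_network):
        super().__init__()
        self.understanding_model = understanding_model
        self.output_network = output_network

    def forward(self, molecular_application_input):
        # Hidden layer output from the molecular understanding model.
        hidden = self.understanding_model(molecular_application_input)
        # Molecular application output from the output network.
        return self.output_network(hidden)

# For a molecule generation network, the molecular application input starts
# from a fixed identifier (a reserved start-token id, assumed here to be 1).
FIXED_ID = 1
generation_input = torch.full((1, 1), FIXED_ID, dtype=torch.long)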
7. The method of claim 6, wherein the molecular application output comprises a molecular representation sequence when the output network is a molecular generation network, wherein,
if the molecule generation network is used to generate a new molecule, the molecular representation sequence is a molecular representation sequence of the new molecule; or,
if the molecule generation network is used to generate a new molecule having a specific property, the molecular application input further comprises: information of the specific property; and the molecular representation sequence is a molecular representation sequence of a new molecule having the specific property; or,
if the molecule generation network is used to generate an optimized molecule, the molecular application input further comprises: a molecular representation sequence of the molecule to be optimized; and the molecular representation sequence is a molecular representation sequence of the optimized molecule.
8. The method of claim 6, wherein,
when the output network is a molecular prediction network, the molecular application input comprises: a molecular representation sequence to be predicted; and the molecular application output comprises: a predicted value corresponding to the molecular representation sequence to be predicted.
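For the prediction case of claim 8, one possible output network is a small regression head over the hidden layer output; mean pooling and a single linear layer are assumptions made for this sketch.

from torch import nn

class MolecularPredictionNetwork(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.head = nn.Linear(d_model, 1)

    def forward(self, hidden):                  # hidden: (batch, seq_len, d_model)
        pooled = hidden.mean(dim=1)             # simple mean pooling (assumption)
        return self.head(pooled).squeeze(-1)    # (batch,) predicted values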
9. A training apparatus for a molecular understanding model, comprising:
an obtaining module, configured to obtain pre-training data, wherein the pre-training data comprises: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequences of the same molecule;
the processing module is used for processing the first molecular representation sequence sample by adopting the molecular understanding model to obtain pre-training output;
and the updating module is used for calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating the parameters of the molecular understanding model according to the pre-training loss function.
10. The apparatus of claim 9, wherein,
the molecular understanding model comprises an encoder and a decoder;
the encoder comprises a first self-attention layer, wherein the first self-attention layer adopts a bidirectional self-attention mechanism; and/or the decoder comprises a second self-attention layer, wherein the second self-attention layer adopts a unidirectional self-attention mechanism.
11. The apparatus of claim 10, wherein,
the encoder further comprises a first shared network portion and the decoder further comprises a second shared network portion, the first shared network portion and the second shared network portion having the same network structure and network parameters.
12. The apparatus of claim 11, wherein the processing module is specifically configured to:
performing bi-directional self-attention processing on the first molecular representation sequence samples by using the first self-attention layer of the encoder to obtain a bi-directional self-attention processing result;
processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output;
performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain a unidirectional self-attention processing result;
processing the one-way self-attention processing result using the second shared network portion of the decoder to obtain the pre-training output.
13. The apparatus of any one of claims 9-12,
the first molecular representation sequence sample is a SMILES sequence sample; and/or,
the second molecular representation sequence sample is a SMILES sequence sample.
14. A molecular processing apparatus based on a molecular model, wherein the molecular model comprises a molecular understanding model and an output network, the molecular understanding model being obtained by training with two different molecular representation sequence samples of the same molecule, the molecular processing apparatus comprising:
a first processing module, configured to process a molecular application input by using the molecular understanding model to obtain a hidden layer output, where the molecular application input includes a fixed identifier when the output network is a molecular generation network;
and the second processing module is used for processing the hidden layer output by adopting the output network so as to obtain the molecular application output.
15. The apparatus of claim 14, wherein the molecular application output comprises a molecular representation sequence when the output network is a molecular generation network, wherein,
if the molecule generation network is used to generate a new molecule, the molecular representation sequence is a molecular representation sequence of the new molecule; or,
if the molecule generation network is used to generate a new molecule having a specific property, the molecular application input further comprises: information of the specific property; and the molecular representation sequence is a molecular representation sequence of a new molecule having the specific property; or,
if the molecule generation network is used to generate an optimized molecule, the molecular application input further comprises: a molecular representation sequence of the molecule to be optimized; and the molecular representation sequence is a molecular representation sequence of the optimized molecule.
16. The apparatus of claim 14, wherein,
when the output network is a molecular prediction network, the molecular application input comprises: a molecular representation sequence to be predicted; and the molecular application output comprises: a predicted value corresponding to the molecular representation sequence to be predicted.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of any one of claims 1-5 or the processing method of any one of claims 6-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the training method of any one of claims 1-5 or the processing method of any one of claims 6-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements a training method according to any one of claims 1-5, or a processing method according to any one of claims 6-8.
CN202110082654.3A 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model Active CN112786108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110082654.3A CN112786108B (en) 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110082654.3A CN112786108B (en) 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model

Publications (2)

Publication Number Publication Date
CN112786108A true CN112786108A (en) 2021-05-11
CN112786108B CN112786108B (en) 2023-10-24

Family

ID=75758044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110082654.3A Active CN112786108B (en) 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model

Country Status (1)

Country Link
CN (1) CN112786108B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105308604A (en) * 2013-04-23 2016-02-03 菲利普莫里斯生产公司 Systems and methods for using mechanistic network models in systems toxicology
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
US20200365270A1 (en) * 2019-05-15 2020-11-19 International Business Machines Corporation Drug efficacy prediction for treatment of genetic disease
CN110598206A (en) * 2019-08-13 2019-12-20 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN110534164A (en) * 2019-09-26 2019-12-03 广州费米子科技有限责任公司 Drug molecule generation method based on deep learning
CN110929869A (en) * 2019-12-05 2020-03-27 同盾控股有限公司 Attention model training method, device, equipment and storage medium
CN111640471A (en) * 2020-05-27 2020-09-08 牛张明 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium
CN112016300A (en) * 2020-09-09 2020-12-01 平安科技(深圳)有限公司 Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MIKOLOV T et al.: "Distributed representations of words and phrases and their compositionality", Advances in Neural Information Processing Systems *
周奇安; 李舟军: "Improved model and tuning method for natural language understanding in BERT-based task-oriented dialogue systems", Journal of Chinese Information Processing (中文信息学报), no. 05 *
李舟军; 范宇; 吴贤杰: "A survey of pre-training techniques for natural language processing", Computer Science (计算机科学), no. 03 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937478A (en) * 2022-05-18 2022-08-23 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN114937478B (en) * 2022-05-18 2023-03-10 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN115565607A (en) * 2022-10-20 2023-01-03 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information
CN115565607B (en) * 2022-10-20 2024-02-23 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information
CN117153294A (en) * 2023-10-31 2023-12-01 烟台国工智能科技有限公司 Molecular generation method of single system
CN117153294B (en) * 2023-10-31 2024-01-26 烟台国工智能科技有限公司 Molecular generation method of single system

Also Published As

Publication number Publication date
CN112786108B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN112487173B (en) Man-machine conversation method, device and storage medium
CN112786108B (en) Training method, device, equipment and medium of molecular understanding model
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN113239157B (en) Method, device, equipment and storage medium for training conversation model
CN112861548A (en) Natural language generation and model training method, device, equipment and storage medium
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN113641805A (en) Acquisition method of structured question-answering model, question-answering method and corresponding device
CN115640520A (en) Method, device and storage medium for pre-training cross-language cross-modal model
CN113642324B (en) Text abstract generation method and device, electronic equipment and storage medium
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN112989797B (en) Model training and text expansion methods, devices, equipment and storage medium
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN113204616B (en) Training of text extraction model and text extraction method and device
CN115757788A (en) Text retouching method and device and storage medium
CN112905917B (en) Inner chain generation method, model training method, related device and electronic equipment
CN115577705A (en) Method, device and equipment for generating text processing model and storage medium
CN115357710A (en) Training method and device for table description text generation model and electronic equipment
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114841172A (en) Knowledge distillation method, apparatus and program product for text matching double tower model
CN114416941A (en) Generation method and device of dialogue knowledge point determination model fusing knowledge graph
CN113886543A (en) Method, apparatus, medium, and program product for generating an intent recognition model
CN113553413A (en) Dialog state generation method and device, electronic equipment and storage medium
CN113033179A (en) Knowledge acquisition method and device, electronic equipment and readable storage medium
CN113553833A (en) Text error correction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant